While not natively supported in Postgres 9.5, extensions exist for HyperLogLog and columnar storage. Here I benchmark how they compare on some simple cases: a billion integers across 250 million rows.

For perspective, that would roughly translate to 8 rows per second, of 4 integers each, accumulated over the span of one year.
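
A quick back-of-the-envelope check of that rate, runnable in any psql session (not part of the original benchmark):

> SELECT 250000000.0 / (365 * 24 * 60 * 60) AS rows_per_second;
-- roughly 7.9 rows per second, or about 32 integers per second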

Hardware

4 GB RAM, 80 GB SSD, dual-core cloud VPS

Environment

This is based on a fresh Ubuntu 16.04 LTS install.

> apt-get install build-essential postgresql postgresql-contrib postgresql-server-dev-9.5 protobuf-c-compiler libprotobuf-c0-dev

An extension is required for Postgres HyperLogLog support.

> git clone https://github.com/citusdata/postgresql-hll.git
Cloning into 'postgresql-hll' ...
> cd postgresql-hll
root@hlltest:~/postgresql-hll#
> make install
/bin/mkdir -p '/usr/lib/postgresql/9.5/lib' ..

Similarly, column store support requires another extension.

> git clone https://github.com/citusdata/cstore_fdw.git
Cloning into 'cstore_fdw' ...
> cd cstore_fdw
root@hlltest:~/cstore_fdw#
> PATH=/usr/local/pgsql/bin/:$PATH make
gcc -Wall -Wmissing-prototypes ...
> PATH=/usr/local/pgsql/bin/:$PATH make install
/bin/mkdir -p '/usr/lib/postgresql/9.5/lib' ...
> vim /etc/postgresql/9.5/main/postgresql.conf
shared_preload_libraries = 'cstore_fdw' # (change requires restart)
> systemctl restart postgresql

Once the environment is set up we switch to a Postgres REPL and enable the probabilistic and column store extensions.

> sudo -i -u postgres
postgres@hlltest:~$
> psql
postgres=#
> CREATE EXTENSION hll;
CREATE EXTENSION
> CREATE EXTENSION cstore_fdw;
CREATE EXTENSION

Every benchmark was run with two different configurations. The “low memory” case is the stock default configuration. The “high memory” case is the same with work_mem and shared_buffers increased.

# default (low memory)
shared_buffers: 128MB
work_mem: 4MB

# high memory
shared_buffers: 1GB
work_mem: 256MB
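
The high memory settings can be applied the same way as the cstore_fdw preload above (a sketch; ALTER SYSTEM would work just as well):

> vim /etc/postgresql/9.5/main/postgresql.conf
shared_buffers = 1GB   # requires restart; default is 128MB
work_mem = 256MB       # default is 4MB
> systemctl restart postgresql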

Test Data

The test table has 4 columns of integers. This should lead to clear differences when working with the column store versus the row store. The Postgres integer type is 4 bytes, bringing the raw data size of each row to 16 bytes.

> CREATE TABLE numbers(a integer, b integer, c integer, d integer);
CREATE TABLE
> INSERT INTO numbers
  SELECT n AS a, n AS b, n AS c, n AS d
  FROM generate_series(1, 5) AS n;
INSERT 0 5

> SELECT * FROM numbers;
 a | b | c | d 
---|---|---|---
 1 | 1 | 1 | 1
 2 | 2 | 2 | 2
 3 | 3 | 3 | 3
 4 | 4 | 4 | 4
 5 | 5 | 5 | 5
(5 rows)

> DELETE FROM numbers;
DELETE 5

> INSERT ..
INSERT 0 250000000

For the test data each row contains increasing numbers, so every value within a column is unique. We insert 250 million rows, i.e. 1 billion integers.
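
The bulk insert itself is elided above; presumably it extends the earlier generate_series example over the full range. A sketch, along with a size check that was not part of the original run (250 million rows at 16 bytes is roughly 4 GB of raw data, though the heap table is larger due to per-row overhead):

-- presumed form of the elided bulk insert, not the author's exact statement
> INSERT INTO numbers
  SELECT n AS a, n AS b, n AS c, n AS d
  FROM generate_series(1, 250000000) AS n;
> SELECT pg_size_pretty(pg_table_size('numbers'));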

> CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;
CREATE SERVER

> CREATE FOREIGN TABLE numberscol(a integer, b integer, c integer, d integer) SERVER cstore_server;
CREATE FOREIGN TABLE

> INSERT INTO numberscol SELECT * FROM numbers;
INSERT 0 250000000

The column store is a totally different table despite containing the same data. We copy the standard row store into the column store.
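
cstore_fdw compresses column data on disk; its cstore_table_size() helper (described in the cstore_fdw README) can compare the footprint against the heap table. No size figures were recorded for this run, but the check looks like:

> SELECT pg_size_pretty(cstore_table_size('numberscol'));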

> VACUUM ANALYZE;
VACUUM

> \timing
timing on

Benchmarks

The first test is a naive row count on the row store table.

> SELECT COUNT(*) FROM numbers;
 count   
-----------
 250000000
(1 row)

9.5 - low memory:
Time: 24680.949 ms (24.68 seconds)
Time: 24299.386 ms (24.29 seconds)
9.5 - high memory:
Time: 23654.839 ms (23.65 seconds)
Time: 23862.691 ms (23.86 seconds)
10.4 - low memory:
Time: 42735.018 ms (00:42.735)
Time: 43440.366 ms (00:43.440)

Counting distinct elements is a substantially more complicated operation, and that is reflected in these results. Because every value in our dataset is unique, this represents something of a worst case for the operation. Interestingly, the high memory configuration does WORSE than the default.

> SELECT COUNT(DISTINCT a) FROM numbers;
   count
-----------
 250000000
(1 row)

9.5 - low memory:
Time: 156700.367 ms (156.70 seconds)
Time: 156720.367 ms (156.72 seconds)
9.5 - high memory:
Time: 191722.235 ms (191.72 seconds)
Time: 191692.405 ms (191.69 seconds)
10.4 - low memory:
Time: 169099.920 ms (02:49.100)
Time: 167650.529 ms (02:47.651)

This is the probabilistic equivalent of the above count distinct. Notice that the result is not exact, but it is fairly accurate even over 250 million rows (within about 1.1%). Most importantly, this is roughly 2.6 times faster than the naive count distinct.

> SELECT #hll_add_agg(hll_hash_integer(a)) FROM numbers;
     ?column?     
------------------
 252805209.899419
(1 row)

low memory:
Time: 59015.349 ms (59.01 seconds)
Time: 58908.294 ms (58.90 seconds)
high memory:
Time: 58471.748 ms (58.47 seconds)
Time: 58019.861 ms (58.01 seconds)
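
A property of HLL that this benchmark does not exercise is that the sketch itself can be stored and merged later with hll_union_agg, so the expensive scan only has to happen once. A minimal sketch using a hypothetical rollup table (not part of the benchmark):

> CREATE TABLE numbers_hll (part integer, agg hll);
> INSERT INTO numbers_hll
  SELECT a % 10, hll_add_agg(hll_hash_integer(a)) FROM numbers GROUP BY 1;
-- later, the total distinct count comes from merging the stored sketches
> SELECT #hll_union_agg(agg) FROM numbers_hll;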

Now we can take a look at the column store. The first test is again the standard row count. This result is roughly 3 seconds faster than the row store version.

> SELECT COUNT(*) FROM numberscol;
   count   
-----------
 250000000
(1 row)

low memory:
Time: 21362.647 ms (21.36 seconds)
Time: 21074.633 ms (21.07 seconds)
high memory:
Time: 21662.268 ms (21.66 seconds)
Time: 21623.051 ms (21.62 seconds)

Naive count distinct is around 10 seconds faster on the column store.

> SELECT COUNT(DISTINCT a) FROM numberscol;
   count
-----------
 250000000
(1 row)

low memory:
Time: 145933.710 ms (145.93 seconds)
Time: 147137.995 ms (147.13 seconds)
high memory:
Time: 177630.437 ms (177.63 seconds)
Time: 178484.288 ms (178.48 seconds)

Finally, the combination of HyperLogLog and the column store. The column store version of hll is about 5 seconds faster than the hll query on the row store, and 2.9 times faster than the precise row store count distinct.

> SELECT #hll_add_agg(hll_hash_integer(a)) FROM numberscol;
     ?column?     
------------------
 252805209.899419
(1 row)

low memory:
Time: 54251.376 ms (54.25 seconds)
Time: 53785.085 ms (53.78 seconds)
high memory:
Time: 60465.237 ms (60.46 seconds)
Time: 60363.777 ms (60.36 seconds)

Analysis

For this particular simple query pattern it is clear that HLL and columnar storage are beneficial. Under more traditional OLTP workloads these designs would likely be a hindrance. That being said, microbenchmarks are commonly misleading, and more experimentation with workloads and configurations is always worthwhile.

Despite running on a multicore system, Postgres only ever used a single core per query. This is a limitation of Postgres 9.5; Postgres 10 has more advanced parallel query support. HyperLogLog based counting can easily be parallelized, so this limitation of Postgres 9.5 is a severe hindrance to performance.
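
For reference, parallel aggregation in Postgres 9.6 and 10 is controlled by settings along these lines; none of the runs above used them:

# postgresql.conf on Postgres 10 (illustrative values)
max_parallel_workers_per_gather = 2   # parallel workers per Gather node, 0 disables
max_parallel_workers = 4              # cluster-wide cap on parallel workers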

In the future I may replicate the extension tests on Postgres 10 and see if there is any substantial difference. The extensions do not officially list Postgres 10 as compatible, so I would need to verify the stability of the integration first. Should they be compatible, the parallel query support would likely be very beneficial.