Setup
*****

-- 100 million row table, 4977 MB overall size, high cardinality int4 attribute
-- to build index on. Already entirely in shared_buffers in all tests (used
-- pg_prewarm for that, and took a few other noise-avoidance precautions):
CREATE TABLE big_high_cardinality_int4 AS
  SELECT (random() * 2000000000)::int4 s, 'abcdefghijlmn'::text junk
  FROM generate_series(1, 100000000);

Table details:

postgres=# \dt+ big_high_cardinality_int4
                         List of relations
 Schema |           Name            | Type  | Owner |  Size   | Description
--------+---------------------------+-------+-------+---------+-------------
 public | big_high_cardinality_int4 | table | pg    | 4977 MB |
(1 row)
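For completeness, the prewarming and instrumentation steps were along
these lines. This is only a rough sketch of the session settings, not a
verbatim transcript; it assumes the pg_prewarm extension is available
and a build where the trace_sort GUC exists (TRACE_SORT, defined by
default):

-- Rough sketch of the noise-avoidance / instrumentation steps (not a
-- verbatim transcript of the actual session):
CREATE EXTENSION IF NOT EXISTS pg_prewarm;

-- Pull the whole heap into shared_buffers ahead of each test:
SELECT pg_prewarm('big_high_cardinality_int4');

-- Make the trace_sort output shown below visible in the psql session:
SET trace_sort = on;
SET client_min_messages = 'log';

(\timing was also enabled in psql, which is where the "Time:" figures
below come from.)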
Master branch (internal sort)
*****************************

With just enough memory to do an internal sort (master branch, so no
explicit memory prefetching):

postgres=# SET maintenance_work_mem = '5400MB';
SET
Time: 0.215 ms
postgres=# CREATE INDEX ON big_high_cardinality_int4(s);
CREATE INDEX
Time: 66628.886 ms <---------

trace_sort output
-----------------

begin index sort: unique = f, workMem = 5529600, randomAccess = f
performsort starting: CPU 0.65s/13.50u sec elapsed 14.16 sec
performsort done: CPU 0.65s/43.07u sec elapsed 43.74 sec
internal sort ended, 5494829 KB used: CPU 6.92s/56.64u sec elapsed 66.41 sec <---------

Patch series (external sort)
****************************

postgres=# set maintenance_work_mem = '1GB';
SET
Time: 0.383 ms
postgres=# CREATE INDEX ON big_high_cardinality_int4(s);
CREATE INDEX
Time: 78564.853 ms <---------

trace_sort output
-----------------

begin index sort: unique = f, workMem = 1048576, randomAccess = f
switching to external sort with 3745 tapes: CPU 0.87s/2.53u sec elapsed 3.41 sec
hybrid sort-merge strategy used at row 19173960 crossover 0.825 (est 100000000.00 rows 5.22 runs)
starting quicksort of run 1: CPU 0.87s/2.53u sec elapsed 3.41 sec
finished quicksorting run 1: CPU 0.87s/7.54u sec elapsed 8.42 sec
finished writing run 1 to tape 0: CPU 1.09s/8.70u sec elapsed 9.81 sec
starting quicksort of run 2: CPU 1.09s/12.30u sec elapsed 13.41 sec
finished quicksorting run 2: CPU 1.09s/16.52u sec elapsed 17.63 sec
finished writing run 2 to tape 1: CPU 1.34s/17.46u sec elapsed 18.98 sec
starting quicksort of run 3: CPU 1.34s/21.08u sec elapsed 22.60 sec
finished quicksorting run 3: CPU 1.34s/25.31u sec elapsed 26.83 sec
finished writing run 3 to tape 2: CPU 1.51s/26.33u sec elapsed 28.02 sec
starting quicksort of run 4: CPU 1.51s/29.95u sec elapsed 31.64 sec
finished quicksorting run 4: CPU 1.51s/34.17u sec elapsed 35.86 sec
finished writing run 4 to tape 3: CPU 1.67s/35.17u sec elapsed 37.03 sec
starting quicksort of run 5: CPU 1.67s/38.80u sec elapsed 40.66 sec
finished quicksorting run 5: CPU 1.67s/43.05u sec elapsed 44.91 sec
finished writing run 5 to tape 4: CPU 1.83s/44.06u sec elapsed 46.08 sec
performsort starting: CPU 1.83s/47.55u sec elapsed 49.57 sec
starting quicksort of run 6: CPU 1.83s/47.55u sec elapsed 49.57 sec
finished quicksorting run 6: CPU 1.83s/51.62u sec elapsed 53.64 sec
finished writing run 6 to tape 5: CPU 1.99s/52.59u sec elapsed 54.76 sec
performsort done (except 6-way final merge): CPU 2.55s/53.13u sec elapsed 55.87 sec
external sort ended, 244373 disk blocks used: CPU 9.05s/68.92u sec elapsed 78.50 sec <---------

Conclusions
***********

Index details:

postgres=# \di+ big_high_cardinality_int4_s_idx
                                           List of relations
 Schema |              Name               | Type  | Owner |           Table           |  Size   | Description
--------+---------------------------------+-------+-------+---------------------------+---------+-------------
 public | big_high_cardinality_int4_s_idx | index | pg    | big_high_cardinality_int4 | 2142 MB |
(1 row)

Only an 18% overhead relative to master's internal sort performance
(i.e. total CREATE INDEX duration) is seen here. Obviously, the patch
series uses a small fraction of the maintenance_work_mem that master
gets, so the two runs are not directly comparable, but internal sort
performance seems close to the ideal to aim for. Note that I avoided
giving the baseline internal sort the benefit of explicit memory
prefetching, by actually using the master branch.

Presumably, a smaller difference (more favorable to the patch) would be
observed with a logged table, but that was avoided. Although an unlogged
table was used, it was stored on a regular, durable ext4 partition on an
SSD (which uses LVM + disk encryption), and that is where the index
ended up too (although the index would not have been fsync()'d, since it
was unlogged). tmpfs probably competed fairly aggressively with that SSD
for I/O, since tmpfs falls back on swap space and there is only 16 GiB
of memory on this system, so I'm willing to believe that the comparison
could be even closer on a real server. This is just how one simple,
representative case worked out on my laptop, which lacks real I/O
parallelism.

Note that the time spent writing out runs is actually really low here
(see the trace_sort output), even though the temp_tablespaces tmpfs
partition is ultimately backed only by a consumer-grade mobile SSD. OS
filesystem caching must help here, but more I/O bandwidth would almost
certainly make the patch series faster for this case. This result was
obtained without going to the trouble of running tests on hardware that
plays to the strengths of the new patch. More favorable CREATE INDEX
numbers seem quite possible, but that was not a priority ahead of patch
submission. A big sort (e.g. one that requires 200GB of temp files
before the merge phase) on a server with lots of memory and plenty of
temp_tablespaces I/O bandwidth would probably be more interesting.

The merge phase's new cache efficiency helps at a point where
comparisons are generally more likely to resolve as equal: often full
tuple equality, but especially equality on earlier attributes. I imagine
this helps a lot with multi-attribute CREATE INDEX cases that have
moderate cardinality leading attributes. It must also help with the TID
tie-breaker within comparetup_index_btree(), since pointer chasing need
only read from an IndexTuple that is already in L1 cache. The fact that
second-or-subsequent attributes are not directly represented within each
SortTuple becomes significantly less important at exactly the right
time.
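To make that last point concrete, the kind of case I have in mind looks
something like the sketch below. This is purely illustrative and was not
benchmarked for this message; the table and column names are made up:

-- Hypothetical illustration only (not benchmarked here): a moderate
-- cardinality leading attribute (~1000 distinct values) in front of a
-- high cardinality attribute.  During the merge, many comparisons on
-- "category" resolve as equal, so the comparison falls through to "s"
-- and, when both attributes are equal, to the TID tie-breaker.  Those
-- are exactly the comparisons that benefit from the IndexTuple already
-- being in cache.
CREATE TABLE moderate_card_example AS
  SELECT (random() * 1000)::int4 category,
         (random() * 2000000000)::int4 s
  FROM generate_series(1, 100000000);

CREATE INDEX ON moderate_card_example (category, s);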