Setup
*****

-- 100 million row table, 4977 MB overall size, high cardinality int4 attribute
-- to build index on. Already entirely in shared_buffers in all tests (used
-- pg_prewarm for that, and took a few other noise-avoidance precautions):
CREATE TABLE big_high_cardinality_int4 AS
  SELECT (random() * 2000000000)::int4 s, 'abcdefghijlmn'::text junk
  FROM generate_series(1, 100000000);

Table details:

postgres=# \dt+ big_high_cardinality_int4
                         List of relations
 Schema |           Name            | Type  | Owner |  Size   | Description
--------+---------------------------+-------+-------+---------+-------------
 public | big_high_cardinality_int4 | table | pg    | 4977 MB |
(1 row)
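For completeness, the prewarming and instrumentation steps were along
these lines. This is only a rough sketch of the session settings, not a
verbatim transcript; it assumes the pg_prewarm extension is available
and a build where the trace_sort GUC exists (TRACE_SORT, defined by
default):

-- Rough sketch of the noise-avoidance / instrumentation steps (not a
-- verbatim transcript of the actual session):
CREATE EXTENSION IF NOT EXISTS pg_prewarm;

-- Pull the whole heap into shared_buffers ahead of each test:
SELECT pg_prewarm('big_high_cardinality_int4');

-- Make the trace_sort output shown below visible in the psql session:
SET trace_sort = on;
SET client_min_messages = 'log';

(\timing was also enabled in psql, which is where the "Time:" figures
below come from.)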
Master branch (internal sort)
*****************************

With just enough memory to do an internal sort (master branch, so no
explicit memory prefetching):

postgres=# SET maintenance_work_mem = '5400MB';
SET
Time: 0.215 ms
postgres=# CREATE INDEX ON big_high_cardinality_int4(s);
CREATE INDEX
Time: 66628.886 ms <---------

trace_sort output
-----------------

begin index sort: unique = f, workMem = 5529600, randomAccess = f
performsort starting: CPU 0.65s/13.50u sec elapsed 14.16 sec
performsort done: CPU 0.65s/43.07u sec elapsed 43.74 sec
internal sort ended, 5494829 KB used: CPU 6.92s/56.64u sec elapsed 66.41 sec <---------

Patch series (external sort)
****************************

postgres=# set maintenance_work_mem = '1GB';
SET
Time: 0.383 ms
postgres=# CREATE INDEX ON big_high_cardinality_int4(s);
CREATE INDEX
Time: 78564.853 ms <---------

trace_sort output
-----------------

begin index sort: unique = f, workMem = 1048576, randomAccess = f
switching to external sort with 3745 tapes: CPU 0.87s/2.53u sec elapsed 3.41 sec
hybrid sort-merge strategy used at row 19173960 crossover 0.825 (est 100000000.00 rows 5.22 runs)
starting quicksort of run 1: CPU 0.87s/2.53u sec elapsed 3.41 sec
finished quicksorting run 1: CPU 0.87s/7.54u sec elapsed 8.42 sec
finished writing run 1 to tape 0: CPU 1.09s/8.70u sec elapsed 9.81 sec
starting quicksort of run 2: CPU 1.09s/12.30u sec elapsed 13.41 sec
finished quicksorting run 2: CPU 1.09s/16.52u sec elapsed 17.63 sec
finished writing run 2 to tape 1: CPU 1.34s/17.46u sec elapsed 18.98 sec
starting quicksort of run 3: CPU 1.34s/21.08u sec elapsed 22.60 sec
finished quicksorting run 3: CPU 1.34s/25.31u sec elapsed 26.83 sec
finished writing run 3 to tape 2: CPU 1.51s/26.33u sec elapsed 28.02 sec
starting quicksort of run 4: CPU 1.51s/29.95u sec elapsed 31.64 sec
finished quicksorting run 4: CPU 1.51s/34.17u sec elapsed 35.86 sec
finished writing run 4 to tape 3: CPU 1.67s/35.17u sec elapsed 37.03 sec
starting quicksort of run 5: CPU 1.67s/38.80u sec elapsed 40.66 sec
finished quicksorting run 5: CPU 1.67s/43.05u sec elapsed 44.91 sec
finished writing run 5 to tape 4: CPU 1.83s/44.06u sec elapsed 46.08 sec
performsort starting: CPU 1.83s/47.55u sec elapsed 49.57 sec
starting quicksort of run 6: CPU 1.83s/47.55u sec elapsed 49.57 sec
finished quicksorting run 6: CPU 1.83s/51.62u sec elapsed 53.64 sec
finished writing run 6 to tape 5: CPU 1.99s/52.59u sec elapsed 54.76 sec
performsort done (except 6-way final merge): CPU 2.55s/53.13u sec elapsed 55.87 sec
external sort ended, 244373 disk blocks used: CPU 9.05s/68.92u sec elapsed 78.50 sec <---------

Conclusions
***********

Index details:

postgres=# \di+ big_high_cardinality_int4_s_idx
                                           List of relations
 Schema |              Name               | Type  | Owner |           Table           |  Size   | Description
--------+---------------------------------+-------+-------+---------------------------+---------+-------------
 public | big_high_cardinality_int4_s_idx | index | pg    | big_high_cardinality_int4 | 2142 MB |
(1 row)

Only an 18% overhead relative to master's internal sort performance
(i.e. total CREATE INDEX duration) is seen here. Obviously, the patch
series uses a small fraction of the maintenance_work_mem that master
gets, so the two runs are not directly comparable, but internal sort
performance seems close to the ideal to aim for. Note that I avoided
giving the baseline internal sort the benefit of explicit memory
prefetching, by actually using the master branch.

Presumably, a smaller difference (more favorable to the patch) would be
observed with a logged table, but that was avoided. Although an unlogged
table was used, it was stored on a regular, durable ext4 partition on an
SSD (which uses LVM + disk encryption), and that is where the index
ended up too (although the index would not have been fsync()'d, since it
was unlogged). tmpfs probably competed fairly aggressively with that SSD
for I/O, since tmpfs falls back on swap space and there is only 16 GiB
of memory on this system, so I'm willing to believe that the comparison
could be even closer on a real server. This is just how one simple,
representative case worked out on my laptop, which lacks real I/O
parallelism.

Note that the time spent writing out runs is actually really low here
(see the trace_sort output), even though the temp_tablespaces tmpfs
partition is ultimately backed only by a consumer-grade mobile SSD. OS
filesystem caching must help here, but more I/O bandwidth would almost
certainly make the patch series faster for this case. This result was
obtained without going to the trouble of running tests on hardware that
plays to the strengths of the new patch. More favorable CREATE INDEX
numbers seem quite possible, but that was not a priority ahead of patch
submission. A big sort (e.g. one that requires 200GB of temp files
before the merge phase) on a server with lots of memory and plenty of
temp_tablespaces I/O bandwidth would probably be more interesting.

The merge phase's new cache efficiency helps at a point where
comparisons are generally more likely to resolve as equal: often full
tuple equality, but especially equality on earlier attributes. I imagine
this helps a lot with multi-attribute CREATE INDEX cases that have
moderate cardinality leading attributes. It must also help with the TID
tie-breaker within comparetup_index_btree(), since pointer chasing need
only read from an IndexTuple that is already in L1 cache. The fact that
second-or-subsequent attributes are not directly represented within each
SortTuple becomes significantly less important at exactly the right
time.
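To make that last point concrete, the kind of case I have in mind looks
something like the sketch below. This is purely illustrative and was not
benchmarked for this message; the table and column names are made up:

-- Hypothetical illustration only (not benchmarked here): a moderate
-- cardinality leading attribute (~1000 distinct values) in front of a
-- high cardinality attribute.  During the merge, many comparisons on
-- "category" resolve as equal, so the comparison falls through to "s"
-- and, when both attributes are equal, to the TID tie-breaker.  Those
-- are exactly the comparisons that benefit from the IndexTuple already
-- being in cache.
CREATE TABLE moderate_card_example AS
  SELECT (random() * 1000)::int4 category,
         (random() * 2000000000)::int4 s
  FROM generate_series(1, 100000000);

CREATE INDEX ON moderate_card_example (category, s);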