From: "Giacchino, Luca" <luca(dot)giacchino(at)intel(dot)com>
To: "R, Rakshit" <rakshit(dot)r(at)intel(dot)com>, Andrei Lepikhov <lepihov(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com>, "Devulapalli, Raghuveer" <raghuveer(dot)devulapalli(at)intel(dot)com>, "Paul, Sourav Kumar" <sourav(dot)kumar(dot)paul(at)intel(dot)com>
Subject: RE: Proposal for optimizations with simd enabled sort
Date: 2025-07-02 18:26:56
Message-ID: CO6PR11MB5620FEE1EEF32F5E1242962F9540A@CO6PR11MB5620.namprd11.prod.outlook.com
Lists: pgsql-hackers

Hi All,

We'd like to share updated performance results for tuple sort after improving the benchmark to better stress the sort itself. We added an OFFSET to the query to eliminate the contribution of data movement to the client (following https://www.postgresql.org/message-id/flat/CANWCAZbAmaZ7P%2BARjS97sJLXsBB5CPZyzFgqNDiqe-L%2BBqXzug%40mail.gmail.com). With this change, flame graphs confirm that the sort contribution is much larger than in our previous setup (e.g., for 100k rows/100k distinct values using quicksort, tuplesort_performsort now accounts for ~60% of cycles, up from ~10%). We also added data points at 10k and 50k rows.

Setup:
- Randomly generated integers, with varying row count and cardinality (controlled by number of distinct values)
- ORDER BY query with OFFSET equal to number of rows
- Query executed by pgbench (1 job, 1 client) measuring tps (transactions per second)
- Data collected on AWS m7i.metal-24xl, Ubuntu 24.04, gcc13
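The setup above can be sketched as follows. This is our own illustration, not the authors' actual scripts: the table name (sort_bench), column name, and the helper functions are hypothetical, and the generated SQL is only one plausible way to build the random-integer table and the ORDER BY ... OFFSET transaction that pgbench would replay.

```python
# Illustrative sketch of the benchmark described above. Table/column names
# and parameter values are assumptions, not taken from the original post.

def setup_sql(rows: int, distinct: int) -> str:
    """SQL that builds a table of `rows` random integers drawn from
    roughly `distinct` possible values (cardinality control)."""
    return (
        "DROP TABLE IF EXISTS sort_bench;\n"
        "CREATE TABLE sort_bench AS\n"
        f"  SELECT (random() * {distinct})::int AS v\n"
        f"  FROM generate_series(1, {rows});"
    )

def bench_sql(rows: int) -> str:
    """The pgbench transaction: ORDER BY with OFFSET equal to the row
    count, so no rows reach the client and data movement is eliminated."""
    return f"SELECT v FROM sort_bench ORDER BY v OFFSET {rows};"

# Example: the 100k rows / 10k distinct values configuration.
print(setup_sql(100_000, 10_000))
print(bench_sql(100_000))
```

The bench_sql output would be saved to a script file and driven with something like `pgbench -n -f script.sql -j 1 -c 1`, matching the 1-job/1-client configuration above.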

The results below report the measured tps gains for SIMD-enabled sort vs. quicksort.

10k rows, 10k distinct values: +92.0%
10k rows, 1k distinct values: +61.5%
10k rows, 100 distinct values: +31.3%
10k rows, 10 distinct values: +10.9%

50k rows, 50k distinct values: +101.2%
50k rows, 10k distinct values: +85.8%
50k rows, 100 distinct values: +33.3%
50k rows, 10 distinct values: +10.7%

100k rows, 100k distinct values: +92.0%
100k rows, 10k distinct values: +67.8%
100k rows, 100 distinct values: +16.1%
100k rows, 10 distinct values: -5.8%

1M rows, 1M distinct values: +49.1%
1M rows, 10k distinct values: +12.1%
1M rows, 100 distinct values: -17.5%
1M rows, 10 distinct values: -31.2%

In summary, no regressions are observed up to 50k rows. At 100k rows, a regression appears in the very low-cardinality case, and the trend continues at 1M rows.
It is easy to add dispatching logic at query execution time based on row count (to select SIMD sort or quicksort). Conditions based on cardinality may be more complex to implement. We'd appreciate any recommendations on planning-time or execution-time statistics we could use. We propose to start with the row count condition.
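A row-count-only dispatch could look like the following minimal sketch. The threshold value is a placeholder we picked by eyeballing the numbers above, not a value proposed in the thread, and the function name is hypothetical; in the backend this decision would live in C, keyed on the planner's row estimate.

```python
# Hypothetical dispatch rule: use SIMD sort up to a row-count threshold,
# fall back to quicksort above it. SIMD_SORT_MAX_ROWS is a placeholder
# suggested by the measurements above, not a tuned or proposed value.
SIMD_SORT_MAX_ROWS = 100_000

def choose_sort(estimated_rows: int) -> str:
    """Pick a sort implementation from the estimated input size."""
    if estimated_rows <= SIMD_SORT_MAX_ROWS:
        return "simd"
    return "quicksort"
```

A cardinality condition would need a distinct-values estimate at the same decision point, which is why it is harder to wire in than the row count alone.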
We are developing additional optimizations to improve performance for the 1M+ rows cases. We'll also explore different data distributions and share results as they are available.

Thanks!

Luca Giacchino
