Re: Using quicksort for every external sort run

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Subject: Re: Using quicksort for every external sort run
Date: 2016-03-08 06:57:38
Message-ID: CAM3SWZR-0uURKesmr8+mWR=vieqaRKsM6_tbRFzn9rw-GO6pCw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Feb 15, 2016 at 3:45 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
> I was thinking about this over the past couple weeks. I'm starting to
> think the quicksort runs gives at least the beginnings of a way
> forward on this front.

As I've already pointed out several times, I wrote a tool that makes
it easy to load sortbenchmark.org data into a PostgreSQL table:

https://github.com/petergeoghegan/gensort

(You should use the Python script that invokes the "gensort" utility
-- see its "--help" display for details).

This seems useful as a standard benchmark, since it's perfectly
deterministic, allowing the user to create arbitrarily large tables to
use for sort benchmarks. Still, it doesn't produce data that is in any
way organic; the sort data is uniformly distributed. Also, it produces
a table that really only has one attribute to sort on, a text
attribute.

I suggest looking at real-world data, too. I have downloaded UK land
registry data, a freely available dataset about property sales in the
UK since the 1990s, of which there have apparently been about 20
million (I started with a 20-million-line CSV file). I've used COPY to
load the data into a single PostgreSQL table.
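The load itself can be sketched roughly as follows. This is only an
illustration, not the exact command from the attached instructions:
the file path is a placeholder, and the assumption that the CSV column
order matches the table's column order is mine.

```sql
-- Sketch only: the actual table DDL and load steps are in the attached
-- land-registry-data.txt. The path here is a placeholder, and the CSV
-- is assumed to match the table's column order.
COPY land_registry_price_paid_uk FROM '/path/to/land-registry.csv'
  WITH (FORMAT csv);
```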

I attach instructions on how to recreate this, and some suggested
CREATE INDEX statements that seemed representative to me. There are a
variety of Postgres data types in use, including UUID, numeric, and
text. The final Postgres table is just under 3GB. I will privately
make available a URL that those CC'd here can use to download a custom
format dump of the table, which comes in at 1.1GB (ask me off-list if
you'd like to get that URL, but weren't CC'd here). This URL is
provided as a convenience for reviewers, who can skip my detailed
instructions.
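To give a flavor of what "representative" means here, a sketch of one
such index follows. The actual suggested CREATE INDEX statements are
in the attachment; the column choice below is simply borrowed from the
rollup query's grouping columns, so treat it as an assumption:

```sql
-- Illustrative only; see the attached land-registry-data.txt for the
-- actual suggested statements. Columns borrowed from the rollup query.
CREATE INDEX ON land_registry_price_paid_uk (county, district, city);
```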

An expensive rollup() query on the "land_registry_price_paid_uk" table
is interesting. Example:

select date_trunc('year', transfer_date), county, district, city,
       sum(price)
from land_registry_price_paid_uk
group by rollup (1, county, district, city);

Performance is within ~5% of an *internal* sort with the patch series
applied, even though, in the internal sort case, ~80% of the overall
time is spent copying and sorting SortTuples (the internal case cannot
overlap sorting and aggregate processing, since it has no final merge
step). This is a nice demonstration of how this work has significantly
blurred the line between internal and external sorts.

--
Peter Geoghegan

Attachment Content-Type Size
land-registry-data.txt text/plain 6.1 KB
