
Re: [PERFORM] A Better External Sort?

From: "Dann Corbit" <DCorbit(at)connx(dot)com>
To: "Ron Peacetree" <rjpeace(at)earthlink(dot)net>,<pgsql-hackers(at)postgresql(dot)org>, <pgsql-performance(at)postgresql(dot)org>
Subject: Re: [PERFORM] A Better External Sort?
Date: 2005-09-26 21:13:02
Message-ID: D425483C2C5C9F49B5B7A41F8944154757D10E@postal.corporate.connx.com
Lists: pgsql-hackers

> -----Original Message-----
> From: pgsql-hackers-owner(at)postgresql(dot)org [mailto:pgsql-hackers-
> owner(at)postgresql(dot)org] On Behalf Of Ron Peacetree
> Sent: Monday, September 26, 2005 10:47 AM
> To: pgsql-hackers(at)postgresql(dot)org; pgsql-performance(at)postgresql(dot)org
> Subject: [HACKERS] [PERFORM] A Better External Sort?
> 
> >From: Ron Peacetree <rjpeace(at)earthlink(dot)net>
> >Sent: Sep 24, 2005 6:30 AM
> >Subject: Re: [HACKERS] [PERFORM] Releasing memory during External
> sorting?
> >
> >... the amount of IO done is the most
> >important of the things that you should be optimizing for in
> >choosing an external sorting algorithm.
> >
> > <snip>
> >
> >Since sorting is a fundamental operation in many parts of a DBMS,
> >this is a Big Deal.
> >
> >This discussion has gotten my creative juices flowing.  I'll post
> >some Straw Man algorithm sketches after I've done some more
> >thought.
> >
> As a thought experiment, I've been considering the best way to sort
> 1TB (2^40B) of 2-4KB (2^11-2^12B) records.  That's 2^28-2^29 records.
> 
> Part I: A Model of the System
> The performance of such external sorts is limited by HD IO, then
> memory IO, and finally CPU throughput.
> 
> On commodity HW, single HD IO is ~1/2048 (single HD realistic worst
> case) to ~1/128 (single HD best case: no more than one seek every
> ~14.7ms for a ~50MB/s 7200rpm SATA II HD) the throughput of RAM.
> 
> RAID HD IO will be in the range from as low as a single HD (RAID 1) to
> ~1/8 (a RAID system saturating the external IO bus) the throughput of
> RAM.
> 
> RAM is ~1/8-1/16 the throughput and ~128x the latency of the data
> pathways internal to the CPU.
> 
> This model suggests that HD IO will greatly dominate every other
> factor, particularly if we are talking about a single HD rather than
> a peripheral-bus-saturating RAID subsystem.  If at all possible, we
> want to access the HD subsystem only once for each data item,

If you can achieve that, I think you should be given a Nobel Prize, and
I mean that sincerely.  I also think that your analysis is interesting.

> and we want
> to avoid seeking more than the critical number of seeks implied above
> when doing it.  It also suggests that at a minimum, it's worth it to
> spend ~8 memory operations or ~64 CPU operations to avoid a HD access.
> Far more than that if we are talking about a single random access.
> 
> It's worth spending ~128 CPU operations to avoid a single random RAM
> access, and literally 10's or even 100's of thousands of CPU
> operations to avoid a random HD access.  In addition, there are many
> indications in current ECE and IT literature that the performance
> gaps between these pieces of computer systems are increasing and
> expected to continue to do so for the foreseeable future.  In short,
> _internal_ sorts have some, and are going to increasingly have more,
> of the same IO problems usually associated with external sorts.
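The break-even arithmetic in the quoted paragraph can be made concrete with a small sketch (Python purely for illustration; the seek time is the figure quoted above, while the 3GHz clock is an assumed commodity value, not from the post):

```python
# Rough cost-model sketch using the figures from the post above:
# a ~14.7ms random seek, priced in CPU cycles on an assumed 3GHz core.
seek_ms = 14.7        # one seek per ~14.7ms (7200rpm SATA II, from the post)
clock_hz = 3.0e9      # assumed commodity CPU clock (illustrative)

cycles_per_seek = seek_ms * 1e-3 * clock_hz
print(f"~{cycles_per_seek / 1e6:.0f} million cycles lost per random seek")

# The post's in-memory ratio: ~128 CPU operations per random RAM access.
cpu_ops_per_ram_access = 128
print(f"one seek costs as much as ~{cycles_per_seek / cpu_ops_per_ram_access:,.0f} random RAM accesses")
```

On those assumptions a single random seek costs tens of millions of cycles, which is where the "10's or even 100's of thousands of CPU operations" trade-off comes from.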

Knuth has made the observation (confirmed by others) that 40% of
mainframe CPU cycles are spent on sorting.  Hence, any sort of
optimization in this area is a potential for enormous savings.

> Part II: a Suggested Algorithm
> The simplest case is one where we have to order the data using a key
> that only has two values.

I suggest testing against a very large class of distributions.  All of
the common statistical models are a start (Gaussian, Poisson, etc.),
and also the degenerate cases: a single value, two distinct values,
and so on up to some limit.
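A test harness along those lines can be built from the standard library alone (a sketch; N and the parameters are arbitrary, and since `random` has no Poisson generator an exponential stands in for it here):

```python
# Sketch of the suggested test distributions for a sort routine.
# Gaussian and exponential from the stdlib; plus the degenerate
# single-value and two-value cases.
import random

random.seed(0)
N = 1_000
test_sets = {
    "gaussian":    [random.gauss(0, 1) for _ in range(N)],
    "exponential": [random.expovariate(1.0) for _ in range(N)],
    "single":      [42] * N,                          # one distinct value
    "two_values":  [random.choice((0, 1)) for _ in range(N)],
}

def sort_under_test(data):
    return sorted(data)   # stand-in for the external sort being evaluated

for name, data in test_sets.items():
    assert sort_under_test(data) == sorted(data), name
```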

> Given 2^40B of data using 2KB or 4KB per record, the most compact
> representation we can make of such a data set is to assign a 32b= 4B
> RID or Rptr for location + a 1b key for each record.  Just the RIDs
> would take up 1.25GB (250M records) or 2.5GB (500M records).  Enough
> space that even an implied ordering of records may not fit into RAM.
>
> Still, sorting 1.25GB or 2.5GB of RIDs is considerably less
> expensive in terms of IO operations than sorting the actual 1TB of
> data.
>
> That IO cost can be lowered even further if instead of actually
> physically sorting the RIDs, we assign a RID to the appropriate
> category inside the CPU as we scan the data set, and append the
> entries in a category from CPU cache to a RAM file in one IO burst
> whenever said category gets full inside the CPU.  We can do the same
> with either RAM file to HD whenever they get full.  The sorted order
> of the data is found by concatenating the appropriate files at the
> end of the process.
> 
> As simple as this example is, it has many of the characteristics we
> are looking for:
> A= We access each piece of data once on HD and in RAM.
> B= We do the minimum amount of RAM and HD IO, and almost no random
> IO in either case.
> C= We do as much work as possible within the CPU.
> D= This process is stable.  Equal keys stay in the original order
> they are encountered.
> 
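A minimal sketch of the two-value scheme described above (Python rather than the server's C; the tiny buffer size and the in-memory lists standing in for the RAM/HD files are illustrative assumptions):

```python
# Two-category streaming partition: one sequential pass assigns each
# record's RID to a category buffer; a full buffer is appended ("burst")
# to that category's file, and the sorted order is the concatenation of
# the two files.  The process is stable: within a category, RIDs stay
# in encounter order.
BURST = 4  # pretend CPU-cache-sized buffer (illustrative)

def two_key_sort(records):
    """records: iterable of (rid, key) pairs with key in {0, 1}.
    Returns the RIDs in stable, sorted-by-key order."""
    files = {0: [], 1: []}   # stand-ins for the two RAM/HD files
    bufs = {0: [], 1: []}    # per-category in-"cache" buffers
    for rid, key in records:
        bufs[key].append(rid)
        if len(bufs[key]) == BURST:     # category full: one IO burst
            files[key].extend(bufs[key])
            bufs[key].clear()
    for k in (0, 1):                    # flush the partial buffers
        files[k].extend(bufs[k])
    return files[0] + files[1]          # concatenation = sorted order

print(two_key_sort([(1, 1), (2, 0), (3, 1), (4, 0)]))  # [2, 4, 1, 3]
```

Note that the data records themselves are never moved; only RIDs are written, which is where the IO saving in the quoted text comes from.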
> To generalize this method, we first need our 1b Key to become a
> sufficiently large Key or KeyPrefix to be useful, yet not so big as
> to be CPU cache unfriendly.
>
> Cache lines (also sometimes called "blocks") are usually 64B= 512b
> in size.  Therefore our RID+Key or KeyPrefix should never be larger
> than this.  For a 2^40B data set, a 5B RID leaves us with
> potentially as much as 59B of Key or KeyPrefix.  Since the data
> can't take on more than 40b worth of different values (actually
> 500M= 29b for our example), we have more than adequate space for Key
> or KeyPrefix.  We just have to figure out how to use it effectively.
> A typical CPU L2 cache can hold 10's or 100's of thousands of such
> cache lines.  That's enough that we should be able to do a
> significant amount of useful work within the CPU w/o having to go
> off-die.
> 
> The data structure we are using to represent the sorted data also
> needs to be generalized.  We want a space efficient DS that allows
> us to find any given element in as few accesses as possible, and
> that allows us to insert new elements or reorganize the DS as
> efficiently as possible.  This being a DB discussion list, a B+ tree
> seems like a fairly obvious suggestion ;-)
>
> A B+ tree where each element is no larger than a cache line and no
> node is larger than what fits into L2 cache can be created
> dynamically as we scan the data set via any of the fast, low IO
> methods well known for doing so.  Since the L2 cache can hold 10's
> of thousands of cache lines, it should be easy to make sure that the
> B+ tree has something like 1000 elements per node (making the base
> of the logarithm for access at least 1000).  The log base 1000 of
> 500M is ~2.9, so even in the absolute worst case where every one of
> the 500M records is unique, we can find any given element in less
> than 3 accesses of the B+ tree.  Increasing the order of the B+ tree
> is an option to reduce average accesses even further.
>
> Since the DS representing the sorted order of the data is a B+ tree,
> it's very "IO friendly" if we need to store part or all of it on HD.
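The access-depth claim above is easy to check numerically (a quick sketch; the 1000-way fanout and the 500M record count are the post's own figures):

```python
# Depth of a B+ tree with ~1000 elements per node over 500M unique
# keys: log base 1000 of 500M, as claimed in the post.
import math

records = 500_000_000
fanout = 1000
depth = math.log(records, fanout)
print(f"log_{fanout}({records:,}) = {depth:.2f}")  # ~2.90, so <3 node accesses
```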
> 
> In a multiprocessor environment, we can assign chunks of the data
> set to different CPUs, let them build their independent B+ trees to
> represent the data in sorted order from their POV, and then merge
> the B+ trees very efficiently into one overall DS to represent the
> sorted order of the entire data set.
>
> Finally, since these are B+ trees, we can keep them around and
> easily update them at will for frequently used sorting conditions.
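The per-CPU-chunk idea above can be sketched with independently sorted runs and a single streaming k-way merge (heapq.merge stands in for the B+ tree merge; the chunking and worker count are illustrative assumptions):

```python
# Each "worker" orders its own chunk independently; the per-chunk
# orderings are then combined in one streaming merge pass.
import heapq

def parallel_style_sort(data, n_workers=4):
    chunk = (len(data) + n_workers - 1) // n_workers
    # One sorted run per worker's chunk of the data set.
    runs = [sorted(data[i:i + chunk]) for i in range(0, len(data), chunk)]
    # Single streaming merge of the sorted runs (stand-in for the
    # B+ tree merge described in the post).
    return list(heapq.merge(*runs))

print(parallel_style_sort([5, 3, 8, 1, 9, 2, 7, 4]))  # [1, 2, 3, 4, 5, 7, 8, 9]
```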
> 
> What do people think?

I think that your analysis is very interesting.  I would like to see the
result of the experiment.

I think that the btrees are going to be O(n*log(n)) in disk accesses
during index construction unless you memory map them [which means you
would need a stupendous memory volume], and so I cannot say that I
really understand your idea yet.  Can you draw a picture of it for me?
(I am dyslexic and understand things far better when I can visualize
them.)
