Re: [PERFORM] A Better External Sort?

From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: Dann Corbit <DCorbit(at)connx(dot)com>
Cc: Ron Peacetree <rjpeace(at)earthlink(dot)net>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: [PERFORM] A Better External Sort?
Date: 2005-09-26 21:27:30
Message-ID: 36e6829205092614277b7fa262@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Ron,

Having rested my brain for the last few days, your theory made for
interesting reading... Rather than argue the technical specs, I'd love to
see an implementation :)

-Jonah

On 9/26/05, Dann Corbit <DCorbit(at)connx(dot)com> wrote:
>
>
>
> > -----Original Message-----
> > From: pgsql-hackers-owner(at)postgresql(dot)org [mailto:pgsql-hackers-
> > owner(at)postgresql(dot)org] On Behalf Of Ron Peacetree
> > Sent: Monday, September 26, 2005 10:47 AM
> > To: pgsql-hackers(at)postgresql(dot)org; pgsql-performance(at)postgresql(dot)org
> > Subject: [HACKERS] [PERFORM] A Better External Sort?
> >
> > >From: Ron Peacetree <rjpeace(at)earthlink(dot)net>
> > >Sent: Sep 24, 2005 6:30 AM
> > >Subject: Re: [HACKERS] [PERFORM] Releasing memory during External
> > sorting?
> > >
> > >... the amount of IO done is the most
> > >important of the things that you should be optimizing for in
> > >choosing an external sorting algorithm.
> > >
> > > <snip>
> > >
> > >Since sorting is a fundamental operation in many parts of a DBMS,
> > >this is a Big Deal.
> > >
> > >This discussion has gotten my creative juices flowing. I'll post
> > >some Straw Man algorithm sketches after I've done some more
> > >thought.
> > >
> > As a thought exeriment, I've been considering the best way to sort 1TB
> > (2^40B) of 2-4KB (2^11-2^12B) records. That's 2^28-2^29 records.
> >
> > Part I: A Model of the System
> > The performance of such external sorts is limited by HD IO, then
> > memory IO, and finally CPU throughput.
> >
> > On commodity HW, single HD IO is ~1/2048 (single HD realistic worst
> > case) to ~1/128 (single HD best case. No more than one seek every
> > ~14.7ms for a ~50MB/s 7200rpm SATA II HD) the throughtput of RAM.
> >
> > RAID HD IO will be in the range from as low as a single HD (RAID 1) to
> > ~1/8 (a RAID system saturating the external IO bus) the throughput of
> > RAM.
> >
> > RAM is ~1/8-1/16 the throughput and ~128x the latency of the data
> > pathways internal to the CPU.
> >
> > This model suggests that HD IO will greatly dominate every other
> > factor, particuarly if we are talking about a single HD rather than a
> > peripheral bus saturating RAID subsystem. If at all possible, we want
> > to access the HD subsystem only once for each data item,
>
> If you can achieve that, I think you should be given a Nobel Prize, and
> I mean that sincerely. I also think that your analysis is interesting.
>
> > and we want
> > to avoid seeking more than the critical number of seeks implied above
> > when doing it. It also suggests that at a minimum, it's worth it to
> > spend ~8 memory operations or ~64 CPU operations to avoid a HD access.
> > Far more than that if we are talking about a single random access.
> >
> > It's worth spending ~128 CPU operations to avoid a single random RAM
> > access, and literally 10's or even 100's of thousands of CPU
> operations to
> > avoid a random HD access. In addition, there are many indications in
> > current ECE and IT literature that the performance gaps between these
> > pieces of computer systems are increasing and expected to continue to
> do
> > so for the forseeable future. In short, _internal_ sorts have some,
> and
> > are
> > going to increasingly have more, of the same IO problems usually
> > associated with external sorts.
>
> Knuth has made the observation (confirmed by others) that 40% of
> mainframe CPU cycles are spent on sorting. Hence, any sort of
> optimization in this area is a potential for enormous savings.
>
> > Part II: a Suggested Algorithm
> > The simplest case is one where we have to order the data using a key
> that
> > only has two values.
>
> I suggest testing against a very large class of distributions. All of
> the common statistical models are a start (Gaussian, Poisson, etc.) and
> also single value, two distinct values, to some limit.
>
> > Given 2^40B of data using 2KB or 4KB per record, the most compact
> > representation we can make of such a data set is to assign a 32b= 4B
> RID
> > or Rptr for location + a 1b key for each record. Just the RID's would
> > take up
> > 1.25GB (250M records) or 2.5GB (500M records). Enough space that even
> > an implied ordering of records may not fit into RAM.
> >
> > Still, sorting 1.25GB or 2.5GB of RIDs is considerably less expensive
> in
> > terms
> > of IO operations than sorting the actual 1TB of data.
> >
> > That IO cost can be lowered even further if instead of actually
> physically
> > sorting the RIDs, we assign a RID to the appropriate catagory inside
> the
> > CPU
> > as we scan the data set and append the entries in a catagory from CPU
> > cache
> > to a RAM file in one IO burst whenever said catagory gets full inside
> the
> > CPU.
> > We can do the same with either RAM file to HD whenever they get full.
> The
> > sorted order of the data is found by concatenating the appropriate
> files
> > at the
> > end of the process.
> >
> > As simple as this example is, it has many of the characteristics we
> are
> > looking for:
> > A= We access each piece of data once on HD and in RAM.
> > B= We do the minimum amount of RAM and HD IO, and almost no random IO
> in
> > either case.
> > C= We do as much work as possible within the CPU.
> > D= This process is stable. Equal keys stay in the original order they
> are
> > encountered.
> >
> > To generalize this method, we first need our 1b Key to become a
> > sufficiently large
> > enough Key or KeyPrefix to be useful, yet not so big as to be CPU
> cache
> > unfriendly.
> >
> > Cache lines (also sometimes called "blocks") are usually 64B= 512b in
> > size.
> > Therefore our RID+Key or KeyPrefix should never be larger than this.
> For
> > a 2^40B
> > data set, a 5B RID leaves us with potentially as much as 59B of Key or
> > KeyPrefix.
> > Since the data can't take on more than 40b worth different values
> > (actually 500M= 29b
> > for our example), we have more than adequate space for Key or
> KeyPrefix.
> > We just
> > have to figure out how to use it effectively.
> > A typical CPU L2 cache can hold 10's or 100's of thousands of such
> cache
> > lines.
> > That's enough that we should be able to do a significant amount of
> useful
> > work within
> > the CPU w/o having to go off-die.
> >
> > The data structure we are using to represent the sorted data also
> needs to
> > be
> > generalized. We want a space efficient DS that allows us to find any
> > given element in
> > as few accesses as possible and that allows us to insert new elements
> or
> > reorganize
> > the DS as efficiently as possible. This being a DB discussion list, a
> B+
> > tree seems like
> > a fairly obvious suggestion ;-)
> >
> > A B+ tree where each element is no larger than a cache line and no
> node is
> > larger than
> > what fits into L2 cache can be created dynamically as we scan the data
> set
> > via any of
> > the fast, low IO methods well known for doing so. Since the L2 cache
> can
> > hold 10's of
> > thousands of cache lines, it should be easy to make sure that the B+
> tree
> > has something
> > like 1000 elements per node (making the base of the logarithm for
> access
> > being at least
> > 1000). The log base 1000 of 500M is ~2.9, so that means that even in
> the
> > absolute
> > worst case where every one of the 500M records is unique we can find
> any
> > given
> > element in less than 3 accesses of the B+ tree. Increasing the order
> of
> > the B+ tree is
> > an option to reduce average accesses even further.
> >
> > Since the DS representing the sorted order of the data is a B+ tree,
> it's
> > very "IO friendly"
> > if we need to store part or all of it on HD.
> >
> > In an multiprocessor environment, we can assign chunks of the data set
> to
> > different
> > CPUs, let them build their independant B+ trees to represent the data
> in
> > sorted order from
> > their POV, and then merge the B+ trees very efficiently into one
> overall
> > DS to represent
> > the sorted order of the entire data set.
> >
> > Finally, since these are B+ trees, we can keep them around and easily
> > update them at will
> > for frequent used sorting conditions.
> >
> > What do people think?
>
> I think that your analysis is very interesting. I would like to see the
> result of the experiment.
>
> I think that the btrees are going to be O(n*log(n)) in construction of
> the indexes in disk access unless you memory map them [which means you
> would need stupendous memory volume] and so I cannot say that I really
> understand your idea yet. Can you draw a picture of it for me? (I am
> dyslexic and understand things far better when I can visualize it).
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
> http://archives.postgresql.org
>

--
Respectfully,

Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
http://www.enterprisedb.com/

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2005-09-26 21:31:12 Re: 64-bit API for large objects
Previous Message Jim C. Nasby 2005-09-26 21:21:48 Re: 64-bit API for large objects