> -----Original Message-----
> From: pgsql-hackers-owner(at)postgresql(dot)org [mailto:pgsql-hackers-
> owner(at)postgresql(dot)org] On Behalf Of Ron Peacetree
> Sent: Monday, September 26, 2005 10:47 AM
> To: pgsql-hackers(at)postgresql(dot)org; pgsql-performance(at)postgresql(dot)org
> Subject: [HACKERS] [PERFORM] A Better External Sort?
> >From: Ron Peacetree <rjpeace(at)earthlink(dot)net>
> >Sent: Sep 24, 2005 6:30 AM
> >Subject: Re: [HACKERS] [PERFORM] Releasing memory during External
> >... the amount of IO done is the most
> >important of the things that you should be optimizing for in
> >choosing an external sorting algorithm.
> > <snip>
> >Since sorting is a fundamental operation in many parts of a DBMS,
> >this is a Big Deal.
> >This discussion has gotten my creative juices flowing. I'll post
> >some Straw Man algorithm sketches after I've done some more thinking.
> As a thought experiment, I've been considering the best way to sort 1TB
> (2^40B) of 2-4KB (2^11-2^12B) records. That's 2^28-2^29 records.
> Part I: A Model of the System
> The performance of such external sorts is limited by HD IO, then
> memory IO, and finally CPU throughput.
> On commodity HW, single HD IO is ~1/2048 (single HD realistic worst
> case) to ~1/128 (single HD best case: no more than one seek every
> ~14.7ms for a ~50MB/s 7200rpm SATA II HD) the throughput of RAM.
> RAID HD IO will be in the range from as low as a single HD (RAID 1) to
> ~1/8 (a RAID system saturating the external IO bus) the throughput of
> RAM. RAM, in turn, is ~1/8-1/16 the throughput and ~128x the latency
> of the data pathways internal to the CPU.
> This model suggests that HD IO will greatly dominate every other
> factor, particularly if we are talking about a single HD rather than a
> peripheral-bus-saturating RAID subsystem. If at all possible, we want
> to access the HD subsystem only once for each data item,
If you can achieve that, I think you should be given a Nobel Prize, and
I mean that sincerely. I also think that your analysis is interesting.
> and we want
> to avoid seeking more than the critical number of times implied above
> when doing it. It also suggests that at a minimum, it's worth it to
> spend ~8 memory operations or ~64 CPU operations to avoid a HD access.
> Far more than that if we are talking about a single random access.
> It's worth spending ~128 CPU operations to avoid a single random RAM
> access, and literally 10's or even 100's of thousands of CPU
> operations to avoid a random HD access. In addition, there are many
> indications in current ECE and IT literature that the performance gaps
> between these pieces of computer systems are increasing and are
> expected to continue to do so for the foreseeable future. In short,
> _internal_ sorts have some, and are going to increasingly have more,
> of the same IO problems usually associated with external sorts.
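The ratios in the model above can be sanity-checked with a quick back-of-envelope calculation. The RAM throughput figure below is implied by the post's numbers (50MB/s HD at ~1/128 of RAM), not stated directly; everything here is illustrative, not measured:

```python
# Back-of-envelope check of the cost model's ratios. All figures are
# the post's assumptions (or derived from them), not measurements.
ram_mb_s = 6_400          # implied RAM throughput: 50MB/s * 128
hd_mb_s = 50              # ~7200rpm SATA II sequential read, MB/s
seek_s = 0.0147           # assumed worst case: one seek per ~14.7ms

# Streaming best case: a single HD is ~1/128 the throughput of RAM.
stream_ratio = ram_mb_s // hd_mb_s

# Each seek forfeits the bandwidth of ~0.735MB of sequential transfer,
# which is why a random HD access is worth tens of thousands of CPU
# operations rather than "only" the 1/128 streaming ratio.
mb_lost_per_seek = hd_mb_s * seek_s
print(stream_ratio, mb_lost_per_seek)
```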
Knuth has made the observation (confirmed by others) that 40% of
mainframe CPU cycles are spent on sorting. Hence, any sort of
optimization in this area is a potential for enormous savings.
> Part II: a Suggested Algorithm
> The simplest case is one where we have to order the data using a key
> that only has two values.
I suggest testing against a very large class of distributions. All of
the common statistical models are a start (Gaussian, Poisson, etc.),
plus the degenerate cases: a single value, two distinct values, and so
on up to some limit.
> Given 2^40B of data using 2KB or 4KB per record, the most compact
> representation we can make of such a data set is to assign a 32b= 4B
> RID or Rptr for location + a 1b key for each record. Just the RID's
> would take up 1.25GB (250M records) or 2.5GB (500M records). Enough
> space that even an implied ordering of records may not fit into RAM.
> Still, sorting 1.25GB or 2.5GB of RIDs is considerably less expensive
> in terms of IO operations than sorting the actual 1TB of data.
> That IO cost can be lowered even further if instead of actually
> sorting the RIDs, we assign each RID to the appropriate category
> inside the CPU as we scan the data set, and append the entries in a
> category from CPU cache to a RAM file in one IO burst whenever said
> category gets full inside the CPU. We can do the same from either RAM
> file to HD whenever they get full. The sorted order of the data is
> found by concatenating the appropriate files at the end of the
> process.
> As simple as this example is, it has many of the characteristics we
> are looking for:
> A= We access each piece of data once on HD and in RAM.
> B= We do the minimum amount of RAM and HD IO, and almost no random IO
> in either case.
> C= We do as much work as possible within the CPU.
> D= This process is stable. Equal keys stay in the original order they
> are encountered in.
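The two-value-key pass described above can be sketched in a few lines. This simplified version keeps both buckets in ordinary lists (the real proposal batches bucket flushes from CPU cache to RAM files to HD); RIDs here are just array indices:

```python
# One scan over the records: append each RID to the bucket for its
# 1-bit key, then concatenate the buckets. Appending preserves
# encounter order, so the result is stable.
def binary_key_sort(records, key):
    buckets = ([], [])                   # one bucket per key value
    for rid, rec in enumerate(records):
        buckets[key(rec)].append(rid)
    return buckets[0] + buckets[1]       # implied sorted order

data = ["b", "a", "b", "a", "a"]
order = binary_key_sort(data, key=lambda r: 0 if r == "a" else 1)
# all "a" RIDs first, in original order, then all "b" RIDs
print(order)
```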
> To generalize this method, we first need our 1b Key to become a
> sufficiently large Key or KeyPrefix to be useful, yet not so big as
> to be CPU cache unfriendly. Cache lines (also sometimes called
> "blocks") are usually 64B= 512b in size. Therefore our RID+Key or
> KeyPrefix should never be larger than this. For a 2^40B data set, a
> 5B RID leaves us with potentially as much as 59B of Key or KeyPrefix.
> Since the data can't take on more than 40b worth of distinct values
> (actually 500M= 29b for our example), we have more than adequate
> space for Key or KeyPrefix. We just have to figure out how to use it
> effectively.
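The 64B budget works out as follows. The layout, field names, and example key below are mine, not the post's; it just illustrates that a 5B RID plus a 59B key prefix fills exactly one cache line:

```python
# Hypothetical packing of one sort entry into a single 64B cache line:
# a 5B RID plus up to 59B of key prefix, as budgeted in the post.
CACHE_LINE = 64
RID_BYTES = 5
PREFIX_BYTES = CACHE_LINE - RID_BYTES      # 59B left for the key

def pack_entry(rid: int, key: bytes) -> bytes:
    # Truncate long keys to the prefix budget, zero-pad short ones.
    prefix = key[:PREFIX_BYTES].ljust(PREFIX_BYTES, b"\x00")
    return rid.to_bytes(RID_BYTES, "big") + prefix

entry = pack_entry(123_456_789, b"smith, john q.")
print(len(entry))  # 64: exactly one cache line
```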
> A typical CPU L2 cache can hold 10's or 100's of thousands of such
> entries. That's enough that we should be able to do a significant
> amount of work within the CPU w/o having to go off-die.
> The data structure we are using to represent the sorted data also
> must be generalized. We want a space efficient DS that allows us to
> find any given element in as few accesses as possible, and that
> allows us to insert new elements into the DS as efficiently as
> possible. This being a DB discussion list, a B+ tree seems like a
> fairly obvious suggestion ;-)
> A B+ tree where each element is no larger than a cache line and no
> node is larger than what fits into L2 cache can be created
> dynamically as we scan the data set via any of the fast, low IO
> methods well known for doing so. Since the L2 cache can hold 10's of
> thousands of cache lines, it should be easy to make sure that the B+
> tree has something like 1000 elements per node (making the base of
> the logarithm for access at least 1000). The log base 1000 of 500M is
> ~2.9, so that means that even in the worst case, where every one of
> the 500M records is unique, we can find any given element in less
> than 3 accesses of the B+ tree. Increasing the order of the B+ tree
> is an option to reduce average accesses even further.
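The depth estimate above checks out arithmetically:

```python
import math

# Access depth of a B+ tree with ~1000 elements per node over 500M
# unique keys, as estimated in the post: log_1000(5e8) ~= 2.9, so at
# most ceil(2.9) = 3 node visits in the worst case.
records = 500_000_000
fanout = 1000
depth = math.log(records, fanout)
worst_case_accesses = math.ceil(depth)
print(round(depth, 2), worst_case_accesses)
```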
> Since the DS representing the sorted order of the data is a B+ tree,
> it's very "IO friendly" if we need to store part or all of it on HD.
> In a multiprocessor environment, we can assign chunks of the data set
> to different CPUs, let them build their independent B+ trees to
> represent the data in sorted order from their POV, and then merge the
> B+ trees very efficiently into one overall DS to represent the sorted
> order of the entire data set.
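The final combine step is essentially a k-way merge of sorted runs. As a stand-in for merging the per-CPU trees, here is the same idea over (key, RID) lists using the standard library's `heapq.merge`, which visits each entry exactly once (the runs below are made-up data):

```python
import heapq

# Each CPU emits its chunk's (key, RID) pairs in sorted key order;
# heapq.merge lazily combines the k sorted runs in a single pass.
run_cpu0 = [(1, "rid-0"), (4, "rid-2"), (9, "rid-6")]
run_cpu1 = [(2, "rid-5"), (3, "rid-7")]
merged = list(heapq.merge(run_cpu0, run_cpu1))
print(merged)
```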
> Finally, since these are B+ trees, we can keep them around and easily
> update them at will for frequently used sorting conditions.
> What do people think?
I think that your analysis is very interesting. I would like to see the
result of the experiment.
I think that the btrees are going to cost O(n*log(n)) disk accesses to
construct unless you memory map them [which means you would need
stupendous memory volume], and so I cannot say that I really understand
your idea yet. Can you draw a picture of it for me? (I am dyslexic and
understand things far better when I can visualize them.)