From: The Hermit Hacker <scrappy(at)hub(dot)org>
To: Bruce Momjian <maillist(at)candle(dot)pha(dot)pa(dot)us>
Cc: Michael Richards <miker(at)scifair(dot)acadiau(dot)ca>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [HACKERS] sorting big tables :(
Date: 1998-05-20 02:53:44
Message-ID: Pine.BSF.3.96.980519234300.7420A-100000@thelab.hub.org
Lists: pgsql-hackers
On Tue, 19 May 1998, Bruce Momjian wrote:
> >
> > On Sun, 17 May 1998, Bruce Momjian wrote:
> >
> > > > > > I have a big table. 40M rows.
> > > > > > On the disk, it's size is:
> > > > > > 2,090,369,024 bytes. So 2 gigs. On a 9 gig drive I can't sort this table.
> > > > > > How should one decide based on table size how much room is needed?
> > >
> > > Tape sort is a standard Knuth sorting. It basically sorts in pieces,
> > > and merges. If you don't do this, the accessing around gets very poor
> > > as you page fault all over the file, and the cache becomes useless.
> > Right. I wasn't reading the right chapter. Internal sorting is much
> > different from external sorting. Internal sorting suggests the use of a
> > Quicksort algorithm.
> > Marc and I discussed over lunch. If I did a select * into, would it not
> > make more sense to sort the results into the resulting table rather than
> > into pieces and then copy into a table? From my limited knowledge, I
> > think this should save 8/7 N the space.
> > In this issue, I think there must be a lot more overhead than necessary.
>
> Not sure if the internal tape is the same structure as a real table, but
> I doubt it. I seem to remember there is less overhead.
>
> > The table consists of only
> > int4, int4, int2
> > I read 10 bytes / row of actual data here.
> > Instead, 40M/2gigs is about
> > 50 bytes / record
> > What is there other than oid (4? bytes)
>
> Internal stuff so it looks like a real table, even though it is a
> result, I think.
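For reference, the tape sort Bruce describes above (sort pieces that fit in memory, then merge the sorted runs) can be sketched as follows; the chunk size and file handling here are illustrative only, not what the backend actually does:

```python
import heapq
import tempfile

def external_sort(infile, outfile, chunk_lines=100_000):
    """Sort a large text file with bounded memory: sorted runs + k-way merge."""
    runs = []
    with open(infile) as f:
        while True:
            # Read one chunk small enough to sort in memory.
            chunk = [line for _, line in zip(range(chunk_lines), f)]
            if not chunk:
                break
            chunk.sort()  # internal sort of this run
            run = tempfile.TemporaryFile("w+")
            run.writelines(chunk)
            run.seek(0)
            runs.append(run)
    with open(outfile, "w") as out:
        # k-way merge: only one line per run is in memory at a time.
        out.writelines(heapq.merge(*runs))
    for run in runs:
        run.close()
```

Note the disk cost: at its peak this sketch can have the input, all the runs, and the output on disk at once, roughly 3x the data size, which is part of why an external sort wants so much scratch space.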
Okay...I get to jump in here with both feet and arms flailing :)
Michael and I had lunch today and talked about this, and I asked him to
send an email in to the list about it...unfortunately, he didn't translate
our chat very well for here :)
This whole thing makes absolutely no sense to me, as far as why it takes
2.5 times more space to *sort* the table than the table itself takes.
He starts with a 2gig table, and it runs out of disk space on a 9gig file
system...
Now, looking at question 3.26 in the FAQ, we have:

    40 bytes   each row header (approximate)
  + 10 bytes   two int4 fields + one int2 field
  +  4 bytes   pointer on page to tuple
  --------
    54 bytes per row
The data page size in PostgreSQL is 8192 (8K) bytes, so:

  8192 bytes per page
  ------------------- = 151 rows per database page (rounded down)
   54 bytes per row

  40,000,000 data rows
  -------------------- = 264,901 database pages
  151 rows per page

  264,901 database pages * 8192 bytes per page = 2,170,068,992 or ~2 gig

which is right in line with the 2,090,369,024 bytes observed on disk.
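As a sanity check, here is the same arithmetic in a few lines; the constants are the FAQ's approximations, not exact on-disk numbers:

```python
import math

ROW_HEADER = 40      # per-row header, approximate (FAQ 3.26)
ROW_DATA = 10        # two int4 fields + one int2 field
ITEM_POINTER = 4     # per-tuple pointer on the page
PAGE_SIZE = 8192     # 8K data page

bytes_per_row = ROW_HEADER + ROW_DATA + ITEM_POINTER   # 54
rows_per_page = PAGE_SIZE // bytes_per_row             # 151 (rounded down)
pages = math.ceil(40_000_000 / rows_per_page)          # 264,901
table_bytes = pages * PAGE_SIZE                        # ~2.17e9, i.e. ~2 gig

print(bytes_per_row, rows_per_page, pages, table_bytes)
```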
Now, as a text file, this would amount to, what...a few hundred megs?
So, if I were to do a 'copy out' to a text file, a Unix sort and then a
'copy in', I would use up *less* disk space (by a wide margin) than doing
the sort inside of PostgreSQL?
Why?
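For scale, a rough bound on the dumped text size, assuming each row prints as two int4s and one int2 separated by single spaces (the digit widths are the unknown here):

```python
N_ROWS = 40_000_000

# Smallest plausible row: single-digit values -> "1 2 3\n"
low = N_ROWS * len("1 2 3\n")                         # 240,000,000 bytes
# Widest possible row: max int4/int4/int2 values
high = N_ROWS * len("2147483647 2147483647 32767\n")  # 1,120,000,000 bytes

print(low, high)
```

Even at the top of that range, the flat file plus a Unix sort's temporary copies stays well under the space the internal sort was demanding, which is the puzzle being pointed at here.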
Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy(at)hub(dot)org secondary: scrappy(at){freebsd|postgresql}.org