From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Memory usage during sorting
Date: 2012-04-13 05:01:35
Message-ID: 1334293295.22913.21.camel@jdavis
Lists: pgsql-hackers

On Sun, 2012-03-18 at 11:25 -0400, Tom Lane wrote:
> Yeah, that was me, and it came out of actual user complaints ten or more
> years back. (It's actually not 2X growth but more like 4X growth
> according to the comments in logtape.c, though I no longer remember the
> exact reasons why.) We knew when we put in the logtape logic that we
> were trading off speed for space, and we accepted that.

I skimmed through TAOCP but didn't find the 4X number you are
referring to, and I can't think what would cause it, either. The exact
wording of the comment in logtape.c is "4X the actual data volume", so
maybe that's just referring to per-tuple overhead?

However, I also noticed that section 5.4.4 (Vol 3 p299) starts
discussing the idea of running the tapes backwards and forwards. That
doesn't directly apply, because a disk seek is cheaper than rewinding a
tape, but perhaps it could be adapted to get the block-freeing behavior
we want. The comments in logtape.c say:

"Few OSes allow arbitrary parts of a file to be released back to the
OS, so we have to implement this space-recycling ourselves within a
single logical file."

But if we alternate between reading the runs in forward and reverse
order, all of the deallocations happen at the end of the file, and we
can simply truncate it to free the space (a rough sketch follows
below). I would think that the OS could be more intelligent than we are
about block allocation and deallocation to avoid too much
fragmentation, and it would probably be a net reduction in code
complexity. Again, the comments in logtape.c have something to say
about it:

"...but the seeking involved should be comparable to what would
happen if we kept each logical tape in a separate file"

But I don't understand why that is the case.
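
Just to make the truncation step concrete, here is a minimal sketch
(the helper and its caller are hypothetical, not logtape.c's actual
API). Assuming the alternating forward/reverse reads guarantee that
every freed block sits at the tail of the temp file, one ftruncate()
gives the space back to the OS:

    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ 8192    /* PostgreSQL's default block size */

    /*
     * Hypothetical helper: release every block from
     * first_unused_block onward back to the OS.  Only valid if all
     * freed blocks are at the tail of the file, which is what the
     * alternating read order is meant to guarantee.
     */
    static int
    release_tail(int fd, long first_unused_block)
    {
        return ftruncate(fd, (off_t) first_unused_block * BLCKSZ);
    }

Whether the filesystem then avoids fragmentation when the file grows
again is, of course, up to the OS.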

On another topic, quite a while ago Len Shapiro (a PSU professor)
suggested to me that we implement Algorithm F (Vol 3 p321). Right now,
as tuplesort.c says:

"In the current code we determine the number of tapes M on the basis
of workMem: we want workMem/M to be large enough that we read a fair
amount of data each time we preread from a tape"

But with forecasting, we can be a little smarter about which tapes we
preread from when the data in the runs is not random. That means we
could potentially merge more runs at once with the same work_mem
without sacrificing adequate buffering for prereading. I'm not sure
whether this is a practical problem today, and I'm also not sure what
to do if we start merging many more runs and then find that forecasting
doesn't work as well as we'd hoped (e.g. because the data in the runs
really is random). But I thought it was worth mentioning.
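
For what it's worth, the core of the forecasting step is tiny. This is
a hedged sketch with made-up structures (not tuplesort.c's): in an
M-way merge, the input buffer that empties first is the one whose last
buffered key is smallest, so that's the tape to preread from next:

    #include <stdbool.h>

    typedef struct Tape
    {
        long    last_key;   /* largest key currently in this tape's buffer */
        bool    exhausted;  /* no more data to preread from this tape */
    } Tape;

    static int
    forecast_next_tape(const Tape *tapes, int ntapes)
    {
        int     best = -1;
        int     i;

        for (i = 0; i < ntapes; i++)
        {
            if (tapes[i].exhausted)
                continue;
            if (best < 0 || tapes[i].last_key < tapes[best].last_key)
                best = i;
        }
        return best;    /* this buffer runs dry first, so refill it next */
    }

The interesting part is everything around it: sizing the buffers and
choosing M, which is where the tradeoff against work_mem shows up.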

Regards,
Jeff Davis
