Re: Memory usage during sorting

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Memory usage during sorting
Date: 2012-02-04 18:06:38
Message-ID: CAMkU=1w62NAULcpX26Ybv+VvQULHjfDZRuUPQM3xsMdsjNsSSg@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jan 21, 2012 at 4:51 PM, Peter Geoghegan <peter(at)2ndquadrant(dot)com> wrote:
> On 16 January 2012 00:59, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>> I think it would be better to pre-deduct the tape overhead amount we
>> will need if we decide to switch to tape sort from the availMem before
>> we even start reading (and then add it back if we do indeed make that
>> switch).  That way we wouldn't over-run the memory in the first place.
>>  However, that would cause apparent regressions in which sorts that
>> previously fit into maintenance_work_mem no longer do.  Boosting
>> maintenance_work_mem to a level that was actually being used
>> previously would fix those regressions, but pointing out that the
>> previous behavior was not optimal doesn't change the fact that people
>> are used to it and perhaps tuned to it.  So the attached patch seems
>> more backwards-friendly.
>
> Hmm. Are people really setting maintenance_work_mem such that it is
> exactly large enough to quicksort when building an index in one case
> but not another?

It could also apply to work_mem in other situations. I'm just using
maintenance_work_mem in my example because index creation is what led
me into this little diversion, and so that is what my test bed is set
up to use.

Some people do have very stable, "ritualized" workloads and so could
have tuned exactly to them; I don't know how common that would be.
More common might be synthetic benchmarks that have been tuned, either
intentionally or accidentally, to just barely fit in the (overshot)
memory. In that case it would be a perception problem: there would
appear to be a regression when in fact the tuning had merely become
out of date.

> Is the difference large enough to warrant avoiding
> pre-deduction?

Switching to an external sort when the whole thing could have been
done by quicksort (by cheating just a little bit on memory usage)
makes a pretty big difference, over 2-fold in my tests. If the
fast-path optimization for qsort goes in, the difference might get
even bigger. Of course, there will always be some point at which that
switch-over must occur, so the real question is: do we need to keep
that switch point historically consistent to avoid nasty surprises
for people with excessively fine-tuned *work_mem settings based on
the old behavior? And do such people even exist outside of my
imagination?
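
To make the pre-deduction idea from the quoted text concrete, here is
a minimal standalone sketch with made-up names rather than the real
tuplesort.c symbols (the real code tracks its budget through
state->availMem and the USEMEM/FREEMEM/LACKMEM macros): reserve the
tape overhead up front so the in-memory phase can never overrun the
budget, and hand the reservation back only if the sort actually
spills to tapes.

/*
 * Hypothetical sketch only, not tuplesort.c code.  All names here
 * (SortBudget, tapeReserve, ...) are invented for illustration.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct SortBudget
{
    long    allowedMem;   /* work_mem / maintenance_work_mem, in bytes */
    long    availMem;     /* what the in-memory sort may still use */
    long    tapeReserve;  /* pre-deducted overhead for a possible tape sort */
} SortBudget;

static void
budget_init(SortBudget *b, long allowedMem, long tapeOverhead)
{
    b->allowedMem = allowedMem;
    b->tapeReserve = tapeOverhead;
    /* Pre-deduct: the quicksort path simply never sees this memory. */
    b->availMem = allowedMem - tapeOverhead;
}

static bool
budget_lackmem(const SortBudget *b)
{
    return b->availMem < 0;
}

static void
budget_switch_to_tapes(SortBudget *b)
{
    /* Credit the reservation back; the tape buffers will now consume it. */
    b->availMem += b->tapeReserve;
    b->tapeReserve = 0;
}

int
main(void)
{
    SortBudget b;

    budget_init(&b, 16L * 1024 * 1024, 256L * 1024);  /* 16MB budget, 256kB reserve */
    printf("in-memory budget: %ld bytes, lacking memory: %d\n",
           b.availMem, (int) budget_lackmem(&b));

    budget_switch_to_tapes(&b);
    printf("budget once tapes are in play: %ld bytes\n", b.availMem);
    return 0;
}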

But a bigger question I now have is if any of this matters. Since
there is currently no way to know how many connections might be
running how many concurrent sorts, there is really no way to tune
work_mem with much precision. So, does it matter if we slightly lie
about how much we will actually use? And is it worth worrying about
whether every last drop of memory is used efficiently?

I was trying to compare the performance I was getting with a
theoretical model of what I should get, and it was confusing to me
that we were using an initially defined amount of memory for the
priority heap and then later reducing that amount. That is annoying
because it complicates the theoretical model, and it became even more
annoying when I realized that the "freed" memory that results from
doing this is too fragmented to be used for other purposes anyway.
But theoretical annoyances aside, it is hardly the worst thing about
memory usage in sorting; it is just the one standing in the way of my
further study.

So I don't think this patch I proposed is particularly important in
its own right. I want to see if anyone will point out that I am all
wet because my analysis failed to take <foo> into account. I
probably should have explained this when I submitted the patch.

The worst thing about the current memory usage is probably that big
machines can't qsort more than 16,777,216 tuples no matter how much
memory they have, because memtuples has to live entirely within a
single allocation, which is currently limited to 1GB. It is probably
too late to fix this problem for 9.2. I wish I had started there
rather than here.

This 16,777,216-tuple limitation will become even more unfortunate if
the specializations that speed up qsort, but not external sort, get
accepted.
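
To illustrate where a ceiling like that comes from, here is a toy
standalone model, not the actual grow_memtuples() logic: the memtuples
array must live in a single palloc chunk, palloc requests are capped
at MaxAllocSize (0x3fffffff, i.e. 1GB - 1, in memutils.h), and the
array grows by doubling. The 32-byte per-entry size below is just an
assumption for illustration; the real ceiling depends on
sizeof(SortTuple) and the exact growth test.

#include <stdio.h>
#include <stddef.h>

#define MAX_ALLOC_SIZE ((size_t) 0x3fffffff)    /* mirrors MaxAllocSize */

int
main(void)
{
    size_t entry_size = 32;     /* assumed bytes per sort-tuple entry */
    size_t ntuples = 1024;      /* tuplesort starts with a small array */

    /* Keep doubling while the doubled request would still be legal. */
    while (ntuples * 2 * entry_size <= MAX_ALLOC_SIZE)
        ntuples *= 2;

    printf("array ceiling: %zu entries (%zu MB), no matter how large work_mem is\n",
           ntuples, ntuples * entry_size / (1024 * 1024));
    return 0;
}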

Attached is a completely uncommittable proof-of-concept/work-in-progress
patch to get around the limitation. It shows a 2-fold improvement when
indexing an integer column on a 50,000,000-row, randomly ordered
table.

As for what it would take to make this committable, it seems like
maybe there should be two different MaxAllocSizes: one of them
determined by the constraint that aset.c assumes it can compute twice
an allocation's size without overflow. I would think that simply
testing the width of size_t at compile time would be enough for that.
Allowing allocations of more than 1GB would probably be pointless on
32-bit machines anyway, and a 64-bit size_t can hold twice a value
much larger than 1GB without overflow.
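
A minimal sketch of what I mean, where MaxAllocSize really is
((Size) 0x3fffffff) in memutils.h today but everything with "Huge" in
its name is hypothetical: keep the current 1GB limit wherever callers
depend on it (anything that could end up as a varlena, for instance),
and let 64-bit builds use a larger cap chosen so that doubling a
request can never overflow.

/*
 * Hypothetical sketch; the "Huge" names and values are not in
 * PostgreSQL.  A compile-time test of size_t picks the larger limit
 * only where it is actually usable.
 */
#include <stdint.h>
#include <stdio.h>

#define MaxAllocSize        ((size_t) 0x3fffffff)   /* 1GB - 1, as today */

#if SIZE_MAX > 0xffffffffUL
/* 64-bit: cap at half of SIZE_MAX so "size * 2" can never wrap around. */
#define MaxAllocHugeSize    (SIZE_MAX / 2)
#else
/* 32-bit: anything past 1GB is pointless anyway; keep the old limit. */
#define MaxAllocHugeSize    MaxAllocSize
#endif

#define AllocSizeIsValid(size)      ((size_t) (size) <= MaxAllocSize)
#define AllocHugeSizeIsValid(size)  ((size_t) (size) <= MaxAllocHugeSize)

int
main(void)
{
    printf("MaxAllocSize     = %zu\n", (size_t) MaxAllocSize);
    printf("MaxAllocHugeSize = %zu\n", (size_t) MaxAllocHugeSize);
    return 0;
}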

For the varlena stuff, I don't know what to do. Is an "I swear to God
I won't pass this pointer to one of those functions" promise from the
caller sufficient? Or does each such function need to test for it?

And I'm pretty sure that palloc, not just repalloc, would need an
over-size alternative to make this a general-purpose improvement.
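
For instance, again with purely hypothetical names, and with plain
malloc/realloc standing in for the memory-context machinery so the
sketch stays self-contained: the ordinary entry point keeps today's
limit, while over-size variants check against the larger one.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define MaxAllocSize        ((size_t) 0x3fffffff)   /* today's 1GB - 1 cap */
#define MaxAllocHugeSize    (SIZE_MAX / 2)          /* hypothetical 64-bit cap */

/* Ordinary requests keep the existing limit... */
static void *
palloc_sketch(size_t size)
{
    if (size > MaxAllocSize)
    {
        fprintf(stderr, "invalid memory alloc request size %zu\n", size);
        abort();
    }
    return malloc(size);
}

/* ...while callers that know what they are doing may go past 1GB. */
static void *
palloc_huge_sketch(size_t size)
{
    if (size > MaxAllocHugeSize)
    {
        fprintf(stderr, "invalid memory alloc request size %zu\n", size);
        abort();
    }
    return malloc(size);
}

static void *
repalloc_huge_sketch(void *ptr, size_t size)
{
    if (size > MaxAllocHugeSize)
    {
        fprintf(stderr, "invalid memory alloc request size %zu\n", size);
        abort();
    }
    return realloc(ptr, size);
}

int
main(void)
{
    size_t big = (size_t) 3 * 1024 * 1024 * 1024;   /* 3GB: over the 1GB cap */

    printf("ordinary limit allows 3GB? %s\n",
           big <= MaxAllocSize ? "yes" : "no");
    printf("huge limit allows 3GB?     %s\n",
           big <= MaxAllocHugeSize ? "yes" : "no");

    /* Exercise each path with small, safe sizes. */
    free(palloc_sketch(1024));
    free(repalloc_huge_sketch(palloc_huge_sketch(1024), 2048));
    return 0;
}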

Cheers,

Jeff

Attachment Content-Type Size
memtuples_POCWIP.patch application/octet-stream 3.1 KB
