Re: Memory usage during sorting

From: Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Memory usage during sorting
Date: 2012-02-14 09:44:51
Message-ID: CAP7QgmkKb+81F_fNsheJBT9uc_8+RmfqNhjiJkYP-A-h3Eg8zA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Feb 11, 2012 at 11:34 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On Wed, Feb 8, 2012 at 1:01 AM, Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com> wrote:
>> On Sun, Jan 15, 2012 at 4:59 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>>>
>>> The attached patch allows it to reuse that memory.  On my meager
>>> system it reduced the time to build an index on an integer column in
>>> a skinny, totally randomly ordered, 200-million-row table by about 3%
>>> from a baseline of 25 minutes.
>>>
>>
>> Just to give a standard review: this patch is a one-line change; it
>> applies cleanly and builds OK.
>>
>> I'm not quite sure what exactly you're trying to accomplish, but it
>> seems to me that it's avoiding the first dumptuples cycle by forcing
>> availMem = 0 even if it's negative.
>
> Yes.  Currently when it switches to the TSS_BUILDRUNS part of a
> tape-sort, it starts by calling WRITETUP a large number of times
> consecutively, to work off the memory deficit incurred by the 3 blocks
> per tape of tape overhead, and then after that calls WRITETUP about
> once per puttuple.  Under my patch, it would only call WRITETUP
> about once per puttuple, right from the beginning.
>
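
To check my understanding, here is a toy model of that deficit (plain
standalone C, not PostgreSQL code; the tape count, block size and
per-tuple memory are numbers I made up):

/*
 * Rough illustration of why the switch to TSS_BUILDRUNS currently
 * starts with a long burst of WRITETUP calls: the tape buffers push
 * availMem negative, and dumping continues until it is non-negative.
 */
#include <stdio.h>

int main(void)
{
    const long blcksz     = 8192;   /* 8 kB block               */
    const int  num_tapes  = 7;      /* assumed number of tapes  */
    const long tuple_size = 64;     /* assumed memory per tuple */

    /* Opening the tapes charges roughly 3 blocks per tape against
     * availMem, which was already near zero, so it goes negative. */
    long avail_mem = -(3L * num_tapes * blcksz);

    /* Current behaviour: dump tuples back-to-back until the deficit
     * is paid off. */
    long burst = 0;
    while (avail_mem < 0)
    {
        avail_mem += tuple_size;    /* memory freed by one dumped tuple */
        burst++;
    }
    printf("consecutive WRITETUP calls before any merging: %ld\n", burst);

    /* As I read the patch, treating the deficit as zero instead means
     * tuples are written out roughly one per incoming tuple from the
     * start, so each one can be merged against the incoming stream. */
    return 0;
}

If that model is right, the burst is a few thousand tuples at small
work_mem settings, which matches your description.
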
>> I read your comments as saying it avoids alternating between reads and
>> writes back and forth over scattered memory, which would defeat the
>> memory cache during the merge phase; but actually it doesn't affect the
>> merge phase, only the initial dump phase, does it?
>
> It affects the building of the runs.  But this building of the runs is
> not a simple dump; it is itself a mini merge phase, in which it merges
> the existing in-memory priority queue against the still-incoming
> tuples from the node which invoked the sort.  By using less memory
> than it could, this means that the resulting runs are smaller than
> they could be, and so will sometimes necessitate an additional layer
> of merging later on.   (This effect is particularly large for the very
> first run being built.  Generally by merging incoming tuples into the
> memory-tuples, you can create runs that are 1.7 times the size of what
> fits in memory.  By wasting some memory, we are getting 1.7 times the
> size of a smaller starting point.  But for the first run, it is worse
> than that.  Most of the benefit that leads to that 1.7 multiplier comes
> at the very early stage of each run-build.  But by initially using the
> full memory, then writing out a bunch of tuples without doing any merge
> of the incoming, we have truncated the part that gives the most benefit.)
>
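
As a sanity check on the 1.7 figure, a toy replacement-selection loop
(standalone C, nothing taken from tuplesort.c; heap size and key
distribution are made up) gives me a first run of roughly 1.7 times the
heap size on random input:

#include <stdio.h>
#include <stdlib.h>

#define HEAP_SIZE 10000

static int heap[HEAP_SIZE];
static int heap_len;

/* Restore the min-heap property starting from slot i. */
static void sift_down(int i)
{
    for (;;)
    {
        int l = 2 * i + 1, r = 2 * i + 2, min = i;
        if (l < heap_len && heap[l] < heap[min]) min = l;
        if (r < heap_len && heap[r] < heap[min]) min = r;
        if (min == i)
            break;
        int tmp = heap[i]; heap[i] = heap[min]; heap[min] = tmp;
        i = min;
    }
}

int main(void)
{
    srand(12345);

    /* Fill memory ("memtuples") with the first HEAP_SIZE inputs. */
    for (heap_len = 0; heap_len < HEAP_SIZE; heap_len++)
        heap[heap_len] = rand();
    for (int i = HEAP_SIZE / 2 - 1; i >= 0; i--)
        sift_down(i);

    long run_len = 0;

    /* Build the first run: write out the smallest tuple, then read the
     * next incoming one.  If it sorts after what was just written it
     * joins the current run (the heap stays full); otherwise it has to
     * wait for the next run, so the current run's heap shrinks by one. */
    while (heap_len > 0)
    {
        int written = heap[0];
        run_len++;

        int incoming = rand();
        if (incoming >= written)
            heap[0] = incoming;         /* merged into the current run */
        else
            heap[0] = heap[--heap_len]; /* held back for the next run  */
        sift_down(0);
    }

    printf("heap of %d tuples gave a first run of %ld tuples (%.2fx)\n",
           HEAP_SIZE, run_len, (double) run_len / HEAP_SIZE);
    return 0;
}

(Nothing new there, just convincing myself of the multiplier you
mention.)
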
> My analysis that the "freed" memory is never reused (because we refuse
> to reuse it ourselves and it is too fragmented to be reused by anyone
> else, like the palloc or VM system) only applies to the run-building
> phase.  So "never" was a bit of an overstatement.  By the time the last
> initial run is completely written out to tape, the heap used for the
> priority queue should be totally empty.  So at this point the
> allocator would have the chance to congeal all of the fragmented
> memory back into larger chunks, or maybe it parcels the allocations
> back out again in an order so that the unused space is contiguous and
> could be meaningfully paged out.
>
> But is it worth worrying about how much we fragment memory and whether
> we overshoot our promises by 10 or 20%?
>
>> If so,
>> I'm not so convinced that your benchmark gained 3% from this change.
>> Correct me, as I'm probably wrong.
>
> I've now done more complete testing: building an index on a
> 200,000,000-row table with an integer column populated in random order
> with integers from 1..500,000,000, non-unique, on a machine with 2GB
> of RAM and 600MB of shared_buffers.
>
> It improves things by 6-7 percent at the end of the working mem sizes;
> the rest are in the noise, except at 77936 KB, where it reproducibly
> makes things 4% worse, for reasons I haven't figured out.  So maybe the
> best thing to do is, rather than micromanaging memory usage, simply not
> to set maintenance_work_mem way too low.  (But that is the default.)

I've tested here with only a million rows, a mix of integer/text columns
(table size is 80MB or so), with the default settings, running CREATE
INDEX repeatedly, but couldn't find a performance improvement.  The
numbers vary from -2% to +2%, which I think is just measurement error.

While I agree with your point that avoiding the first dump would make
sense, I guess it depends on the situation; if the dump covers lots of
tuples (which should happen when availMem is big), writing many tuples
at a time may be faster than writing them little by little later.
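
For what it's worth, here is the kind of difference I have in mind, as a
crude standalone experiment (plain write(2) calls, nothing to do with
logtape.c, and the record count and sizes are made up); tuplesort's
writes are buffered by logtape.c, so the real effect would be much
smaller, if it exists at all:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define NTUPLES    1000000
#define TUPLE_SIZE 64
#define BATCH      16384            /* tuples per batched write */

static double seconds(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    char tuple[TUPLE_SIZE];
    struct timespec t0, t1;

    memset(tuple, 'x', sizeof(tuple));

    /* One small write per tuple ("little by little"). */
    int fd = open("dribble.tmp", O_CREAT | O_TRUNC | O_WRONLY, 0600);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NTUPLES; i++)
        if (write(fd, tuple, TUPLE_SIZE) != TUPLE_SIZE)
            return 1;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);
    printf("per-tuple writes: %.2f s\n", seconds(t0, t1));

    /* Roughly the same amount of data, written in large batches. */
    char *buf = malloc((size_t) BATCH * TUPLE_SIZE);
    for (int i = 0; i < BATCH; i++)
        memcpy(buf + (size_t) i * TUPLE_SIZE, tuple, TUPLE_SIZE);

    fd = open("batched.tmp", O_CREAT | O_TRUNC | O_WRONLY, 0600);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NTUPLES / BATCH; i++)
        if (write(fd, buf, (size_t) BATCH * TUPLE_SIZE) < 0)
            return 1;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);
    printf("batched writes:   %.2f s\n", seconds(t0, t1));

    free(buf);
    unlink("dribble.tmp");
    unlink("batched.tmp");
    return 0;
}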

I'm not sure about the conclusion, but given this discussion, I'm
inclined to mark this Returned with Feedback.

Thanks,
--
Hitoshi Harada
