Re: tuplesort memory usage: grow_memtuples

From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: tuplesort memory usage: grow_memtuples
Date: 2012-08-16 23:26:37
Message-ID: CAEYLb_VeZpKDX54VEx3X30oy_UOTh89XoejJW6aucjjiUjskXw@mail.gmail.com
Lists: pgsql-hackers

On 27 July 2012 16:39, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>> Can you suggest a benchmark that will usefully exercise this patch?
>
> I think the given sizes below work on most 64 bit machines.

My apologies for not getting around to taking a look at this sooner.

I tested this patch on my x86_64 laptop:

[peter(at)peterlaptop tests]$ uname -r
3.4.4-4.fc16.x86_64

I have been able to recreate your results with the work_mem setting
you supplied, 16384 (both queries that you suggested are executed
together, for a less sympathetic workload):

[peter(at)peterlaptop tests]$ cat sortmemgrow_sort.sql
select count(distinct foo) from (select random() as foo from
generate_series(1,524200)) asdf;
select count(distinct foo) from (select random() as foo from
generate_series(1,524300)) asdf;
[peter(at)peterlaptop tests]$ # For master:
[peter(at)peterlaptop tests]$ pgbench -f sortmemgrow_sort.sql -T 45 -n
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 45 s
number of transactions actually processed: 45
tps = 0.992526 (including connections establishing)
tps = 0.992633 (excluding connections establishing)
[peter(at)peterlaptop tests]$ # For patch:
[peter(at)peterlaptop tests]$ pgbench -f sortmemgrow_sort.sql -T 45 -n
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 45 s
number of transactions actually processed: 66
tps = 1.461739 (including connections establishing)
tps = 1.461911 (excluding connections establishing)

I didn't find trace_sort all that useful for my earlier work on
optimising in-memory sorting (at least when running benchmarks), as
its granularity is too coarse for simple, relatively inexpensive
queries with sort nodes (or for some tuplesort usage, like the datum
sort of your example). Also, the instrumentation can have a
Heisenberg effect, so I've avoided using it here. The less expensive
query's execution costs are essentially the same with and without the
patch.

This patch works by not simply doubling state->memtupsize when
available memory is less than half of allowed memory (i.e. when we are
one call to grow_memtuples() away from reporting failure). Rather, in
that situation, memtupsize is set by extrapolating the total amount of
memory likely to be needed, in a way that is quite flawed but likely
to work well for the pass-by-value Datum case. Now, on the face of it,
this may appear to be a re-hashing of the question of "how paranoid do
we need to be about wasting memory in memory-constrained situations -
should we consider anything other than a geometric growth rate, to be
parsimonious with memory at the risk of thrashing?". However, this is
not really the case, because this is only a single, last-ditch effort
to avoid a particularly undesirable eventuality. We have little to
lose and quite a bit to gain. A cheap guesstimate seems reasonable.
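
To make the policy concrete, here is a minimal standalone sketch of
the idea (this is not the patch itself; the function name, variable
names and the numbers in main() are purely illustrative):

#include <math.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Illustrative sketch of the growth policy described above.
 * "memtupsize" is the current length of the SortTuple array,
 * "mem_now_used" the memory accounted for so far, and "allowed_mem"
 * the work_mem-style budget.
 */
static int
choose_new_memtupsize(int memtupsize, double mem_now_used,
                      double allowed_mem, bool *final_growth)
{
    *final_growth = false;

    /* While at least half the budget remains free, doubling is safe. */
    if (mem_now_used <= allowed_mem / 2)
        return memtupsize * 2;

    /*
     * Otherwise this is the last chance to grow: extrapolate how many
     * slots the full budget should accommodate, assuming the per-tuple
     * cost observed so far stays constant.  floor() rather than ceil()
     * avoids overshooting by one tuple (see below).
     */
    *final_growth = true;
    return (int) floor(memtupsize * allowed_mem / mem_now_used);
}

int
main(void)
{
    bool final;
    /* e.g. 1024 slots, 10240 bytes used so far, 16384-byte budget */
    int newsize = choose_new_memtupsize(1024, 10240.0, 16384.0, &final);

    printf("new size = %d (final growth: %s)\n",
           newsize, final ? "yes" : "no");
    return 0;
}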

I have attached a revision for your consideration, with a few
editorialisations, mostly style-related.

I removed this entirely:

+ * XXXX is the new method still follow this? The last allocation is no
+ * longer necessarily a power of 2, but that is not freed.

I did so because, according to aset.c:

* Further improvement 12/00: as the code stood, request sizes in the
* midrange between "small" and "large" were handled very inefficiently,
* because any sufficiently large free chunk would be used to satisfy a
* request, even if it was much larger than necessary. This led to more
* and more wasted space in allocated chunks over time. To fix, get rid
* of the midrange behavior: we now handle only "small" power-of-2-size
* chunks as chunks. Anything "large" is passed off to malloc(). Change
* the number of freelists to change the small/large boundary.

So, at the top of grow_memtuples, this remark still holds:

* and so there will be no unexpected addition to what we ask for. (The
* minimum array size established in tuplesort_begin_common is large
* enough to force palloc to treat it as a separate chunk, so this
* assumption should be good. But let's check it.)

It continues to hold because we're still invariably passing off this
request to malloc() (or, in this particular case, realloc())
regardless of the alignment of the request size. Certainly, the extant
code verifies !LACKMEM, and the regression tests pass when this patch
is applied.

However, there is still a danger of LACKMEM. This revision has the
following logic for extrapolating newmemtupsize (this is almost the
same as in the original patch):

+ newmemtupsize = (int) floor(oldmemtupsize * allowedMem / memNowUsed);

Suppose that the code was:

+ newmemtupsize = (int) ceil(oldmemtupsize * allowedMem / memNowUsed);

We'd now have an error with your latter example query, because
!LACKMEM would no longer hold, due to the inclusion of a single
additional SortTuple. This is because GetMemoryChunkSpace (and,
ultimately, AllocSetGetChunkSpace) return the header size too. We were
getting away with that before, with the doubling strategy, but now we
have to factor the header's size into allowedMem.
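
To illustrate with made-up numbers (these are not taken from the patch
or from your test case), here is a toy calculation showing how ceil()
can admit one SortTuple too many once the chunk header reported by
GetMemoryChunkSpace is charged against the budget:

#include <math.h>
#include <stdio.h>

int
main(void)
{
    const double tuple_sz  = 24.0;     /* hypothetical sizeof(SortTuple) */
    const double chunk_hdr = 16.0;     /* hypothetical chunk header size */
    const double allowed   = 10007.0;  /* hypothetical allowedMem */
    const int    oldsize   = 400;      /* current memtupsize */

    double used = oldsize * tuple_sz + chunk_hdr;

    int with_floor = (int) floor(oldsize * allowed / used);
    int with_ceil  = (int) ceil(oldsize * allowed / used);

    /* floor stays within budget; ceil's one extra slot overshoots it */
    printf("floor: %d slots -> %.0f bytes accounted (budget %.0f)\n",
           with_floor, with_floor * tuple_sz + chunk_hdr, allowed);
    printf("ceil:  %d slots -> %.0f bytes accounted (budget %.0f)\n",
           with_ceil, with_ceil * tuple_sz + chunk_hdr, allowed);
    return 0;
}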

I don't think my revision is satisfactory in what it does here. It
isn't obvious what a better-principled implementation would do,
though. Does anyone have any other ideas?

I think this patch (or at least your observation about I/O waits
within vmstat) may point to a more fundamental issue with our sort
code: Why are we not using asynchronous I/O in our implementation?
There are anecdotal reports of other RDBMS implementations doing far
better than we do here, and I believe asynchronous I/O, pipelining,
and other such optimisations have a lot to do with that. It's
something I'd hoped to find the time to look at in detail, but
probably won't in the 9.3 cycle. One of the more obvious ways of
optimising an external sort is to use asynchronous I/O so that one run
of data can be sorted or merged while other runs are being read from
or written to disk. Our current implementation seems naive about this.
There are some interesting details about how this is exposed by POSIX
here:

http://www.gnu.org/software/libc/manual/html_node/Asynchronous-I_002fO.html

It's already anticipated that we might take advantage of libaio for
the benefit of FilePrefetch() (see its accompanying comments; it
currently uses posix_fadvise itself, and effective_io_concurrency must
be > 0 for it ever to be called). It could perhaps be considered
parallel "low-hanging fruit", in that it allows us to offer limited
though useful backend parallelism without first resolving thorny
issues around what abstraction we might use, or how we might
eventually make backends thread-safe. AIO supports registering signal
callbacks (a SIGPOLL handler can be called), which seems relatively
uncontroversial.
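
For what it's worth, here is a minimal, self-contained POSIX AIO
sketch of the overlap idea (compile with -lrt on Linux; "run0.tmp" is
just a hypothetical file standing in for a sorted run on tape). It
queues an asynchronous read and leaves room to merge already-loaded
tuples while the kernel fills the buffer; for brevity it waits with
aio_suspend() rather than the signal notification mentioned above,
though the aiocb's sigevent could request that instead:

#include <aio.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define RUN_BUFSIZE (64 * 1024)

int
main(void)
{
    struct aiocb cb;
    const struct aiocb *cblist[1];
    char *buf = malloc(RUN_BUFSIZE);
    int fd = open("run0.tmp", O_RDONLY);

    if (fd < 0 || buf == NULL)
    {
        perror("setup");
        return 1;
    }

    /* Queue an asynchronous read of (part of) the next run. */
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = RUN_BUFSIZE;
    cb.aio_offset = 0;
    cb.aio_sigevent.sigev_notify = SIGEV_NONE;  /* we poll/suspend instead */

    if (aio_read(&cb) != 0)
    {
        perror("aio_read");
        return 1;
    }

    /*
     * ...sort or merge tuples that are already in memory here; the
     * point is that such work overlaps with the read queued above.
     */

    /* Wait for completion and collect the result. */
    cblist[0] = &cb;
    aio_suspend(cblist, 1, NULL);
    if (aio_error(&cb) == 0)
        printf("read %ld bytes asynchronously\n", (long) aio_return(&cb));

    close(fd);
    free(buf);
    return 0;
}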

Platform support for AIO might be a bit lacking, but then you can say
the same about posix_fadvise. We don't assume that poll(2) is
available, but we already use it where it is within the latch code.
Besides, in-kernel support can be emulated where POSIX threads are
available, which I believe would make this broadly useful on Unix-like
platforms.

--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Attachment: sortmem_grow-v2.patch (application/octet-stream, 4.4 KB)
