Re: Automatically setting work_mem

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Automatically setting work_mem
Date: 2006-03-21 20:05:50
Message-ID: 1142971550.24487.499.camel@localhost.localdomain
Lists: pgsql-hackers pgsql-patches

On Fri, 2006-03-17 at 09:46 -0500, Tom Lane wrote:
> "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu> writes:
> > So what's the difference between these two strategy?
> > (1) Running time: do they use the same amount of memory? Why option 2 is
> > better than 1?
> > (2) Idle time: after sort done, option 1 will return all 1024 to the OS and
> > 2 will still keep 512?
>
> Point 2 is actually a serious flaw in Simon's proposal, because there
> is no portable way to make malloc return freed memory to the OS. Some
> mallocs will do that ... in some cases ... but many simply don't ever
> move the brk address down. It's not an easy thing to do when the arena
> gets cluttered with a lot of different alloc chunks and only some of
> them get freed.

I'm aware of that objection and agree it's an issue...
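
To make that concrete, here is a minimal standalone sketch (assuming a
glibc-style malloc on Linux; it is purely illustrative and has nothing to
do with the patch): after allocating many small chunks and freeing half of
them, the program break usually does not move back down, because the arena
is left cluttered.

/*
 * Illustration only: small chunks come from the brk heap, and freeing
 * every other one leaves the arena fragmented, so the break rarely
 * moves back down.  Behaviour is allocator-specific.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NCHUNKS 100000

int
main(void)
{
	static void *chunks[NCHUNKS];
	char	   *brk_start = sbrk(0);
	char	   *brk_full;
	char	   *brk_after;
	int			i;

	for (i = 0; i < NCHUNKS; i++)
		chunks[i] = malloc(512);
	brk_full = sbrk(0);

	for (i = 0; i < NCHUNKS; i += 2)
		free(chunks[i]);		/* free half: arena now cluttered */
	brk_after = sbrk(0);

	printf("brk grew %ld bytes; shrank only %ld after freeing half\n",
		   (long) (brk_full - brk_start), (long) (brk_full - brk_after));
	return 0;
}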

One of the situations I am looking at is larger queries with multiple
sorts in them. I'm getting some reasonable results for the final merge
even after releasing most of the memory (say 50-90%). The memtuples
array is not required at all for randomAccess sorts (a 100% reduction).

The largest memory requirement is run building during performsort. That
portion of the code is not executed concurrently within the same query.
If we can reduce memory usage once that phase completes, then we stand a
chance of avoiding the situation where a big query over-allocates memory
and is never able to reclaim it.
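
As a purely hypothetical sketch of that kind of release (the function name
is invented; the field names and the FREEMEM/GetMemoryChunkSpace idioms
follow tuplesort.c conventions, so take the details as an assumption
rather than as the actual patch), run inside tuplesort.c once performsort
has completed:

/*
 * HYPOTHETICAL sketch, not the actual patch: once run building has
 * finished and the result has been materialized onto tape (the
 * randomAccess case), the in-memory tuple array is no longer needed,
 * so hand its space back to the sort's memory accounting and the
 * allocator (subject to the malloc caveat above).
 */
static void
tuplesort_release_memtuples(Tuplesortstate *state)
{
	if (state->memtuples != NULL)
	{
		FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
		pfree(state->memtuples);
		state->memtuples = NULL;
		state->memtupsize = 0;
	}
}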

So overall memory usage could be as low as work_mem + (numsorts * 0.1 *
work_mem), which is a lot less than numsorts * work_mem; e.g. with three
sorts in a query, about 130% of work_mem rather than 300% of work_mem.

> So the semantics we'd have to adopt is that once a backend claims some
> "shared work mem", it keeps it until process exit. I don't think that
> makes the idea worthless, because there's usually a clear distinction
> between processes doing expensive stuff and processes doing cheap
> stuff. But it's definitely a limitation.

...hopefully less so with the memory-reduction changes.

The other way is of course to allocate *all* sort/hash/big stuff space
out of shared memory and then let all the backends fight it out
(somehow...) to see who gets access to it. That way backends stay small
and we have well-bounded memory usage.
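
To sketch just one way that arbitration might work (all names here are
invented and this is not an existing API; only the spinlock calls are the
usual PostgreSQL primitives), a backend could try to reserve part of a
global budget before a big sort or hash step and fall back to a small
local allowance if the budget is exhausted:

/*
 * HYPOTHETICAL sketch of shared-memory arbitration, not an existing API.
 * A single budget lives in shared memory; backends reserve from it
 * before a big sort/hash step and give it back afterwards.
 */
typedef struct SharedWorkMem
{
	slock_t		mutex;			/* protects remaining_kb */
	long		remaining_kb;	/* unreserved part of the global budget */
} SharedWorkMem;

static long
shared_work_mem_reserve(SharedWorkMem *swm, long want_kb, long fallback_kb)
{
	long		got;

	SpinLockAcquire(&swm->mutex);
	if (swm->remaining_kb >= want_kb)
	{
		swm->remaining_kb -= want_kb;
		got = want_kb;
	}
	else
		got = fallback_kb;		/* budget exhausted: stay small locally */
	SpinLockRelease(&swm->mutex);

	return got;
}

static void
shared_work_mem_release(SharedWorkMem *swm, long kb)
{
	SpinLockAcquire(&swm->mutex);
	swm->remaining_kb += kb;
	SpinLockRelease(&swm->mutex);
}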

> Also, if you've got a process
> doing expensive stuff, it's certainly possible to expect the user to
> just increase work_mem locally.

Doing that is fine, but you then have to retune the system every time
memory usage changes for any reason. So most of the time manual tuning
has to be very conservative to avoid getting it wrong.

> My own thoughts about the problems with our work_mem arrangement are
> that the real problem is the rule that we can allocate work_mem per sort
> or hash operation; this makes the actual total memory use per backend
> pretty unpredictable for nontrivial queries. I don't know how to fix
> this though. The planner needs to know the work_mem that will be used
> for any one of these operations in order to estimate costs, so simply
> trying to divide up work_mem among the operations of a completed plan
> tree is not going to improve matters.

We spent about 5 hours discussing that, and the best answer was "use
queuing"...

= = = =

Anyway, thinking just about sort, I've got the following concrete
suggestions (first two of which coded and tested):

1. I originally picked MERGE_BUFFER_SIZE at 32 blocks as a guess. Better
test results show that is indeed the optimum when we take into account
both intermediate and final merging. However, the preferred buffer size
would be about 256 blocks in the case that Nruns << Ntapes, i.e. when
work_mem is set high. In that case memory is reallocated to reduce the
overall usage after "performsort done" has happened.
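
A hypothetical sketch of that choice (the block counts come from the
figures above; the function itself and the 4x threshold standing in for
"Nruns << Ntapes" are illustrative assumptions, not the patch):

/*
 * HYPOTHETICAL sketch: 32-block preread buffers for ordinary merges,
 * but ~256-block buffers when there are far fewer runs than tapes
 * (the high work_mem case), since the few active tapes can each be
 * given much more memory.  The 4x threshold is an arbitrary stand-in
 * for "Nruns << Ntapes".
 */
#define MERGE_BUFFER_BLOCKS			32
#define WIDE_MERGE_BUFFER_BLOCKS	256

static long
choose_merge_buffer_size(int nruns, int ntapes)
{
	if (nruns * 4 <= ntapes)
		return (long) WIDE_MERGE_BUFFER_BLOCKS * BLCKSZ;
	return (long) MERGE_BUFFER_BLOCKS * BLCKSZ;
}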

2. When a tape runs out of tuples, its memory is reallocated to the
remaining tapes to increase their I/O efficiency. This should help to
increase performance for smaller work_mem settings with large sorts, or
anything where the final merge Nruns is close to Ntapes. You can see
this occurring in the example below. The reallocation is either done
uniformly or all onto a single tape, depending upon the preread pattern.
This mostly doesn't occur with well-sorted output, so there is little
overhead from doing this in the general case.
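
A hypothetical sketch of the reallocation itself (all names invented;
this is separate from the traced example the text refers to):

/*
 * HYPOTHETICAL sketch: when one input tape is exhausted during the
 * final merge, its preread budget is handed to the tapes still in
 * play, either spread uniformly or all given to the tape the preread
 * pattern shows is being consumed fastest.
 */
static void
redistribute_tape_memory(long freed_bytes, long *tape_budget,
						 const bool *tape_active, int ntapes,
						 bool uniform, int hungriest_tape)
{
	int			i;

	if (uniform)
	{
		int			nactive = 0;

		for (i = 0; i < ntapes; i++)
			if (tape_active[i])
				nactive++;
		if (nactive == 0)
			return;
		for (i = 0; i < ntapes; i++)
			if (tape_active[i])
				tape_budget[i] += freed_bytes / nactive;
	}
	else
		tape_budget[hungriest_tape] += freed_bytes;
}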

Right now, large sort performance is very good, whereas smaller sort
performance is still fairly bad.

3. We implement a new merge algorithm, as Tom suggested...

Best Regards, Simon Riggs
