Quick Links

Re: Tuplesort merge pre-reading

From:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To:	Peter Geoghegan <pg(at)heroku(dot)com>
Cc:	Claudio Freire <klaussfreire(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Tuplesort merge pre-reading
Date:	2016-09-29 14:41:25
Message-ID:	b5d82d34-80ee-5c3d-97a0-67ada4f3cf3e@iki.fi
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 09/29/2016 01:52 PM, Peter Geoghegan wrote:
> * Variables like maxTapes have a meaning that is directly traceable
> back to Knuth's description of polyphase merge. I don't think that you
> should do anything to them, on general principle.

Ok. I still think that changing maxTapes would make sense, when there
are fewer runs than tapes, but that is actually orthogonal to the rest
of the patch, so let's discuss that separately. I've changed the patch
to not do that.

> * Everything or almost everything that you've added to mergeruns()
> should probably be in its own dedicated function. This function can
> have a comment where you acknowledge that it's not perfectly okay that
> you claim memory per-tape, but it's simpler and faster overall.

Ok.

> * I think you should be looking at the number of active tapes, and not
> state->Level or state->currentRun in this new function. Actually,
> maybe this wouldn't be the exact definition of an active tape that we
> establish at the beginning of beginmerge() (this considers tapes with
> dummy runs to be inactive for that merge), but it would look similar.
> The general concern I have here is that you shouldn't rely on
> state->Level or state->currentRun from a distance for the purposes of
> determining which tapes need some batch memory. This is also where you
> might say something like: "we don't bother with shifting memory around
> tapes for each merge step, even though that would be fairer. That's
> why we don't use the beginmerge() definition of activeTapes --
> instead, we use our own broader definition of the number of active
> tapes that doesn't exclude dummy-run-tapes with real runs, making the
> allocation reasonably sensible for every merge pass".

I'm not sure I understood what your concern was, but please have a look
at this new version, if the comments in initTapeBuffers() explain that
well enough.

> * The "batchUsed" terminology really isn't working here, AFAICT. For
> one thing, you have two separate areas where caller tuples might
> reside: The small per-tape buffers (sized MERGETUPLEBUFFER_SIZE per
> tape), and the logtape.c controlled preread buffers (sized up to
> MaxAllocSize per tape). Which of these two things is batch memory? I
> think it might just be the first one, but KiBs of memory do not
> suggest "batch" to me. Isn't that really more like what you might call
> double buffer memory, used not to save overhead from palloc (having
> many palloc headers in memory), but rather to recycle memory
> efficiently? So, these two things should have two new names of their
> own, I think, and neither should be called "batch memory" IMV. I see
> several assertions remain here and there that were written with my
> original definition of batch memory in mind -- assertions like:

Ok. I replaced "batch" terminology with "slab allocator" and "slab
slots", I hope this is less confusing. This isn't exactly like e.g. the
slab allocator in the Linux kernel, as it has a fixed number of slots,
so perhaps an "object pool" might be more accurate. But I like "slab"
because it's not used elsewhere in the system. I also didn't use the
term "cache" for the "slots", as might be typical for slab allocators,
because "cache" is such an overloaded term.

> case TSS_SORTEDONTAPE:
> Assert(forward || state->randomAccess);
> Assert(!state->batchUsed);
>
> (Isn't state->batchUsed always true now?)

Good catch. It wasn't, because mergeruns() set batchUsed only after
checking for the TSS_SORTEDONTAPE case, even though it set up the batch
memory arena before it. So if replacement selection was able to produce
a single run batchUsed was false. Fixed, the slab allocator (as it's now
called) is now always used in TSS_SORTEDONTAPE case.

> And:
>
> case TSS_FINALMERGE:
> Assert(forward);
> Assert(state->batchUsed || !state->tuples);
>
> (Isn't state->tuples only really of interest to datum-tuple-case
> routines, now that you've made everything use large logtape.c preread
> buffers?)

Yeah, fixed.

> * Is is really necessary for !state->tuples cases to do that
> MERGETUPLEBUFFER_SIZE-based allocation? In other words, what need is
> there for pass-by-value datum cases to have this scratch/double buffer
> memory at all, since the value is returned to caller by-value, not
> by-reference? This is related to the problem of it not being entirely
> clear what batch memory now is, I think.

True, fixed. I still set slabAllocatorUsed (was batchUsed), but it's
initialized as a dummy 0-slot arena when !state->tuples.

Here's a new patch version, addressing the points you made. Please have
a look!

- Heikki

Attachment	Content-Type	Size
0001-Change-the-way-pre-reading-in-external-sort-s-merge-3.patch	text/x-patch	72.4 KB

In response to

Re: Tuplesort merge pre-reading at 2016-09-29 10:52:40 from Peter Geoghegan

Responses

Re: Tuplesort merge pre-reading at 2016-09-29 15:10:03 from Heikki Linnakangas

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Alvaro Herrera	2016-09-29 14:43:25	Re: Set log_line_prefix and application name in test drivers
Previous Message	Tom Lane	2016-09-29 14:38:49	Re: Bug in to_timestamp().