Re: Tuplesort merge pre-reading

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Tuplesort merge pre-reading
Date: 2016-09-11 22:13:55
Message-ID: CAM3SWZQrU12mog=TG9jvNOpeUL5Krk9N1Y8GmsLGTGRBJui-aQ@mail.gmail.com

On Sun, Sep 11, 2016 at 8:47 AM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> Here's a new version of these patches, rebased over current master. I
> squashed the two patches into one, there's not much point to keep them
> separate.

I think I have my head fully around this now. For some reason, I
initially thought that this patch was a great deal more radical than
it actually is. (Like Greg, I initially thought that you were
rejecting the idea of batch memory in general, and instead (over)
leveraging the filesystem cache. I think I misunderstood your remarks
when we talked about it on IM early on.)

I don't know what the difference is between accessing 10 pages
randomly, and accessing a random set of 10 single pages sequentially,
in close succession. As Tom would say, that's above my pay grade. I
suppose it comes down to how close "close" actually is (but in any
case, it's all very fudged).

I mention this because I think that cost_sort() should be updated to
consider sequential I/O the norm, alongside this patch of yours (your
patch strengthens the argument [1] for that general idea). The reason
that this new approach produces more sequential I/O, apart from the
simple fact that you effectively have much more memory available and
so fewer rounds of preloading, is that the indirection block stuff can
make I/O less sequential in order to support eager reclamation of
space. For example, maybe there is interleaving of blocks as logtape.c
manages to reclaim blocks in the event of multiple merge steps. I care
about that second factor a lot more now than I would have a year ago,
when a final on-the-fly merge generally avoided multiple passes (and
the associated logtape.c block fragmentation), because parallel CREATE
INDEX is usually affected by that (workers will often want to do their
own merge ahead of the leader's final merge), and because I want to
cap the number of tapes used, which will make multiple passes a bit
more common in practice.
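
For reference -- quoting from memory here, so double-check costsize.c
-- the disk cost component of cost_sort() currently charges a fixed
blend of sequential and random page costs, roughly:

    /* Assume 3/4ths of accesses are sequential, 1/4th are not */
    startup_cost += npageaccesses *
        (seq_page_cost * 0.75 + random_page_cost * 0.25);

Making sequential I/O the norm would presumably mean shrinking or
dropping the random_page_cost term there, but that's a separate patch.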

I was always suspicious of the fact that memtuples is so large during
merging, and had a vague plan to fix that (although I was the one
responsible for growing memtuples even more for the merge in 9.6, that
was just because, under the status quo of having many memtuples during
the merge, the size of memtuples should at least be in balance with
remaining memory for caller tuples -- it wasn't an endorsement of the
status quo). However, it never occurred to me to do that by pushing
batch memory into the head of logtape.c, which now seems like an
excellent idea.

To summarize my understanding of this patch: If it wasn't for my work
on parallel CREATE INDEX, I would consider this patch to give us only
a moderate improvement to user-visible performance, since it doesn't
really help memory rich cases very much (cases that are not likely to
have much random I/O anyway). In that universe, I'd be more
appreciative of the patch as a simplifying refactoring exercise. While
it's nice to get a boost for more memory-constrained cases, that's not
a huge win. However, that's not the universe we live
in -- I *am* working on parallel CREATE INDEX, of course. And so, I
really am very excited about this patch, because it really helps with
the particular challenges I have there, even though it's fair to
assume that we have a reasonable amount of memory available when
parallelism is used. If workers are going to do their own merges, as
they often will, then multiple merge pass cases become far more
important, and any additional I/O is a real concern, *especially*
additional random I/O (say from logtape.c fragmentation). The patch
directly addresses that, which is great. Your patch and my
just-committed patch concerning how we maintain the heap invariant
together attack the merge bottleneck from two different directions:
they address I/O costs and CPU costs, respectively.

Other things I noticed:

* You should probably point out that typically, access to batch memory
will still be sequential, despite your block-based scheme. The
preloading will now more or less make that the normal case. Any
fragmentation will now be essentially in memory, not on disk, which is
a big win.

* I think that logtape.c header comments are needed for this. Maybe
that's where you should point out that memory access is largely
sequential. But it's surely true that logtape.c needs to take
responsibility for being the place where the vast majority of memory
is allocated during merging.

* I think you should move "bool *mergeactive; /* active input run
source? */" within Tuplesortstate to be next to the other batch memory
stuff. No point in having separate merge and batch "sections" there
anymore.

* You didn't carry over my 0002-* batch memory patch modifications to
comments, even though you should have in a few cases. There remain
some references in comments to batch memory as a thing exclusively
usable by final on-the-fly merges. That's not true anymore -- it's
usable by final merges that materialize their output, too. For
example, readtup_alloc() still references the final on-the-fly merge.

* You also fail to take credit in the commit message for making batch
memory usable when returning caller tuples to callers that happen to
request "randomAccess" (so, I guess the aforementioned comments above
routines like readtup_alloc() shouldn't even refer to merging, unless
it's to say that non-final merges are not supported due to their
unusual requirements). My patch didn't go that far: I only addressed
the final merge itself, not the subsequent access to tuples when
reading from the materialized final output tape in the
TSS_SORTEDONTAPE case. But that's actually really useful for
randomAccess callers, above and beyond what I proposed (which, in any
case, was mostly written with parallel workers in mind, and those
never do TSS_SORTEDONTAPE processing).

* Furthermore, readtup_alloc() will not just be called in WRITETUP()
routines -- please update comments.

* There is a very subtle issue here:

> + /*
> + * We no longer need a large memtuples array, only one slot per tape. Shrink
> + * it, to make the memory available for other use. We only need one slot per
> + * tape.
> + */
> + pfree(state->memtuples);
> + FREEMEM(state, state->memtupsize * sizeof(SortTuple));
> + state->memtupsize = state->maxTapes;
> + state->memtuples = (SortTuple *) palloc(state->maxTapes * sizeof(SortTuple));
> + USEMEM(state, state->memtupsize * sizeof(SortTuple));

The FREEMEM() call needs to count the chunk overhead in both cases. In
short, I think you need to copy the GetMemoryChunkSpace() stuff that
you see within grow_memtuples().
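
That is, something like this sketch, mirroring what grow_memtuples()
does -- charge and credit the actual chunk sizes, not the requested
sizes:

    FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
    pfree(state->memtuples);
    state->memtupsize = state->maxTapes;
    state->memtuples = (SortTuple *) palloc(state->maxTapes * sizeof(SortTuple));
    USEMEM(state, GetMemoryChunkSpace(state->memtuples));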

* Whitespace issue here:

> @@ -2334,7 +2349,8 @@ inittapes(Tuplesortstate *state)
> #endif
>
> /*
> - * Decrease availMem to reflect the space needed for tape buffers; but
> + * Decrease availMem to reflect the space needed for tape buffers, when
> + * writing the initial runs; but
> * don't decrease it to the point that we have no room for tuples. (That
> * case is only likely to occur if sorting pass-by-value Datums; in all
> * other scenarios the memtuples[] array is unlikely to occupy more than
> @@ -2359,14 +2375,6 @@ inittapes(Tuplesortstate *state)
> state->tapeset = LogicalTapeSetCreate(maxTapes);

* I think that you need to comment on why state->tuplecontext is not
used for batch memory now. It is still useful for multiple merge
passes, but the situation has notably changed for it.
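
Perhaps something along these lines (only a sketch of the comment I
have in mind, to be placed wherever batch memory is discussed):

    /*
     * Note: most memory used for caller tuples during merging now
     * comes from logtape.c's per-tape read buffers, not from
     * tuplecontext.  tuplecontext remains useful for multiple merge
     * passes, though.
     */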

* Doesn't this code need to call MemoryContextAllocHuge() rather than palloc()?:

> @@ -709,18 +765,19 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
> Assert(lt->frozen);
> datablocknum = ltsRewindFrozenIndirectBlock(lts, lt->indirect);
> }
> +
> + /* Allocate a read buffer */
> + if (lt->buffer)
> + pfree(lt->buffer);
> + lt->buffer = palloc(lt->read_buffer_size);
> + lt->buffer_size = lt->read_buffer_size;
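
That is, something like this -- a sketch, assuming the buffer should
simply go in the current memory context, as with palloc():

    /* Allocate a read buffer; it may exceed MaxAllocSize */
    if (lt->buffer)
        pfree(lt->buffer);
    lt->buffer = (char *) MemoryContextAllocHuge(CurrentMemoryContext,
                                                 lt->read_buffer_size);
    lt->buffer_size = lt->read_buffer_size;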

* Typo:

> +
> + /*
> + * from this point on, we no longer use the usemem()/lackmem() mechanism to
> + * track memory usage of indivitual tuples.
> + */
> + state->batchused = true;

* Please make this use the "+ 1023" natural rounding trick that is
used in the similar traces that are removed:

> +#ifdef TRACE_SORT
> + if (trace_sort)
> + elog(LOG, "using %d kB of memory for read buffers in %d tapes, %d kB per tape",
> + (int) (state->availMem / 1024), maxTapes, (int) (per_tape * BLCKSZ) / 1024);
> +#endif
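
In other words, roughly:

    #ifdef TRACE_SORT
        if (trace_sort)
            elog(LOG, "using " INT64_FORMAT " kB of memory for read buffers in %d tapes, %d kB per tape",
                 (state->availMem + 1023) / 1024, maxTapes,
                 (int) ((per_tape * BLCKSZ + 1023) / 1024));
    #endif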

* It couldn't hurt to make this code paranoid about LACKMEM() becoming
true, which will cause havoc (we saw this recently in 9.6; a patch of
mine to fix that just went in):

> + /*
> + * Use all the spare memory we have available for read buffers. Divide it
> + * memory evenly among all the tapes.
> + */
> + avail_blocks = state->availMem / BLCKSZ;
> + per_tape = avail_blocks / maxTapes;
> + cutoff = avail_blocks % maxTapes;
> + if (per_tape == 0)
> + {
> + per_tape = 1;
> + cutoff = 0;
> + }
> + for (tapenum = 0; tapenum < maxTapes; tapenum++)
> + {
> + LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
> + (per_tape + (tapenum < cutoff ? 1 : 0)) * BLCKSZ);
> + }

In other words, we really don't want availMem to become < 0, since
it's int64, but a derived value is passed to
LogicalTapeAssignReadBufferSize() as an argument of type "Size". Now,
if LACKMEM() did happen it would be a bug anyway, but I recommend that
defensive code be added as well. Per grow_memtuples(), "We need to be
sure that we do not cause LACKMEM to become true, else the space
management algorithm will go nuts". Let's be sure that we get that
right, especially since, as we saw recently, grow_memtuples() will not
actually have the chance to save us here (if there is a bug along
these lines, let's at least raise the right "can't happen" error for
the user when it pops up).
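
Concretely, I'm thinking of a check like this right after the buffer
sizes are assigned (a sketch):

    /*
     * Sanity check: the accounting above must not have made LACKMEM()
     * true, else the space management algorithm will go nuts.
     */
    if (LACKMEM(state))
        elog(ERROR, "unexpected out-of-memory condition in tuplesort");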

* It looks like your patch makes us less eager about freeing per-tape
batch memory, now held as preload buffer space within logtape.c.

ISTM that there should be some way for the "tape exhausted" code path
within tuplesort_gettuple_common() (as well as the similar spot within
mergeonerun()) to instruct logtape.c that we're done with that tape.
In other words, when mergeprereadone() (now renamed to
mergereadnext()) detects that the tape is exhausted, it should have
logtape.c free its huge tape buffer immediately. Think about cases
where the input is presorted, and earlier tapes can be freed quite
early on. That behavior is worth keeping (you removed the old way that
this happened, through mergebatchfreetape()). Perhaps something like
the sketch below.
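
A sketch of what I mean, in the tuplesort_gettuple_common() tape
exhaustion path -- note that LogicalTapeReleaseBuffer() is a name I
just made up for the hypothetical new logtape.c entry point:

    if (!mergereadnext(state, srcTape, &newtup))
    {
        /*
         * This tape is exhausted: have logtape.c pfree() its large
         * read buffer now, rather than holding on to it until the
         * merge completes.  (Hypothetical API.)
         */
        LogicalTapeReleaseBuffer(state->tapeset, srcTape);
    }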

That's all I have right now. I like the direction this is going in,
but I think this needs more polishing.

[1] https://www.postgresql.org/message-id/CAM3SWZQLP6e=1si1NcQjYft7R-VYpprrf_i59tZOZX5m7VFK-w@mail.gmail.com
--
Peter Geoghegan
