Re: Subtle bug in "Simplify tape block format" commit

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Subtle bug in "Simplify tape block format" commit
Date: 2017-01-30 12:04:18
Message-ID: 2044d466-beb9-b2ef-aec8-408a8ca4c5d0@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 01/07/2017 12:33 AM, Peter Geoghegan wrote:
> Previously, all calls to ltsGetFreeBlock() were immediately followed
> by a corresponding call to ltsWriteBlock(); we wrote out the
> newly-allocated-in-first-pass block there and then. It's a good idea
> for a corresponding ltsWriteBlock() call to happen immediately, and
> it's *essential* for it to happen before any later block is written
> during the first write pass (when tuples are initially dumped out to
> form runs), since, as noted above ltsWriteBlock():
>
> /*
> * Write a block-sized buffer to the specified block of the underlying file.
> *
> * NB: should not attempt to write beyond current end of file (ie, create
> * "holes" in file), since BufFile doesn't allow that. The first write pass
> * must write blocks sequentially.
> ...

I completely missed that rule :-(.

> However, a "write beyond current end of file" now seems to actually be
> attempted at times, resulting in an arcane error during sorting. This
> is more or less because LogicalTapeWrite() doesn't heed the warning
> from ltsGetFreeBlock(), as seen here:
>
> while (size > 0)
> {
> if (lt->pos >= TapeBlockPayloadSize)
> {
> ...
>
> /*
> * First allocate the next block, so that we can store it in the
> * 'next' pointer of this block.
> */
> nextBlockNumber = ltsGetFreeBlock(lts);
>
> /* set the next-pointer and dump the current block. */
> TapeBlockGetTrailer(lt->buffer)->next = nextBlockNumber;
> ltsWriteBlock(lts, lt->curBlockNumber, (void *) lt->buffer);
> ...
> }
> ...
>
> }
>
> Note that LogicalTapeWrite() now allocates each block (calls
> ltsGetFreeBlock()) one block in advance of current block, immediately
> before *current* block is written (not the next block/block just
> allocated). So, the corresponding ltsWriteBlock() call *must* happen
> at an unspecified time in the future, typically the next time control
> ends up at exactly the same point, where the block that was "next"
> becomes "current".
>
> I'm not about to argue that we should go back to following this
> ltsGetFreeBlock() rule, though; I can see why Heikki refactored
> LogicalTapeWrite() to allocate blocks a block in advance. Still, we
> need to be more careful in avoiding the underlying problem that the
> ltsGetFreeBlock() rule was intended to prevent. Attached patch 0001-*
> has logtape.c be sure to write out a tape's buffer every time
> tuplesort ends a run.

Hmm. The LogicalTapeEndWriting() function is very similar to the
LogicalTapePause() function I had in the pause/resume patch. They both
flush the last block to disk. The difference is that LogicalTapePause()
also free'd the buffer, and read back the last block from the disk if
you continued writing, while LogicalTapeEndWriting() keeps a copy of it
in memory.

With the proposed fix (or with the pause/resume), you can only write to
a single tape at a time. Not a problem at the moment, but something to
consider. At least, would need more comments to make that more clear,
and an Assert would be nice.

Alternatively, we could fix this with a small change in ltsWriteBlock(),
see attached patch. When we're about to create a hole in the file, write
all-zero blocks to avoid the creating hole, before the block itself.
That's not quite as efficient as writing the actual block contents into
the hole, which avoids having to write it later, but probably won't make
any measurable difference in practice. I'm inclined to do that, because
it relaxes the rules on what you're allowed to do, in what order, which
makes this more robust in general. We coul *also* have something like
LogicalTapeEndWriting() or LogicalTapePause(), for efficiency, but it
doesn't seem that important.

> I have a test case.
> ...

Thanks for the analysis!

- Heikki

Attachment Content-Type Size
fix-out-of-order-ltsWriteBlock-1.patch text/plain 3.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2017-01-30 12:05:18 Re: Push down more full joins in postgres_fdw
Previous Message tushar 2017-01-30 11:36:28 Re: Parallel Index-only scan