Subtle bug in "Simplify tape block format" commit

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)iki(dot)fi>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Subtle bug in "Simplify tape block format" commit
Date: 2017-01-06 22:33:26
Message-ID: CAM3SWZRWdNtkhiG0GyiX_1mUAypiK3dV6-6542pYe2iEL-foTA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

It seems that commit 01ec2563 has a subtle bug, which stems from the
fact that logtape.c no longer follows the rule described above
ltsGetFreeBlock():

/*
* Select a currently unused block for writing to.
*
* NB: should only be called when writer is ready to write immediately,
* to ensure that first write pass is sequential.
*/
static long
ltsGetFreeBlock(LogicalTapeSet *lts)
{
...
}

Previously, all calls to ltsGetFreeBlock() were immediately followed
by a corresponding call to ltsWriteBlock(); we wrote out the
newly-allocated-in-first-pass block there and then. It's a good idea
for a corresponding ltsWriteBlock() call to happen immediately, and
it's *essential* for it to happen before any later block is written
during the first write pass (when tuples are initially dumped out to
form runs), since, as noted above ltsWriteBlock():

/*
* Write a block-sized buffer to the specified block of the underlying file.
*
* NB: should not attempt to write beyond current end of file (ie, create
* "holes" in file), since BufFile doesn't allow that. The first write pass
* must write blocks sequentially.
...

However, a "write beyond current end of file" now seems to actually be
attempted at times, resulting in an arcane error during sorting. This
is more or less because LogicalTapeWrite() doesn't heed the warning
from ltsGetFreeBlock(), as seen here:

while (size > 0)
{
if (lt->pos >= TapeBlockPayloadSize)
{
...

/*
* First allocate the next block, so that we can store it in the
* 'next' pointer of this block.
*/
nextBlockNumber = ltsGetFreeBlock(lts);

/* set the next-pointer and dump the current block. */
TapeBlockGetTrailer(lt->buffer)->next = nextBlockNumber;
ltsWriteBlock(lts, lt->curBlockNumber, (void *) lt->buffer);
...
}
...

}

Note that LogicalTapeWrite() now allocates each block (calls
ltsGetFreeBlock()) one block in advance of current block, immediately
before *current* block is written (not the next block/block just
allocated). So, the corresponding ltsWriteBlock() call *must* happen
at an unspecified time in the future, typically the next time control
ends up at exactly the same point, where the block that was "next"
becomes "current".

I'm not about to argue that we should go back to following this
ltsGetFreeBlock() rule, though; I can see why Heikki refactored
LogicalTapeWrite() to allocate blocks a block in advance. Still, we
need to be more careful in avoiding the underlying problem that the
ltsGetFreeBlock() rule was intended to prevent. Attached patch 0001-*
has logtape.c be sure to write out a tape's buffer every time
tuplesort ends a run.

I have a test case. Build Postgres with only 0002-* applied, which
makes MAX_PHYSICAL_FILESIZE far smaller than its current value
(current value is 2^30, while this reduces it to BLCKSZ). Then:

postgres=# set replacement_sort_tuples = 0; set work_mem = 64;
SET
SET
postgres=# with a as (select repeat('a', 7000) i from
generate_series(1, 7)) select * from a order by i;
ERROR: XX000: could not write block 5 of temporary file: Success
LOCATION: ltsWriteBlock, logtape.c:204

This "block 5" corresponds to an fd.c file/BufFile segment that does
not in fact exist when this error is raised. Since BufFiles multiplex
BLCKSZ-sized segment files in this build of Postgres, it's quite
likely, in general, that writes marking the end of a run will straddle
"segment boundaries", which is what it takes for buffile.c/logtape.c
to throw this error. (The BufFile contract does not allow "holes" in
files, but segment boundaries must be crossed as the critical point
(just before a tape switch) to actually see this error -- you'd have
to be very unlucky to have things line up in exactly the wrong way
with 1GiB BufFile segments, but it can still happen.)

--
Peter Geoghegan

Attachment Content-Type Size
0002-Reduce-MAX_PHYSICAL_FILESIZE-value-to-BLCKSZ.patch text/x-patch 996 bytes
0001-Fix-bug-in-logical-tapeset-free-block-management.patch text/x-patch 5.7 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2017-01-06 23:31:02 Re: Support for pg_receivexlog --format=plain|tar
Previous Message Claudio Freire 2017-01-06 22:18:45 Re: Block level parallel vacuum WIP