Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Date: 2018-02-05 18:03:41
Message-ID: CAH2-WzkepmVTzF4EDJ7T-5urhtpZgL_ZYAV9=SsyKi2YGd_mTw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Feb 5, 2018 at 9:43 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> # LogicalTapeFreeze() may write out its first block when it is dirty but
> # not full, and then immediately read the first block back in from its
> # BufFile as a BLCKSZ-width block. This can only occur in rare cases
> # where next to no tuples were written out, which is only possible with
> # parallel external tuplesorts.
>
> So, if I understand correctly what you're saying here, valgrind is
> totally cool with us writing out an only-partially-initialized block
> to a disk file, but it's got a real problem with us reading that data
> back into the same memory space it already occupies.

That's not quite it. Valgrind is cool with a BufFileWrite(), which
doesn't result in an actual write() because the buffile.c stdio-style
buffer (which isn't where the uninitialized bytes originate from)
isn't yet filled. The actual write() comes later, and that's the point
that Valgrind complains. IOW, Valgrind is cool with copying around
uninitialized memory before we do anything with the underlying values
(e.g., write(), something that affects control flow).

> I presume that it's common for the tail of the final block
> written to be uninitialized, but normally when we then go read block
> 0, that's some other, fully initialized block.

It certainly is common. In the case of logtape.c, we almost always
write out some garbage bytes, even with serial sorts. The only
difference here is the *sense* in which they're garbage: they're
uninitialized bytes, which Valgrind cares about, rather than byte from
previous writes that are left behind in the buffer, which Valgrind
does not care about.

>> It might seem like my suppression is overly broad, or not broad
>> enough, since it essentially targets LogicalTapeFreeze(). I don't
>> think it is, though, because this can occur in two places within
>> LogicalTapeFreeze() -- it can occur in the place we actually saw the
>> issue on lousyjack, from the ltsReadBlock() call within
>> LogicalTapeFreeze(), as well as a second place -- when
>> BufFileExportShared() is called. I found that you have to tweak code
>> to prevent it happening in the first place before you'll see it happen
>> in the second place.
>
> I don't quite see how that would happen, because BufFileExportShared,
> at least AFAICS, doesn't touch the buffer?

It doesn't have to -- at least not directly. Valgrind remembers that
the uninitialized memory from logtape.c buffers are poisoned -- it
"spreads". The knowledge that the bytes are poisoned is tracked as
they're copied around. You get the error on the write() from the
BufFile buffer, despite the fact that you can make the error go away
by using palloc0() instead of palloc() within logtape.c, and nowhere
else.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2018-02-05 18:57:06 Re: WIP: BRIN multi-range indexes
Previous Message Claudio Freire 2018-02-05 17:58:00 Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently