Re: logtape.c stats don't account for unused "prefetched" block numbers

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: logtape.c stats don't account for unused "prefetched" block numbers
Date: 2020-09-05 19:03:43
Message-ID: CAH2-Wzkt-w+Puq_RKXJEaPdCuY-jw23VT7wFKh-XgBRm3_RDaA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Sep 1, 2020 at 5:24 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Tue, Sep 1, 2020 at 4:36 PM Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
> > This open item hasn't received any replies. I think Peter knows how to
> > fix it already, but no patch has been posted ... It'd be good to get a
> > move on it.
>
> I picked this up again today.

One easy way to get logtape.c to behave in the same way as Postgres 12
for a multi-pass external sort (i.e. to use fewer blocks and to report
the number of blocks used accurately) is to #define
TAPE_WRITE_PREALLOC_MIN and TAPE_WRITE_PREALLOC_MAX to 1. So it looks
like the problem is in the preallocation stuff added by commit
896ddf9b3cd, and not the new heap-based free list logic added by
commit c02fdc92230. That's good news, because it means that the
problem may be fairly well isolated -- commit 896ddf9b3cd was a pretty
small and isolated thing.

The comments in ltsWriteBlock() added by the 2017 bugfix commit
7ac4a389a7d clearly say that the zero block writing stuff is only
supposed to happen at the edge of a tape boundary, which ought to be
rare -- see the comment block in ltsWriteBlock(). And yet the new
preallocation stuff explicitly relies on that it writing zero blocks
much more frequently. I'm concerned that that can result in increased
and unnecessary I/O, especially for sorts, but also for hash aggs that
spill. I'm also concerned that having preallocated-but-allocated
blocks confuses the accounting used by
trace_sort/LogicalTapeSetBlocks().

Separately, it's possible to make the
trace_sort/LogicalTapeSetBlocks() instrumentation agree with the
filesystem by replacing the use of nBlocksAllocated within
LogicalTapeSetBlocks() with nBlocksWritten -- that seems to make the
instrumentation correct without changing the current behavior at all.
But I'm not ready to endorse that approach, since it's not quite clear
what nBlocksAllocated and nBlocksWritten mean right now -- those two
fields were both added by the aforementioned 2017 bugfix commit, which
introduced the "allocated vs written" distinction in the first place.

We should totally disable the preallocation stuff for external sorts
in any case. External sorts are naturally characterized by relatively
large, distinct batching of reads and writes -- preallocation cannot
help.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Justin Pryzby 2020-09-05 21:06:58 Re: v13: show extended stats target in \d
Previous Message Ranier Vilela 2020-09-05 18:58:51 Re: A micro-optimisation for walkdir()