Re: Unhappy about API changes in the no-fsm-for-small-rels patch

From: Andres Freund <andres(at)anarazel(dot)de>
To: John Naylor <john(dot)naylor(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Subject: Re: Unhappy about API changes in the no-fsm-for-small-rels patch
Date: 2019-04-17 15:59:35
Message-ID: 20190417155935.ujpbxsrr5mzlqtqr@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2019-04-17 13:09:05 +0800, John Naylor wrote:
> On Wed, Apr 17, 2019 at 2:04 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> > Hi,
> >
> > I'm somewhat unhappy in how much the no-fsm-for-small-rels exposed
> > complexity that looks like it should be purely in freespacemap.c to
> > callers.
> >
> >
> > extern Size GetRecordedFreeSpace(Relation rel, BlockNumber heapBlk);
> > -extern BlockNumber GetPageWithFreeSpace(Relation rel, Size spaceNeeded);
> > +extern BlockNumber GetPageWithFreeSpace(Relation rel, Size spaceNeeded,
> > + bool check_fsm_only);
> >
> > So now freespace.c has an argument that says we should only check the
> > fsm. That's confusing. And it's not explained to callers what that
> > argument means, and when it should be set.
>
> When first looking for free space, it's "false": Within
> GetPageWithFreeSpace(), we call RelationGetNumberOfBlocks() if the FSM
> returns invalid.
>
> If we have to extend, after acquiring the lock to extend the relation,
> we call GetPageWithFreeSpace() again to see if another backend already
> extended while waiting on the lock. If there's no FSM, the thinking
> is, it's not worth it to get the number of blocks again.

I can get that (after reading through the code, grepping through all
callers, etc), but it means that every call site needs to understand
that. That makes the API more complicated than necessary, especially
when we're going to grow more callers.
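To make concrete what I mean, here's the two call patterns reduced to a
toy model (all names and the stand-in fallback are mine, not
freespace.c's; this is just a sketch of the control flow, not the
actual code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Stand-in for the FSM lookup proper. */
static BlockNumber
toy_fsm_search(bool fsm_exists, BlockNumber fsm_answer)
{
    return fsm_exists ? fsm_answer : InvalidBlockNumber;
}

/*
 * Toy version of GetPageWithFreeSpace(). With check_fsm_only == false
 * (the initial lookup), a miss falls back to looking at the relation
 * size; with check_fsm_only == true (the recheck after acquiring the
 * extension lock), that fallback is skipped. The caller has to know
 * which mode it wants - which is my complaint.
 */
static BlockNumber
toy_get_page_with_free_space(bool fsm_exists, BlockNumber fsm_answer,
                             BlockNumber nblocks, bool check_fsm_only)
{
    BlockNumber blkno = toy_fsm_search(fsm_exists, fsm_answer);

    if (blkno == InvalidBlockNumber && !check_fsm_only && nblocks > 0)
        blkno = nblocks - 1;    /* stand-in for "consult the local map" */
    return blkno;
}
```
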

> > +/* Only create the FSM if the heap has greater than this many blocks */
> > +#define HEAP_FSM_CREATION_THRESHOLD 4
> >
> > Hm, this seems to be tying freespace.c closer to heap than I think is
> > great - think of new AMs like zheap, that also want to use it.
>
> Amit and I kept zheap in mind when working on the patch. You'd have to
> work around the metapage, but everything else should work the same.

My complaint is basically that it's apparently AM-specific (we don't use
the logic for e.g. indexes), and that the name suggests it's specific to
heap. And it's not controllable from the outside, which means it can't be
tuned for the specific use case.

> > I think this is mostly fallout about the prime issue I'm unhappy
> > about. There's now some global variable in freespacemap.c that code
> > using freespace.c has to know about and maintain.
> >
> >
> > +static void
> > +fsm_local_set(Relation rel, BlockNumber cur_nblocks)
> > +{
> > + BlockNumber blkno,
> > + cached_target_block;
> > +
> > + /* The local map must not be set already. */
> > + Assert(!FSM_LOCAL_MAP_EXISTS);
> > +
> > + /*
> > + * Starting at the current last block in the relation and working
> > + * backwards, mark alternating blocks as available.
> > + */
> > + blkno = cur_nblocks - 1;
> >
> > That comment explains very little about why this is done, and why it's a
> > good idea.
>
> Short answer: performance -- it's too expensive to try every block.
> The explanation is in storage/freespace/README -- maybe that should be
> referenced here?

Yes. Even just adding "for performance reasons, only try every second
block. See also the README" would be good.

But I'll note that the need to do this - and to potentially waste space -
counters your claim that there are no performance considerations with
this patch.
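For reference, my reading of the alternating-block logic, reduced to a
self-contained sketch (types and constants stubbed out; the real code
asserts the map is unset rather than clearing it, and is only called
with cur_nblocks > 0):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HEAP_FSM_CREATION_THRESHOLD 4
#define FSM_LOCAL_NOT_AVAIL 0x00
#define FSM_LOCAL_AVAIL     0x01

typedef uint32_t BlockNumber;

typedef struct
{
    BlockNumber nblocks;
    uint8_t     map[HEAP_FSM_CREATION_THRESHOLD];
} FSMLocalMap;

static FSMLocalMap fsm_local_map;

/*
 * Starting at the current last block and working backwards, mark
 * alternating blocks as available - for performance reasons, so at
 * most half the pages are ever probed (see the freespace README).
 */
static void
fsm_local_set(BlockNumber cur_nblocks)
{
    BlockNumber blkno;

    memset(&fsm_local_map, 0, sizeof(fsm_local_map));
    fsm_local_map.nblocks = cur_nblocks;

    blkno = cur_nblocks - 1;
    for (;;)
    {
        fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL;
        if (blkno >= 2)
            blkno -= 2;
        else
            break;
    }
}
```
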

> > +/* Status codes for the local map. */
> > +
> > +/* Either already tried, or beyond the end of the relation */
> > +#define FSM_LOCAL_NOT_AVAIL 0x00
> > +
> > +/* Available to try */
> > +#define FSM_LOCAL_AVAIL 0x01
> >
> > +/* Local map of block numbers for small heaps with no FSM. */
> > +typedef struct
> > +{
> > + BlockNumber nblocks;
> > + uint8 map[HEAP_FSM_CREATION_THRESHOLD];
> > +} FSMLocalMap;
> > +
> >
> > Hm, given realistic HEAP_FSM_CREATION_THRESHOLD, and the fact that we
> > really only need one bit per relation, it seems like map should really
> > be just a uint32 with one bit per page.
>
> I fail to see the advantage of that.

It'd allow different AMs to have different don't-create-FSM thresholds
without needing additional memory (up to 32 blocks).
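Roughly what I have in mind (a hypothetical sketch, not something in the
patch - names are mine):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/*
 * One bit per page instead of one byte per page: supports per-AM
 * thresholds up to 32 blocks without the map growing with the
 * threshold.
 */
typedef struct
{
    BlockNumber nblocks;
    uint32_t    map;        /* bit N set => block N still worth trying */
} FSMLocalBitmap;

static inline void
bitmap_set_avail(FSMLocalBitmap *m, BlockNumber blkno)
{
    m->map |= ((uint32_t) 1) << blkno;
}

static inline int
bitmap_is_avail(const FSMLocalBitmap *m, BlockNumber blkno)
{
    return (m->map >> blkno) & 1;
}

static inline void
bitmap_clear_avail(FSMLocalBitmap *m, BlockNumber blkno)
{
    m->map &= ~(((uint32_t) 1) << blkno);
}
```
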

> > I'm kinda thinking that this is the wrong architecture.
> >
> > 1) Unless I miss something, this will trigger a
> > RelationGetNumberOfBlocks(), which in turn ends up doing an lseek(),
> > once for each page we add to the relation.
>
> That was true previously anyway if the FSM returned InvalidBlockNumber.

True. That was already pretty annoying though.

> > That strikes me as pretty
> > suboptimal. I think it's even worse if multiple backends do the
> > insertion, because the RelationGetTargetBlock(relation) logic will
> > succeed less often.
>
> Could you explain why it would succeed less often?

Two aspects: 1) If more backends access a table, there's a higher chance
that the target page is full. 2) There are more backends that don't have
a target page yet.

> > 2) We'll repeatedly re-encounter the first few pages, because we clear
> > the local map after each successful RelationGetBufferForTuple().
>
> Not exactly sure what you mean? We only set the map if
> RelationGetTargetBlock() returns InvalidBlockNumber, or if it returned
> a valid block, and inserting there already failed. So, not terribly
> often, I imagine.

It's pretty common to have small tables that are modified by a number of
backends. A typical case is tables that implement locks for external
processes and such.

> On Wed, Apr 17, 2019 at 3:16 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> > Hi,
> >
> > On 2019-04-16 14:31:25 -0400, Tom Lane wrote:
> <snip>
> > > and in any case it seems pretty inefficient for workloads that insert
> > > into multiple tables.
> >
> > As is, it's inefficient for insertions into a *single* relation. The
> > RelationGetTargetBlock() makes it not crazily expensive, but it's still
> > plenty expensive.
>
> Performance testing didn't reveal any performance regression. If you
> have a realistic benchmark in mind that stresses this logic more
> heavily, I'd be happy to be convinced otherwise.

Well, try a few hundred relations on NFS (where stat() is much more
expensive). Or just pgbench a concurrent workload with a few tables
holding one live row each, updated by many backends (to simulate lock
tables and such).

But also, my concern here is to a significant degree architectural,
rather than about already-measurable performance regressions. We ought
to work towards eliminating unnecessary syscalls, not the opposite.

Greetings,

Andres Freund
