Re: Pluggable Storage - Andres's take

From: Andres Freund <andres(at)anarazel(dot)de>
To: Ashwin Agrawal <aagrawal(at)pivotal(dot)io>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Asim R P <apraveen(at)pivotal(dot)io>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Pluggable Storage - Andres's take
Date: 2019-05-08 21:43:35
Message-ID: 20190508214335.wu6u5qp7cw2h3tgl@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2019-04-29 16:17:41 -0700, Ashwin Agrawal wrote:
> On Thu, Apr 25, 2019 at 3:43 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > Hm. I think some of those changes would be a bit bigger than I initially
> > thought. Attached is a more minimal fix that'd route
> > RelationGetNumberOfBlocksForFork() through tableam if necessary. I
> > think it's definitely the right answer for 1), probably the pragmatic
> > answer to 2), but certainly not for 3).
> >
> > I've for now made the AM return the size in bytes, and then convert that
> > into blocks in RelationGetNumberOfBlocksForFork(). Most postgres callers
> > are going to continue to want it internally as pages (otherwise there's
> > going to be way too much churn, without a benefit I can see). So I
> > think that's OK.
>
> I will provide my inputs; Heikki, please correct me or add yours.
>
> I am not sure how much gain this practically provides if the rest of
> the system continues to use the returned value in terms of blocks. I
> understand that being block based (and not just block based, but with
> every block of a relation storing data and full tuples) is ingrained
> in the system. So yes, breaking out of that is a much larger change,
> and not one limited to the table AM API.

I don't think it's that ingrained in all that many parts of the
system. Outside of the places I listed upthread, and the one index case
that stashes extra info, which places are "block based" in that way?

> I feel most of the issues discussed here should be faced by zheap as
> well, since not all of its blocks/pages contain data; TPD pages, for
> example, should be excluded from sampling, TID scans, etc.

It's not a problem so far, and zheap works on tableam. You can just skip
such blocks during sampling / analyze, and return nothing for tidscans.
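
Roughly, with the scan_analyze_next_block() callback as it stands, an
AM with interspersed metadata blocks could do something like the
following (MyAMScanDesc / myam_block_is_metadata() are obviously made
up):

static bool
myam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
                             BufferAccessStrategy bstrategy)
{
    MyAMScanDesc mscan = (MyAMScanDesc) scan;

    /* metadata blocks (TPD style) contain no tuples, skip for sampling */
    if (myam_block_is_metadata(mscan, blockno))
        return false;

    /* otherwise pin the block, like heapam does */
    mscan->rs_cblock = blockno;
    mscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
                                        blockno, RBM_NORMAL, bstrategy);
    return true;
}

acquire_sample_rows() already treats a false return as "skip this block
and move on", so analyze.c itself shouldn't need to change for that.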

> > > 2) commands/analyze.c, computing pg_class.relpages
> > >
> > > This should imo be moved to the tableam callback. It's currently
> > > done a bit weirdly, with FDWs computing relpages in the callback,
> > > but then also returning the acquirefunc. Seems like it should be
> > > computed entirely as part of calling the acquirefunc.
> >
> > Here I'm not sure routing RelationGetNumberOfBlocksForFork() through
> > tableam wouldn't be the right minimal approach too. It has the
> > disadvantage of implying certain values for the
> > RelationGetNumberOfBlocksForFork(MAIN) return value. The alternative
> > would be to return the desired sampling range in
> > table_beginscan_analyze() - but that'd require some duplication because
> > currently that just uses the generic scan_begin() callback.
>
> Yes, just routing the relation size via the AM layer, still using the
> returned value in terms of blocks, and performing the sampling based
> on it doesn't feel like it resolves the issue. Maybe sampling needs to
> be delegated completely to the AM layer. Code duplication could be
> avoided by similar AMs (heap and zheap) possibly using some common
> utility functions to achieve the intended result.

I don't know what this is actually proposing.

> > I suspect - as previously mentioned - that we're going to have to
> > extend statistics collection beyond the current approach at some
> > point, but I don't think that's now. At least to me it's not clear
> > how to best represent the stats, and how to best use them, if the
> > underlying storage is fundamentally not block based. Nor how we'd
> > avoid code duplication...
>
> Yes, will have to give more thoughts into this.
>
> >
> > > 3) nodeTidscan, skipping over too large tids
> > > I think this should just be moved into the AMs, there's no need to
> > > have this in nodeTidscan.c
> >
> > I think here it's *not* actually correct at all to use the relation
> > size. It's currently doing:
> >
> > /*
> > * We silently discard any TIDs that are out of range at the time of scan
> > * start. (Since we hold at least AccessShareLock on the table, it won't
> > * be possible for someone to truncate away the blocks we intend to
> > * visit.)
> > */
> > nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);
> >
> > which is fine (except for a certain abstraction leakage) for an AM like
> > heap or zheap, but I suspect strongly that that's not ok for Ashwin &
> > Heikki's approach where tid isn't tied to physical representation.
>
> Agreed, it's not nice to have that optimization performed based on the
> number of blocks in the generic layer. I feel it's not efficient for
> zheap either, due to the TPD pages mentioned above, since the number
> of blocks returned will be higher than the number of actual data
> blocks.

I don't think there's a problem for zheap. The blocks are just
interspersed.

Having pondered this a lot more, I think this is the way to go for
v12. Then we can improve on this for v13, to make it nicer.
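
To make the shape concrete, the minimal fix is basically the following
(sketch only; the exact spelling in the patch may differ):

/* new tableam callback: the AM reports its size in bytes */
uint64      (*relation_size) (Relation rel, ForkNumber forkNumber);

/* ... and the generic code converts back to pages, so the vast
 * majority of callers can keep thinking in blocks */
BlockNumber
RelationGetNumberOfBlocksForFork(Relation relation, ForkNumber forkNum)
{
    uint64      szbytes = table_relation_size(relation, forkNum);

    return (szbytes + (BLCKSZ - 1)) / BLCKSZ;
}

For a block based AM like heap the callback can just multiply
smgrnblocks() by BLCKSZ, so there's no behavioural change there.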

> > The proper fix seems to be to introduce a new scan variant
> > (e.g. table_beginscan_tid()), and then have table_fetch_row_version take
> > a scan as a parameter. But it seems we'd have to introduce that as a
> > separate tableam callback, because we'd not want to incur the overhead
> > of creating an additional scan / RelationGetNumberOfBlocks() checks for
> > triggers et al.
>
> Thinking out loud here, we can possibly tackle this in multiple ways.
> First, the above-mentioned check seems to me more of an optimization
> than functionally needed; correct me if wrong. If that's true, we
> could ask the AM whether it wishes to apply that relation-size based
> optimization or not.

It'd be really expensive to check this differently for heap. We'd have
to check the relation size, which is out of the question imo.
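
To sketch the variant proposed above (everything tentative, including
the names):

/* new scan variant for tid scans */
TableScanDesc table_beginscan_tid(Relation rel, Snapshot snapshot);

/* new callback: is this tid within the range covered by the scan?
 * for heap: block number smaller than nblocks-at-scan-start; an AM
 * whose tids aren't tied to the physical layout can decide differently */
bool        (*tuple_tid_valid) (TableScanDesc scan, ItemPointer tid);

nodeTidscan.c would then ask the AM through the scan, instead of
calling RelationGetNumberOfBlocks() itself, and triggers et al could
continue to use table_fetch_row_version() without paying for creating a
scan.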

Greetings,

Andres Freund
