Re: Pluggable Storage - Andres's take

From: Ashwin Agrawal <aagrawal(at)pivotal(dot)io>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Asim R P <apraveen(at)pivotal(dot)io>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Pluggable Storage - Andres's take
Date: 2019-04-29 23:17:41
Message-ID: CALfoeiv3Y3jvX_JL=5dTzquyPCR0KD3qfUoN4oZ7BHoBKiwn+g@mail.gmail.com
Lists: pgsql-hackers

On Thu, Apr 25, 2019 at 3:43 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> Hm. I think some of those changes would be a bit bigger than I initially
> thought. Attached is a more minimal fix that'd route
> RelationGetNumberOfBlocksForFork() through tableam if necessary. I
> think it's definitely the right answer for 1), probably the pragmatic
> answer to 2), but certainly not for 3).
>
> I've for now made the AM return the size in bytes, and then convert that
> into blocks in RelationGetNumberOfBlocksForFork(). Most postgres callers
> are going to continue to want it internally as pages (otherwise there's
> going to be way too much churn, without a benefit I can see). So I
> think that's OK.

I will provide my input; Heikki, please correct me or add yours.

I am not sure how much this gains in practice if the rest of the
system continues to use the returned value in terms of blocks. I
understand that being block based (and not just block based, but
assuming every block of the relation stores data and full tuples) is
ingrained in the system, so breaking out of that is indeed a much
larger change and not limited to the table AM API.
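
For concreteness, my reading of the proposed routing is roughly the
following (a sketch only; the callback and helper names are my
assumptions, not necessarily what the attached fix does). The AM
reports bytes and the generic layer converts back to blocks, so
existing block-based callers keep working:

#include "postgres.h"
#include "access/tableam.h"
#include "storage/bufmgr.h"
#include "storage/smgr.h"
#include "utils/rel.h"

/* AM callback for heap: just report what smgr has for the requested fork */
static uint64
heapam_relation_size(Relation rel, ForkNumber forkNumber)
{
	uint64		nblocks = 0;

	RelationOpenSmgr(rel);

	if (forkNumber == InvalidForkNumber)
	{
		/* InvalidForkNumber meaning "entire relation": sum all forks */
		for (int i = 0; i <= MAX_FORKNUM; i++)
		{
			if (smgrexists(rel->rd_smgr, i))
				nblocks += smgrnblocks(rel->rd_smgr, i);
		}
	}
	else
		nblocks = smgrnblocks(rel->rd_smgr, forkNumber);

	return nblocks * (uint64) BLCKSZ;
}

/* Generic layer: route tables through their AM, everything else unchanged */
BlockNumber
RelationGetNumberOfBlocksForFork(Relation relation, ForkNumber forkNum)
{
	if (relation->rd_tableam != NULL)
		return table_relation_size(relation, forkNum) / BLCKSZ;

	RelationOpenSmgr(relation);
	return smgrnblocks(relation->rd_smgr, forkNum);
}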

I feel zheap should face most of the issues discussed here as well,
since not all of its blocks/pages contain data; TPD pages, for
example, should be excluded from sampling, TID scans, etc.

> There's also a somewhat weird bit of returning the total relation size
> for InvalidForkNumber - it's pretty likely that other AMs wouldn't use
> postgres' current forks, but have equivalent concepts. And without that
> there'd be no way to get that size. I'm not sure I like this, input
> welcome. But it seems good to offer the ability to get the entire size
> somehow.

Yes, I do think we should have a mechanism to get the total size as
well as the size for a specific purpose. Zedstore currently doesn't
use forks. Just a thought: instead of taking a forknum argument, the
callback could take something like data vs. meta-data (or main-data
vs. auxiliary-data). Though I don't know of a use case that wants the
size of just some non-MAIN fork for heap/zheap; such code shouldn't
live in generic areas, only in AM-specific code.
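
To make that concrete, a purely hypothetical shape for such an
argument (neither this enum nor the callback signature exists; it only
illustrates the idea) could be:

typedef enum TableSizeClass
{
	TABLE_SIZE_MAIN_DATA,		/* user data proper */
	TABLE_SIZE_AUXILIARY_DATA,	/* AM metadata: FSM/VM equivalents, TPD, ... */
	TABLE_SIZE_TOTAL			/* everything the AM stores for the relation */
} TableSizeClass;

/* the AM would report bytes for the requested class */
typedef uint64 (*relation_size_function) (Relation rel, TableSizeClass which);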

>
> > 2) commands/analyze.c, computing pg_class.relpages
> >
> > This should imo be moved to the tableam callback. It's currently done
> > a bit weirdly imo, with fdws computing relpages in the callback, but
> > then also returning the acquirefunc. Seems like it should entirely be
> > computed as part of calling acquirefunc.
>
> Here I'm not sure routing RelationGetNumberOfBlocksForFork() through
> tableam wouldn't be the right minimal approach too. It has the
> disadvantage of implying certain values for the
> RelationGetNumberOfBlocksForFork(MAIN) return value. The alternative
> would be to return the desired sampling range in
> table_beginscan_analyze() - but that'd require some duplication because
> currently that just uses the generic scan_begin() callback.

Yes, just routing the relation size through the AM layer, still using
the returned value in terms of blocks and sampling based on those
blocks, doesn't feel like it resolves the issue. We may need to
delegate sampling completely to the AM layer. Code duplication can be
avoided by having similar AMs (heap and zheap) share common utility
functions to achieve the intended result.
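
To illustrate what delegating to the AM layer could mean (again
hypothetical, no such callbacks exist today), the AM could hand
analyze.c an acquire function much like FDWs already do, with
block-based AMs sharing one implementation:

typedef struct TableAnalyzeSupport
{
	/*
	 * Fill rows[] with up to targrows sampled tuples and return how many
	 * were acquired, also reporting live/dead totals, analogous to the
	 * existing FDW AcquireSampleRowsFunc.
	 */
	int			(*acquire_sample_rows) (Relation rel, int targrows,
										HeapTuple *rows,
										double *totalrows,
										double *totaldeadrows);

	/* what to store in pg_class.relpages afterwards */
	BlockNumber (*relation_pages_for_stats) (Relation rel);
} TableAnalyzeSupport;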

>
> I suspect - as previously mentioned- that we're going to have to extend
> statistics collection beyond the current approach at some point, but I
> don't think that's now. At least to me it's not clear how to best
> represent the stats, and how to best use them, if the underlying storage
> is fundamentally not block based. Nor how we'd avoid code duplication...

Yes, we will have to give this more thought.

>
> > 3) nodeTidscan, skipping over too large tids
> > I think this should just be moved into the AMs, there's no need to
> > have this in nodeTidscan.c
>
> I think here it's *not* actually correct at all to use the relation
> size. It's currently doing:
>
> /*
> * We silently discard any TIDs that are out of range at the time of scan
> * start. (Since we hold at least AccessShareLock on the table, it won't
> * be possible for someone to truncate away the blocks we intend to
> * visit.)
> */
> nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);
>
> which is fine (except for a certain abstraction leakage) for an AM like
> heap or zheap, but I suspect strongly that that's not ok for Ashwin &
> Heikki's approach where tid isn't tied to physical representation.

Agreed, it's not nice to have that optimization performed based on
the number of blocks in the generic layer. I feel it's not efficient
for zheap either, due to the TPD pages mentioned above, since the
number of blocks returned will be higher than the number of actual
data blocks.

>
> The obvious answer would be to just move that check into the
> table_fetch_row_version implementation (currently just calling
> heap_fetch()) - but that doesn't seem OK from a performance POV, because
> we'd then determine the relation size once for each tid, rather than
> once per tidscan. And it'd also check in cases where we know the tid is
> supposed to be valid (e.g. fetching trigger tuples and such).

Agreed, checking the relation size once per tuple is not a viable solution.

>
> The proper fix seems to be to introduce a new scan variant
> (e.g. table_beginscan_tid()), and then have table_fetch_row_version take
> a scan as a parameter. But it seems we'd have to introduce that as a
> separate tableam callback, because we'd not want to incur the overhead
> of creating an additional scan / RelationGetNumberOfBlocks() checks for
> triggers et al.

Thinking out loud here, we can possibly tackle this in multiple ways.
First, the above-mentioned check seems to me to be an optimization
rather than functionally required; correct me if I'm wrong. If that's
true, we can ask the AM whether it wishes to apply that
relation-size-based optimization at all. For Zedstore, instead of
performing this optimization we would directly call fetch row version,
and Zedstore can quickly bail out based on the TID passed to it, since
its meta page records the highest allocated TID value. With concurrent
inserts, though, it may perform more work.

Another alternative: instead of getting the relation size, add a
callback to get the highest TID value from the AM. heap and zheap can
return a TID built from the highest block number and the maximum
offset that block can hold; Zedstore can return the highest TID it has
assigned so far. Then either perform the check against that whole TID
rather than just the block number, or extract the block number from
the TID and check against that instead (roughly as sketched below).
That would at least work for the AMs we know of so far; it's hard to
imagine how AMs that don't exist yet would use this.
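
A rough illustration of that idea, with made-up names (nothing like
this exists in the tree yet):

#include "postgres.h"
#include "access/htup_details.h"
#include "storage/bufmgr.h"
#include "storage/itemptr.h"
#include "utils/rel.h"

/* AM callback: the largest TID a scan could possibly need to visit */
typedef void (*relation_get_max_tid_function) (Relation rel,
											   ItemPointer maxtid);

/* heap/zheap style: highest block number, highest offset a block can hold */
static void
heapam_get_max_tid(Relation rel, ItemPointer maxtid)
{
	BlockNumber nblocks = RelationGetNumberOfBlocks(rel);

	if (nblocks == 0)
		ItemPointerSetInvalid(maxtid);
	else
		ItemPointerSet(maxtid, nblocks - 1, MaxHeapTuplesPerPage);
}

/*
 * nodeTidscan.c would then discard TIDs beyond maxtid, instead of comparing
 * ItemPointerGetBlockNumber(tid) against RelationGetNumberOfBlocks().
 */
static bool
tid_in_range(ItemPointer tid, ItemPointer maxtid)
{
	return ItemPointerIsValid(maxtid) &&
		ItemPointerCompare(tid, maxtid) <= 0;
}

Zedstore would simply return the highest TID recorded in its meta page,
so the check stays cheap without assuming a block-shaped relation.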

Irrespective of how we solve this problem, ctids are displayed in,
and need to be specified in, (block, offset) form for TID scans :-)
