Re: Pluggable Storage - Andres's take

From: Andres Freund <andres(at)anarazel(dot)de>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Ashwin Agrawal <aagrawal(at)pivotal(dot)io>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Asim R P <apraveen(at)pivotal(dot)io>, pgsql-hackers(at)postgresql(dot)org, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Pluggable Storage - Andres's take
Date: 2019-04-25 22:43:15
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi Heikki, Ashwin, Tom,

On 2019-04-23 15:52:01 -0700, Andres Freund wrote:
> On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
> > index_update_stats() calls RelationGetNumberOfBlocks(<table>). If the AM
> > doesn't use normal data files, that won't work. I bumped into that with my
> > toy implementation, which wouldn't need to create any data files, if it
> > wasn't for this.
> There are a few more of these:

> I'm not sure about doing so for v12 though. 1) and 3) are fairly
> trivial, but 2) would involve changing the FDW interface, by changing
> the AnalyzeForeignTable, AcquireSampleRowsFunc signatures. But OTOH,
> we're not even in beta1.

Hm. I think some of those changes would be a bit bigger than I initially
though. Attached is a more minimal fix that'd route
RelationGetNumberOfBlocksForFork() through tableam if necessary. I
think it's definitely the right answer for 1), probably the pragmatic
answer to 2), but certainly not for 3).

I've for now made the AM return the size in bytes, and then convert that
into blocks in RelationGetNumberOfBlocksForFork(). Most postgres callers
are going to continue to want it internally as pages (otherwise there's
going to be way too much churn, without a benefit I can see). So I
thinkt that's OK.

There's also a somewhat weird bit of returning the total relation size
for InvalidForkNumber - it's pretty likely that other AMs wouldn't use
postgres' current forks, but have equivalent concepts. And without that
there'd be no way to get that size. I'm not sure I like this, input
welcome. But it seems good to offer the ability to get the entire size

Btw, isn't RelationGetNumberOfBlocksForFork() currently weirdly placed?
I don't see why bufmgr.c would be appropriate? Although I don't think
it's particulary clear where it'd best reside - I'd tentatively say

Heikki, Ashwin, your inputs would be appreciated here, in particular the
tid fetch bit below.

The attached patch isn't intended to be applied as-is, just basis for

> 1) index_update_stats(), computing pg_class.relpages
> Feels like the number of both heap and index blocks should be
> computed by the index build and stored in IndexInfo. That'd also get
> a bit closer towards allowing indexams not going through smgr (useful
> e.g. for memory only ones).

Due to parallel index builds that'd actually be hard. Given the number
of places wanting to compute relpages for pg_class I think the above
patch routing RelationGetNumberOfBlocksForFork() through tableam is the
right fix.

> 2) commands/analyze.c, computing pg_class.relpages
> This should imo be moved to the tableam callback. It's currently done
> a bit weirdly imo, with fdws computing relpages the callback, but
> then also returning the acquirefunc. Seems like it should entirely be
> computed as part of calling acquirefunc.

Here I'm not sure routing RelationGetNumberOfBlocksForFork() through
tableam wouldn't be the right minimal approach too. It has the
disadvantage of implying certain values for the
RelationGetNumberOfBlocksForFork(MAIN) return value. The alternative
would be to return the desire sampling range in
table_beginscan_analyze() - but that'd require some duplication because
currently that just uses the generic scan_begin() callback.

I suspect - as previously mentioned- that we're going to have to extend
statistics collection beyond the current approach at some point, but I
don't think that's now. At least to me it's not clear how to best
represent the stats, and how to best use them, if the underlying storage
is fundamentally not block best. Nor how we'd avoid code duplication...

> 3) nodeTidscan, skipping over too large tids
> I think this should just be moved into the AMs, there's no need to
> have this in nodeTidscan.c

I think here it's *not* actually correct at all to use the relation
size. It's currently doing:

* We silently discard any TIDs that are out of range at the time of scan
* start. (Since we hold at least AccessShareLock on the table, it won't
* be possible for someone to truncate away the blocks we intend to
* visit.)
nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);

which is fine (except for a certain abstraction leakage) for an AM like
heap or zheap, but I suspect strongly that that's not ok for Ashwin &
Heikki's approach where tid isn't tied to physical representation.

The obvious answer would be to just move that check into the
table_fetch_row_version implementation (currently just calling
heap_fetch()) - but that doesn't seem OK from a performance POV, because
we'd then determine the relation size once for each tid, rather than
once per tidscan. And it'd also check in cases where we know the tid is
supposed to be valid (e.g. fetching trigger tuples and such).

The proper fix seems to be to introduce a new scan variant
(e.g. table_beginscan_tid()), and then have table_fetch_row_version take
a scan as a parameter. But it seems we'd have to introduce that as a
separate tableam callback, because we'd not want to incur the overhead
of creating an additional scan / RelationGetNumberOfBlocks() checks for
triggers et al.


Andres Freund

Attachment Content-Type Size
tableam-size.diff text/x-diff 6.4 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2019-04-25 23:29:18 Re: Do PostgreSQL have map and set structure(like STL in C++)?
Previous Message Andres Freund 2019-04-25 21:56:24 Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6