|From:||Andres Freund <andres(at)anarazel(dot)de>|
|To:||Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Ashwin Agrawal <aagrawal(at)pivotal(dot)io>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>|
|Cc:||Justin Pryzby <pryzby(at)telsasoft(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Asim R P <apraveen(at)pivotal(dot)io>, pgsql-hackers(at)postgresql(dot)org, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>|
|Subject:||Re: Pluggable Storage - Andres's take|
|Views:||Raw Message | Whole Thread | Download mbox|
On 2019-04-25 15:43:15 -0700, Andres Freund wrote:
> Hm. I think some of those changes would be a bit bigger than I initially
> though. Attached is a more minimal fix that'd route
> RelationGetNumberOfBlocksForFork() through tableam if necessary. I
> think it's definitely the right answer for 1), probably the pragmatic
> answer to 2), but certainly not for 3).
> I've for now made the AM return the size in bytes, and then convert that
> into blocks in RelationGetNumberOfBlocksForFork(). Most postgres callers
> are going to continue to want it internally as pages (otherwise there's
> going to be way too much churn, without a benefit I can see). So I
> thinkt that's OK.
> There's also a somewhat weird bit of returning the total relation size
> for InvalidForkNumber - it's pretty likely that other AMs wouldn't use
> postgres' current forks, but have equivalent concepts. And without that
> there'd be no way to get that size. I'm not sure I like this, input
> welcome. But it seems good to offer the ability to get the entire size
I'm still reasonably happy with this. I'll polish it a bit and push.
> > 3) nodeTidscan, skipping over too large tids
> > I think this should just be moved into the AMs, there's no need to
> > have this in nodeTidscan.c
> I think here it's *not* actually correct at all to use the relation
> size. It's currently doing:
> * We silently discard any TIDs that are out of range at the time of scan
> * start. (Since we hold at least AccessShareLock on the table, it won't
> * be possible for someone to truncate away the blocks we intend to
> * visit.)
> nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);
> which is fine (except for a certain abstraction leakage) for an AM like
> heap or zheap, but I suspect strongly that that's not ok for Ashwin &
> Heikki's approach where tid isn't tied to physical representation.
> The obvious answer would be to just move that check into the
> table_fetch_row_version implementation (currently just calling
> heap_fetch()) - but that doesn't seem OK from a performance POV, because
> we'd then determine the relation size once for each tid, rather than
> once per tidscan. And it'd also check in cases where we know the tid is
> supposed to be valid (e.g. fetching trigger tuples and such).
> The proper fix seems to be to introduce a new scan variant
> (e.g. table_beginscan_tid()), and then have table_fetch_row_version take
> a scan as a parameter. But it seems we'd have to introduce that as a
> separate tableam callback, because we'd not want to incur the overhead
> of creating an additional scan / RelationGetNumberOfBlocks() checks for
> triggers et al.
Attached is a prototype of a variation of this. I added a
table_tuple_tid_valid(TableScanDesc sscan, ItemPointer tid)
callback / wrapper. Currently it just takes a "plain" scan, but we could
add a separate table_beginscan variant too.
For heap that just means we can just use HeapScanDesc's rs_nblock to
filter out invalid tids, and we only need to call
RelationGetNumberOfBlocks() once, rather than every
table_tuple_tid_valid(0 / table_get_latest_tid() call. Which is a good
improvement for nodeTidscan's table_get_latest_tid() call (for WHERE
CURRENT OF) - which previously computed the relation size once per
Needs a bit of polishing, but I think this is the right direction?
Unless somebody protests, I'm going to push something along those lines
|Next Message||Andres Freund||2019-05-15 19:02:09||Are ctid chaining loops safe without relation size checks?|
|Previous Message||Alvaro Herrera||2019-05-15 18:30:05||more message fixes|