Re: Pluggable Storage - Andres's take

From: Andres Freund <andres(at)anarazel(dot)de>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Asim R P <apraveen(at)pivotal(dot)io>, pgsql-hackers(at)postgresql(dot)org, Ashwin Agrawal <aagrawal(at)pivotal(dot)io>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Pluggable Storage - Andres's take
Date: 2019-04-11 16:49:47
Message-ID: 20190411164947.nkii4gaeilt4bui7@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2019-04-11 14:52:40 +0300, Heikki Linnakangas wrote:
> Here is another iteration on the comments. The patch is a mix of
> copy-editing and questions. The questions are marked with "HEIKKI:". I can
> continue the copy-editing, if you can reply to the questions, clarifying the
> intention on some parts of the API. (Or feel free to pick and push any of
> these fixes immediately, if you prefer.)

Thanks!

> diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
> index f7f726b5aec..bbcab9ce31a 100644
> --- a/src/backend/utils/misc/guc.c
> +++ b/src/backend/utils/misc/guc.c
> @@ -3638,7 +3638,7 @@ static struct config_string ConfigureNamesString[] =
> {"default_table_access_method", PGC_USERSET, CLIENT_CONN_STATEMENT,
> gettext_noop("Sets the default table access method for new tables."),
> NULL,
> - GUC_IS_NAME
> + GUC_NOT_IN_SAMPLE | GUC_IS_NAME
> },
> &default_table_access_method,
> DEFAULT_TABLE_ACCESS_METHOD,

Hm, I think we should rather add it to sample. That's an oversight, not
intentional.

> index 6fbfcb96c98..d4709563e7e 100644
> --- a/src/include/access/tableam.h
> +++ b/src/include/access/tableam.h
> @@ -91,8 +91,9 @@ typedef enum TM_Result
> * xmax is the outdating transaction's XID. If the caller wants to visit the
> * replacement tuple, it must check that this matches before believing the
> * replacement is really a match.
> + * HEIKKI: matches what? xmin, but that's specific to the heapam.

It's basically just the old comment moved. I wonder if we can just get
rid of that field - because the logic to follow update chains correctly
is now inside the lock tuple callback. And as you say - it's not clear
what callers can do with it for the purpose of following chains. The
counter-argument is that having it makes it a lot less annoying to adapt
external code that wants to adapt with the minimal set of changes, and
only is really interested in supporting heap for now.

> * GetTableAmRoutine() asserts that required callbacks are filled in, remember
> * to update when adding a callback.
> @@ -179,6 +184,12 @@ typedef struct TableAmRoutine
> *
> * if temp_snap is true, the snapshot will need to be deallocated at
> * scan_end.
> + *
> + * HEIKKI: table_scan_update_snapshot() changes the snapshot. That's
> + * a bit surprising for the AM, no? Can it be called when a scan is
> + * already in progress?

Yea, it can be called when the scan is in-progress. I think we probably
should just fix calling code to not need that - it's imo weird that
nodeBitmapHeapscan.c doesn't just delay starting the scan till it has
the snapshot. This isn't new code, but it's now going to be exposed to
more AMs, so I think there's a good argument to fix it now.

Robert: You committed that addition, in

commit f35742ccb7aa53ee3ed8416bbb378b0c3eeb6bb9
Author: Robert Haas <rhaas(at)postgresql(dot)org>
Date: 2017-03-08 12:05:43 -0500

Support parallel bitmap heap scans.

do you remember why that's done?

> + * HEIKKI: A flags bitmask argument would be more readable than 6 booleans
> */
> TableScanDesc (*scan_begin) (Relation rel,
> Snapshot snapshot,

I honestly don't have strong feelings about it. Not sure that I buy that
bitmasks would be much more readable - but perhaps we could just use the
struct trickery we started to use in

commit f831d4accda00b9144bc647ede2e2f848b59f39d
Author: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Date: 2019-02-01 11:29:42 -0300

Add ArchiveOpts to pass options to ArchiveEntry

> @@ -194,6 +205,9 @@ typedef struct TableAmRoutine
> /*
> * Release resources and deallocate scan. If TableScanDesc.temp_snap,
> * TableScanDesc.rs_snapshot needs to be unregistered.
> + *
> + * HEIKKI: I find this 'temp_snap' thing pretty weird. Can't the caller handle
> + * deregistering it?
> */
> void (*scan_end) (TableScanDesc scan);

It's old logic, just wrapped new. I think there's some argument that
some of this should be moved to tableam.c rather than the individual
AMs.

> @@ -221,6 +235,11 @@ typedef struct TableAmRoutine
> /*
> * Estimate the size of shared memory needed for a parallel scan of this
> * relation. The snapshot does not need to be accounted for.
> + *
> + * HEIKKI: If this returns X, then the parallelscan_initialize() call
> + * mustn't use more than X. So this is not just for optimization purposes,
> + * for example. Not sure how to phrase that, but could use some
> + * clarification.
> */
> Size (*parallelscan_estimate) (Relation rel);

Hm. I thought I'd done that by adding the note about the amount of space
parallelscan_initialize() getting memory sized by
parallelscan_estimate().

> /*
> * Reset index fetch. Typically this will release cross index fetch
> * resources held in IndexFetchTableData.
> + *
> + * HEIKKI: Is this called between every call to index_fetch_tuple()?
> + * Between every call to index_fetch_tuple(), except when call_again is
> + * set? Can it be a no-op?
> */
> void (*index_fetch_reset) (struct IndexFetchTableData *data);

It's basically just to release resources eagerly. I'll add a note.

> @@ -272,19 +297,22 @@ typedef struct TableAmRoutine
> * test, return true, false otherwise.
> *
> * Note that AMs that do not necessarily update indexes when indexed
> - * columns do not change, need to return the current/correct version of
> + * columns don't change, need to return the current/correct version of
> * the tuple that is visible to the snapshot, even if the tid points to an
> * older version of the tuple.
>
> * *call_again is false on the first call to index_fetch_tuple for a tid.
> - * If there potentially is another tuple matching the tid, *call_again
> - * needs be set to true by index_fetch_tuple, signalling to the caller
> + * If there potentially is another tuple matching the tid, the callback
> + * needs to set *call_again to true, signalling to the caller
> * that index_fetch_tuple should be called again for the same tid.
> *
> * *all_dead, if all_dead is not NULL, should be set to true by
> * index_fetch_tuple iff it is guaranteed that no backend needs to see
> - * that tuple. Index AMs can use that do avoid returning that tid in
> + * that tuple. Index AMs can use that to avoid returning that tid in
> * future searches.
> + *
> + * HEIKKI: Should the snapshot be given in index_fetch_begin()? Can it
> + * differ between calls?
> */
> bool (*index_fetch_tuple) (struct IndexFetchTableData *scan,
> ItemPointer tid,

Hm. It could very well differ between calls. E.g. _bt_check_unique()
could benefit from that (although it currently uses the
table_index_fetch_tuple_check() wrapper), as it does one lookup with
SnapshotDirty, and then the next with SnapshotSelf.

> @@ -302,6 +330,8 @@ typedef struct TableAmRoutine
> * Fetch tuple at `tid` into `slot`, after doing a visibility test
> * according to `snapshot`. If a tuple was found and passed the visibility
> * test, returns true, false otherwise.
> + *
> + * HEIKKI: explain how this differs from index_fetch_tuple.
> */
> bool (*tuple_fetch_row_version) (Relation rel,
> ItemPointer tid,

Currently the wrapper has:

* See table_index_fetch_tuple's comment about what the difference between
* these functions is. This function is the correct to use outside of
* index entry->table tuple lookups.

referencing

* The difference between this function and table_fetch_row_version is that
* this function returns the currently visible version of a row if the AM
* supports storing multiple row versions reachable via a single index entry
* (like heap's HOT). Whereas table_fetch_row_version only evaluates the
* tuple exactly at `tid`. Outside of index entry ->table tuple lookups,
* table_fetch_row_version is what's usually needed.

should we just duplicate that?

> @@ -311,14 +341,17 @@ typedef struct TableAmRoutine
> /*
> * Return the latest version of the tuple at `tid`, by updating `tid` to
> * point at the newest version.
> + *
> + * HEIKKI: the latest version visible to the snapshot?
> */
> void (*tuple_get_latest_tid) (Relation rel,
> Snapshot snapshot,
> ItemPointer tid);

It's such a bad interface :(. I'd love to just remove it. Based on

https://www.postgresql.org/message-id/17ef5a8a-71cb-5cbf-1762-dbb71626f84e%40dream.email.ne.jp

I think we can basically just remove currtid_byreloid/byrelname. I've
not sufficiently thought about TidNext() yet.

> /*
> - * Does the tuple in `slot` satisfy `snapshot`? The slot needs to be of
> - * the appropriate type for the AM.
> + * Does the tuple in `slot` satisfy `snapshot`?
> + *
> + * The AM may modify the data underlying the tuple as a side-effect.
> */
> bool (*tuple_satisfies_snapshot) (Relation rel,
> TupleTableSlot *slot,

Hm, this obviously should be moved here from the wrapper. But I now
wonder if we can't phrase this better. Might try to come up with
something.

> + /*
> + * Copy all data from `OldHeap` into `NewHeap`, as part of a CLUSTER or
> + * VACUUM FULL.
> + *
> + * If `OldIndex` is valid, the data should be ordered according to the
> + * given index. If `use_sort` is false, the data should be fetched from the
> + * index, otherwise it should be fetched from the old table and sorted.
> + *
> + * OldestXmin, FreezeXid, MultiXactCutoff are currently valid values for
> + * the table.
> + * HEIKKI: What does "currently valid" mean? Valid for the old table?

They are system-wide values, basically. Not sure into how much detail
about that to go her?

> + * The callback should set *num_tuples, *tups_vacuumed, *tups_recently_dead
> + * to statistics computed while copying for the relation. Not all might make
> + * sense for every AM.
> + * HEIKKI: What to do for the ones that don't make sense? Set to 0?
> + */

I don't see much of an alternative, yea. I suspect we're going to have
to expand vacuum's reporting once we have a better grasp about what
other AMs want / need.

> /*
> * Prepare to analyze block `blockno` of `scan`. The scan has been started
> - * with table_beginscan_analyze(). See also
> - * table_scan_analyze_next_block().
> + * with table_beginscan_analyze().
> *
> * The callback may acquire resources like locks that are held until
> - * table_scan_analyze_next_tuple() returns false. It e.g. can make sense
> + * table_scan_analyze_next_tuple() returns false. For example, it can make sense
> * to hold a lock until all tuples on a block have been analyzed by
> * scan_analyze_next_tuple.
> + * HEIKKI: Hold a lock on what? A lwlock on the page?

Yea, that's what heapam does. I'm not particularly happy with this, but
I'm not sure how to do better. I expect that we'll have to revise this
to be more general at some not too far away point.

> @@ -618,6 +666,8 @@ typedef struct TableAmRoutine
> * internally needs to perform mapping between the internal and a block
> * based representation.
> *
> + * HEIKKI: What TsmRoutine? Where is that?

I'm not sure what you mean. The SampleScanState has it's associated
tablesample routine. Would saying something like "will call the
NextSampleBlock() callback for the TsmRoutine associated with the
SampleScanState" be better?

> /*
> * Like table_beginscan(), but table_beginscan_strat() offers an extended API
> - * that lets the caller control whether a nondefault buffer access strategy
> - * can be used, and whether syncscan can be chosen (possibly resulting in the
> - * scan not starting from block zero). Both of these default to true with
> - * plain table_beginscan.
> + * that lets the caller to use a non-default buffer access strategy, or
> + * specify that a synchronized scan can be used (possibly resulting in the
> + * scan not starting from block zero). Both of these default to true, as
> + * with plain table_beginscan.
> + *
> + * HEIKKI: I'm a bit confused by 'allow_strat'. What is the non-default
> + * strategy that will get used if you pass allow_strat=true? Perhaps the flag
> + * should be called "use_bulkread_strategy"? Or it should be of type
> + * BufferAccessStrategyType, or the caller should create a strategy with
> + * GetAccessStrategy() and pass that.
> */

That's really just a tableam port of the pre-existing heapam interface.
I don't like the API very much, but there's only so much things that
were realistic to change during this project (I think, there were
obviously lots of judgement calls). I don't think there's much reason
to defend the current status - and I'm happy to collaborate on fixing
that. But I think it's a out of scope for 12.

> /*
> - * table_beginscan_sampling is an alternative entry point for setting up a
> + * table_beginscan_sampling() is an alternative entry point for setting up a
> * TableScanDesc for a TABLESAMPLE scan. As with bitmap scans, it's worth
> * using the same data structure although the behavior is rather different.
> * In addition to the options offered by table_beginscan_strat, this call
> * also allows control of whether page-mode visibility checking is used.
> + *
> + * HEIKKI: What is 'pagemode'?
> */

That's a good question. My not defining it is pretty much a cop-out,
because there previously wasn't any explanation, and I wasn't sure there
*is* a meaningful definition. I mean, it's basically largely an
efficiency hack inside heapam.c, but it's currently externally
determined e.g. in bernoulli.c (code from 11):

* Use bulkread, since we're scanning all pages. But pagemode visibility
* checking is a win only at larger sampling fractions. The 25% cutoff
* here is based on very limited experimentation.
*/
node->use_bulkread = true;
node->use_pagemode = (percent >= 25);

If you have a suggestion how to either get rid of it, or how to properly
phrase this...

> * TABLE_INSERT_NO_LOGICAL force-disables the emitting of logical decoding
> * information for the tuple. This should solely be used during table rewrites
> * where RelationIsLogicallyLogged(relation) is not yet accurate for the new
> * relation.
> + * HEIKKI: Is this optional, too? Can the AM ignore it?

Hm. Currently logical decoding isn't really extensible automatically to
an AM (it works via WAL and WAL isn't extensible) - so it'll currently
not mean anything to non-heap AMs (or AMs that patch/are part of core).

> * Note that most of these options will be applied when inserting into the
> * heap's TOAST table, too, if the tuple requires any out-of-line data.
> @@ -1041,6 +1100,8 @@ table_compute_xid_horizon_for_tuples(Relation rel,
> * On return the slot's tts_tid and tts_tableOid are updated to reflect the
> * insertion. But note that any toasting of fields within the slot is NOT
> * reflected in the slots contents.
> + *
> + * HEIKKI: I think GetBulkInsertState() should be an AM-specific callback.
> */

I agree. There was some of that in an earlier version of the patch, but
the interface wasn't yet right. I think there's a lot such things that
just need to be added incrementally.

> @@ -1170,6 +1235,9 @@ table_delete(Relation rel, ItemPointer tid, CommandId cid,
> * update was done. However, any TOAST changes in the new tuple's
> * data are not reflected into *newtup.
> *
> + * HEIKKI: There is no 'newtup'.
> + * HEIKKI: HEAP_ONLY_TUPLE is AM-specific; do the callers peek into that, currently?

No, callers currently don't. The callback does, and sets
*update_indexes accordingly.

> - * A side effect is to set indexInfo->ii_BrokenHotChain to true if we detect
> + * A side effect is to set index_info->ii_BrokenHotChain to true if we detect
> * any potentially broken HOT chains. Currently, we set this if there are any
> * RECENTLY_DEAD or DELETE_IN_PROGRESS entries in a HOT chain, without trying
> * very hard to detect whether they're really incompatible with the chain tip.
> * This only really makes sense for heap AM, it might need to be generalized
> * for other AMs later.
> + *
> + * HEIKKI: What does 'allow_sync' do?

Heh, I'm going to be responsible for everything that was previously
undocumented, aren't I ;). I guess we should say something vague like
"When allow_sync is set to true, an AM may use scans synchronized with
other backends, if that makes sense. For some AMs that determines
whether tuples are going to be returned in TID order".
It's vague, but I'm not sure we can do better.

Thanks!

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2019-04-11 16:52:14 Re: setLastTid() and currtid()
Previous Message Pavel Stehule 2019-04-11 16:31:19 Re: [HACKERS][Proposal] LZ4 Compressed Storage Manager