Quick Links

Re: Pluggable Storage - Andres's take

From:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To:	Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc:	Andres Freund <andres(at)anarazel(dot)de>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Pluggable Storage - Andres's take
Date:	2018-11-27 05:58:35
Message-ID:	080ce65e-7b96-adbf-1c8c-7c88d87eaeda@lab.ntt.co.jp
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

On 2018/11/02 9:17, Haribabu Kommi wrote:
> Here I attached the cumulative fixes of the patches, new API additions for
> zheap and
> basic outline of the documentation.

I've read the documentation patch while also looking at the code and here
are some comments.

+ Each table is stored as its own physical
<firstterm>relation</firstterm> and so
+ is described by an entry in the <structname>pg_class</structname> catalog.

I think the "so" in "and so is described by an entry in..." is not necessary.

+ The contents of an table are entirely under the control of its access
method.

"a" table

+ (All the access methods furthermore use the standard page layout
described in
+ <xref linkend="storage-page-layout"/>.)

Maybe write the two sentences above as:

A table's content is entirely controlled by its access method, although
all access methods use the same standard page layout described in <xref
linkend="storage-page-layout"/>.

+ SlotCallbacks_function slot_callbacks;
+
+ SnapshotSatisfies_function snapshot_satisfies;
+ SnapshotSatisfiesUpdate_function snapshot_satisfiesUpdate;
+ SnapshotSatisfiesVacuum_function snapshot_satisfiesVacuum;

Like other functions, how about a one sentence comment for these, like:

/*
* Function to get an AM-specific set of functions for manipulating
* TupleTableSlots
*/
SlotCallbacks_function slot_callbacks;

/* AM-specific snapshot visibility determination functions */
SnapshotSatisfies_function snapshot_satisfies;
SnapshotSatisfiesUpdate_function snapshot_satisfiesUpdate;
SnapshotSatisfiesVacuum_function snapshot_satisfiesVacuum;

+ TupleFetchFollow_function tuple_fetch_follow;
+
+ GetTupleData_function get_tuple_data;

How about removing the empty line so that get_tuple_data can be seen as
part of the group /* Operations on physical tuples */

+ RelationVacuum_function relation_vacuum;
+ RelationScanAnalyzeNextBlock_function scan_analyze_next_block;
+ RelationScanAnalyzeNextTuple_function scan_analyze_next_tuple;
+ RelationCopyForCluster_function relation_copy_for_cluster;
+ RelationSync_function relation_sync;

Add /* Operations to support VACUUM/ANALYZE */ as a description for this
group?

+ BitmapPagescan_function scan_bitmap_pagescan;
+ BitmapPagescanNext_function scan_bitmap_pagescan_next;

Add /* Operations to support bitmap scans */ as a description for this group?

+ SampleScanNextBlock_function scan_sample_next_block;
+ SampleScanNextTuple_function scan_sample_next_tuple;

Add /* Operations to support sampling scans */ as a description for this
group?

+ ScanEnd_function scan_end;
+ ScanRescan_function scan_rescan;
+ ScanUpdateSnapshot_function scan_update_snapshot;

Move these two to be in the /* Operations on relation scans */ group?

+ BeginIndexFetchTable_function begin_index_fetch;
+ EndIndexFetchTable_function reset_index_fetch;
+ EndIndexFetchTable_function end_index_fetch;

Add /* Operations to support index scans */ as a description for this group?

+ IndexBuildRangeScan_function index_build_range_scan;
+ IndexValidateScan_function index_validate_scan;

Add /* Operations to support index build */ as a description for this group?

+ CreateInitFork_function CreateInitFork;

Add /* Function to create an init fork for unlogged tables */?

By the way, I can see the following two in the source code, but not in the
documentation.

EstimateRelSize_function EstimateRelSize;
SetNewFileNode_function SetNewFileNode;

+ The table construction and maintenance functions that an table access
+ method must provide in <structname>TableAmRoutine</structname> are:

"a" table access method

+ <para>
+<programlisting>
+TupleTableSlotOps *
+slot_callbacks (Relation relation);
+</programlisting>
+ API to access the slot specific methods;
+ Following methods are available;
+ <structname>TTSOpsVirtual</structname>,
+ <structname>TTSOpsHeapTuple</structname>,
+ <structname>TTSOpsMinimalTuple</structname>,
+ <structname>TTSOpsBufferTuple</structname>,
+ </para>

Unless I'm misunderstanding what the TupleTableSlotOps abstraction is or
its relations to the TableAmRoutine abstraction, I think the text
description could better be written as:

"API to get the slot operations struct for a given table access method"

It's not clear to me why various TTSOps* structs are listed here? Is the
point that different AMs may choose one of the listed alternatives? For
example, I see that heap AM implementation returns TTOpsBufferTuple, so it
manipulates slots containing buffered tuples, right? Other AMs are free
to return any one of these? For example, some AMs may never use buffer
manager and hence not use TTOpsBufferTuple. Is that understanding correct?

+ <para>
+<programlisting>
+bool
+snapshot_satisfies (TupleTableSlot *slot, Snapshot snapshot);
+</programlisting>
+ API to check whether the provided slot is visible to the current
+ transaction according the snapshot.
+ </para>

Do you mean:

"API to check whether the tuple contained in the provided slot is
visible...."?

+ <para>
+<programlisting>
+Oid
+tuple_insert (Relation rel, TupleTableSlot *slot, CommandId cid,
+ int options, BulkInsertState bistate);
+</programlisting>
+ API to insert the tuple and provide the <literal>ItemPointerData</literal>
+ where the tuple is successfully inserted.
+ </para>

It's not clear from the signature where you get the ItemPointerData.
Looking at heapam_tuple_insert which puts it in slot->tts_tid, I think
this should mention it a bit differently, like:

API to insert the tuple contained in the provided slot and return its TID,
that is, the location where the tuple is successfully inserted

+ API to insert the tuple with a speculative token. This API is similar
+ like <literal>tuple_insert</literal>, with additional speculative
+ information.

How about:

This API is similar to <literal>tuple_insert</literal>, although with
additional information necessary for speculative insertion

+ <para>
+<programlisting>
+void
+tuple_complete_speculative (Relation rel,
+ TupleTableSlot *slot,
+ uint32 specToken,
+ bool succeeded);
+</programlisting>
+ API to complete the state of the tuple inserted by the API
<literal>tuple_insert_speculative</literal>
+ with the successful completion of the index insert.
+ </para>

How about:

API to complete the speculative insertion of a tuple started by
<literal>tuple_insert_speculative</literal>, invoked after finishing the
index insert

+ <para>
+<programlisting>
+bool
+tuple_fetch_row_version (Relation relation,
+ ItemPointer tid,
+ Snapshot snapshot,
+ TupleTableSlot *slot,
+ Relation stats_relation);
+</programlisting>
+ API to fetch and store the Buffered Heap tuple in the provided slot
+ based on the ItemPointer.
+ </para>

It seems that this description is based on what heapam_fetch_row_version()
does, but it should be more generic, maybe like:

API to fetch a buffered tuple given its TID and store it in the provided slot

+ <para>
+<programlisting>
+HTSU_Result
+TupleLock_function (Relation relation,
+ ItemPointer tid,
+ Snapshot snapshot,
+ TupleTableSlot *slot,
+ CommandId cid,
+ LockTupleMode mode,
+ LockWaitPolicy wait_policy,
+ uint8 flags,
+ HeapUpdateFailureData *hufd);

I guess you meant to write "tuple_lock" here, not "TupleLock_function".

+</programlisting>
+ API to lock the specified the ItemPointer tuple and fetches the newest
version of
+ its tuple and TID.
+ </para>

How about:

API to lock the specified tuple and return the TID of its newest version

+ <para>
+<programlisting>
+void
+tuple_get_latest_tid (Relation relation,
+ Snapshot snapshot,
+ ItemPointer tid);
+</programlisting>
+ API to get the the latest TID of the tuple with the given itempointer.
+ </para>

How about:

API to get TID of the latest version of the specified tuple

+ <para>
+<programlisting>
+bool
+tuple_fetch_follow (struct IndexFetchTableData *scan,
+ ItemPointer tid,
+ Snapshot snapshot,
+ TupleTableSlot *slot,
+ bool *call_again, bool *all_dead);
+</programlisting>
+ API to get the all the tuples of the page that satisfies itempointer.
+ </para>

IIUC, "all the tuples of of the page" in the above sentence means all the
tuples in the HOT chain of a given heap tuple, making this description of
the API slightly specific to the heap AM. Can we make the description
more generic or is the API itself very specific that it cannot be
expressed in generic terms? Ignoring that for a moment, I think the
sentence contains more "the"s than there need to be, so maybe write as:

API to get all tuples on a given page that are linked to the tuple of the
given TID

+ <para>
+<programlisting>
+tuple_data
+get_tuple_data (TupleTableSlot *slot, tuple_data_flags flags);
+</programlisting>
+ API to return the internal structure members of the HeapTuple.
+ </para>

I think this description doesn't mention enough details of both the
information that needs to be specified when calling the function (what's
in flags) and the information that's returned.

+ <para>
+<programlisting>
+bool
+scan_analyze_next_tuple (TableScanDesc scan, TransactionId OldestXmin,
+ double *liverows, double *deadrows, TupleTableSlot
*slot));
+</programlisting>
+ API to analyze the block and fill the buffered heap tuple in the slot
and also
+ provide the live and dead rows.
+ </para>

How about:

API to get the next tuple from the block being scanned, which also updates
the number of live and dead rows encountered

+ <para>
+<programlisting>
+void
+relation_copy_for_cluster (Relation NewHeap, Relation OldHeap, Relation
OldIndex,
+ bool use_sort,
+ TransactionId OldestXmin, TransactionId FreezeXid,
MultiXactId MultiXactCutoff,
+ double *num_tuples, double *tups_vacuumed, double
*tups_recently_dead);
+</programlisting>
+ API to copy one relation to another relation eith using the Index or
table scan.
+ </para>

Typo: eith -> either

But maybe, rewrite this as:

API to make a copy of the content of a relation, optionally sorted using
either the specified index or by sorting explicitly

+ <para>
+<programlisting>
+TableScanDesc
+scan_begin (Relation relation,
+ Snapshot snapshot,
+ int nkeys, ScanKey key,
+ ParallelTableScanDesc parallel_scan,
+ bool allow_strat,
+ bool allow_sync,
+ bool allow_pagemode,
+ bool is_bitmapscan,
+ bool is_samplescan,
+ bool temp_snap);
+</programlisting>
+ API to start the relation scan for the provided relation and returns the
+ <structname>TableScanDesc</structname> structure.
+ </para>

How about:

API to start a scan of a relation using specified options, which returns
the <structname>TableScanDesc</structname> structure to be used for
subsequent scan operations

+ <para>
+<programlisting>
+void
+scansetlimits (TableScanDesc sscan, BlockNumber startBlk, BlockNumber
numBlks);
+</programlisting>
+ API to fix the relation scan range limits.
+ </para>

How about:

API to set scan range endpoints

+ <para>
+<programlisting>
+bool
+scan_bitmap_pagescan (TableScanDesc scan,
+ TBMIterateResult *tbmres);
+</programlisting>
+ API to scan the relation and fill the scan description bitmap with
valid item pointers
+ for the specified block.
+ </para>

This says "to scan the relation", but seems to be concerned with only a
page worth of data as the name also says. Also, it's not clear what "scan
description bitmap" means. Maybe write as:

API to scan the relation block specified in the scan descriptor to collect
and return the tuples requested by the given bitmap

+ <para>
+<programlisting>
+bool
+scan_bitmap_pagescan_next (TableScanDesc scan,
+ TupleTableSlot *slot);
+</programlisting>
+ API to fill the buffered heap tuple data from the bitmap scanned item
pointers and store
+ it in the provided slot.
+ </para>

How about:

API to select the next tuple from the set of tuples of a given page
specified in the scan descriptor and return in the provided slot; returns
false if no more tuples to return on the given page

+ <para>
+<programlisting>
+bool
+scan_sample_next_block (TableScanDesc scan, struct SampleScanState
*scanstate);
+</programlisting>
+ API to scan the relation and fill the scan description bitmap with
valid item pointers
+ for the specified block provided by the sample method.
+ </para>

Looking at the code, this API selects the next block using the sampling
method and nothing more, although I see that the heap AM implementation
also does heapgetpage thus collecting live tuples in the array known only
to heap AM. So, how about:

API to select the next block of the relation using the given sampling
method and set its information in the scan descriptor

+ <para>
+<programlisting>
+bool
+scan_sample_next_tuple (TableScanDesc scan, struct SampleScanState
*scanstate, TupleTableSlot *slot);
+</programlisting>
+ API to fill the buffered heap tuple data from the bitmap scanned item
pointers based on the sample
+ method and store it in the provided slot.
+ </para>

How about:

API to select the next tuple using the given sampling method from the set
of tuples collected from the block previously selected by the sampling method

+ <para>
+<programlisting>
+void
+scan_rescan (TableScanDesc scan, ScanKey key, bool set_params,
+ bool allow_strat, bool allow_sync, bool allow_pagemode);
+</programlisting>
+ API to restart the relation scan with provided data.
+ </para>

How about:

API to restart the given scan using provided options, releasing any
resources (such as buffer pins) already held by the scan

+ <para>
+<programlisting>
+void
+scan_update_snapshot (TableScanDesc scan, Snapshot snapshot);
+</programlisting>
+ API to update the relation scan with the new snapshot.
+ </para>

How about:

API to set the visibility snapshot to be used by a given scan

+ <para>
+<programlisting>
+IndexFetchTableData *
+begin_index_fetch (Relation relation);
+</programlisting>
+ API to prepare the <structname>IndexFetchTableData</structname> for
the relation.
+ </para>

This API is a bit vague. As in, it's not clear from the name when it's to
be called and what's be to be done with the returned struct. How about at
least adding more details about what the returned struct is for, like:

API to get the <structname>IndexFetchTableData</structname> to be assigned
to an index scan on the specified relation

+ <para>
+<programlisting>
+void
+reset_index_fetch (struct IndexFetchTableData* data);
+</programlisting>
+ API to reset the prepared internal members of the
<structname>IndexFetchTableData</structname>.
+ </para>

This description seems wrong if I look at the code. Its purpose seems to
be reset the AM-specific members, such as releasing the buffer pin held in
xs_cbuf in the heap AM's case.

How about:

API to release AM-specific resources held by the
<structname>IndexFetchTableData</structname> of a given index scan

+ <para>
+<programlisting>
+void
+end_index_fetch (struct IndexFetchTableData* data);
+</programlisting>
+ API to clear and free the <structname>IndexFetchTableData</structname>.
+ </para>

Given above, how about:

API to release AM-specific resources held by the
<structname>IndexFetchTableData</structname> of a given index scan and
free the memory of <structname>IndexFetchTableData</structname> itself

+ <para>
+<programlisting>
+double
+index_build_range_scan (Relation heapRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ bool allow_sync,
+ bool anyvisible,
+ BlockNumber start_blockno,
+ BlockNumber end_blockno,
+ IndexBuildCallback callback,
+ void *callback_state,
+ TableScanDesc scan);
+</programlisting>
+ API to perform the table scan with bounded range specified by the caller
+ and insert the satisfied records into the index using the provided
callback
+ function pointer.
+ </para>

This is a bit heavy API and the above description lacks some details.
Also, isn't it a bit misleading to use the name end_blockno if it is
interpreted as num_blocks by the internal APIs?

How about:

API to scan the specified blocks of the given table and insert them into
the specified index using the provided callback function

+ <para>
+<programlisting>
+void
+index_validate_scan (Relation heapRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ Snapshot snapshot,
+ struct ValidateIndexState *state);
+</programlisting>
+ API to perform the table scan and insert the satisfied records into
the index.
+ This API is similar like <function>index_build_range_scan</function>.
This
+ is used in the scenario of concurrent index build.
+ </para>

This one's a complicated API too. How about:

API to scan the table according to the given snapshot and insert tuples
satisfying the snapshot into the specified index, provided their TIDs are
also present in the <structname>ValidateIndexState</structname> struct;
this API is used as the last phase of a concurrent index build

+ <sect2>
+ <title>Table scanning</title>
+
+ <para>
+ </para>
+ </sect2>
+
+ <sect2>
+ <title>Table insert/update/delete</title>
+
+ <para>
+ </para>
+ </sect2>
+
+ <sect2>
+ <title>Table locking</title>
+
+ <para>
+ </para>
+ </sect2>
+
+ <sect2>
+ <title>Table vacuum</title>
+
+ <para>
+ </para>
+ </sect2>
+
+ <sect2>
+ <title>Table fetch</title>
+
+ <para>
+ </para>
+ </sect2>

Seems like you forgot to put the individual API descriptions under these
sub-headers. Actually, I think it'd be better to try to format this page
to looks more like the following:

https://www.postgresql.org/docs/devel/fdw-callbacks.html

- Currently, only indexes have access methods. The requirements for index
- access methods are discussed in detail in <xref linkend="indexam"/>.
+ Currently, only <literal>INDEX</literal> and <literal>TABLE</literal> have
+ access methods. The requirements for access methods are discussed in
detail
+ in <xref linkend="am"/>.

Hmm, I don't see why you decided to add literal tags to INDEX and TABLE.
Couldn't this have been written as:

Currently, only tables and indexes have access methods. The requirements
for access methods are discussed in detail in <xref linkend="am"/>.

+ This variable specifies the default table access method using
which to create
+ objects (tables and materialized views) when a
<command>CREATE</command> command does
+ not explicitly specify a access method.

"variable" is not wrong, but "parameter" is used more often for GUCs. "a
access method" should be "an access method".

Maybe you could write this as:

This variable specifies the default table access method to use when
creating tables or materialized views if the <command>CREATE</command>
does not explicitly specify an access method.

+ If the value does not match the name of any existing table access
methods,
+ <productname>PostgreSQL</productname> will automatically use the
default
+ table access method of the current database.

any existing table methods -> any existing table method

Although, shouldn't that cause an error instead of ignoring the error and
use the database default access method instead?

Thank you for working on this. Really looking forward to how this shapes
up. :)

Thanks,
Amit

In response to

Re: Pluggable Storage - Andres's take at 2018-11-02 00:17:29 from Haribabu Kommi

Responses

Re: Pluggable Storage - Andres's take at 2018-12-10 16:14:34 from Dmitry Dolgov
Re: Pluggable Storage - Andres's take at 2018-12-13 05:59:00 from Kyotaro HORIGUCHI
Re: Pluggable Storage - Andres's take at 2019-02-20 03:44:21 from Haribabu Kommi

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Amit Langote	2018-11-27 06:00:20	Re: Inadequate executor locking of indexes
Previous Message	Laurenz Albe	2018-11-27 05:33:42	Re: pg_config wrongly marked as not parallel safe?