Re: Pluggable storage

From: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
To: Alvaro Herrera <alvherre(at)2ndQuadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Pluggable storage
Date: 2016-08-17 13:01:20
Message-ID: be2980a3-1a46-4076-240a-915dff5ac39b@postgrespro.ru
Lists: pgsql-hackers

13.08.2016 02:15, Alvaro Herrera:
> Many have expressed their interest in this topic, but I haven't seen any
> design of how it should work. Here's my attempt; I've been playing with
> this for some time now and I think what I propose here is a good initial
> plan. This will allow us to write permanent table storage that works
> differently than heapam.c. At this stage, I haven't thought through
> whether this is going to allow extensions to define new storage modules;
> I am focusing on AMs that can coexist with heapam in core.
>
> The design starts with a new row type in pg_am, of type "s" (for "storage").
> The handler function returns a struct of node StorageAmRoutine. This
> contains functions for 1) scans (beginscan, getnext, endscan) 2) tuples
> (tuple_insert/update/delete/lock, as well as set_oid, get_xmin and the
> like), and operations on tuples that are part of slots (tuple_deform,
> materialize).
>
> To support this, we introduce StorageTuple and StorageScanDesc.
> StorageTuples represent a physical tuple coming from some storage AM.
> It is necessary to have a pointer to a StorageAmRoutine in order to
> manipulate the tuple. For heapam.c, a StorageTuple is just a HeapTuple.

The StorageTuple concept looks really cool. I have some questions about
the implementation details.

Do StorageTuples have fields common to all implementations?
Or is StorageTuple a totally abstract structure that has nothing to do
with the data, except pointing to it?

I mean, we already have the HeapTupleData structure, which is a pretty
good candidate to be replaced by StorageTuple.
It's already widely used in the executor and, moreover, it's the only
structure (apart from MinimalTuples and all those crazy optimizations)
that works with tuples, whether extracted from a page or created
on the fly in an executor node.

typedef struct HeapTupleData
{
    uint32          t_len;      /* length of *t_data */
    ItemPointerData t_self;     /* SelfItemPointer */
    Oid             t_tableOid; /* table the tuple came from */
    HeapTupleHeader t_data;     /* -> tuple header and data */
} HeapTupleData;

We can simply change the type of t_data from HeapTupleHeader to Pointer,
and maybe add a "t_handler" field that points to the handler functions.
I'm not sure whether it should be the name of the StorageAm, its OID, or
maybe the handler function itself. Although, if I'm not mistaken, we
always have a RelationData at hand when we want to operate on a tuple,
so having t_handler in the StorageTuple would be redundant.

typedef struct StorageTupleData
{
    uint32          t_len;      /* length of *t_data */
    ItemPointerData t_self;     /* SelfItemPointer */
    Oid             t_tableOid; /* table the tuple came from */
    Pointer         t_data;     /* -> tuple header and data.
                                 * This field should never be accessed
                                 * directly, only via StorageAm handler
                                 * functions, because we don't know the
                                 * underlying data structure. */
    ???             t_handler;  /* StorageAm that knows what to do
                                 * with the tuple */
} StorageTupleData;

This approach minimizes code changes and ensures that we
won't miss any function that handles tuples.
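
To illustrate the "never access t_data directly" rule, here is a minimal
standalone sketch. It is only an assumption of how this could look, not
code from any patch: the toy AM, ToyTupleHeader, storage_tuple_get_xmin()
and the choice of a routine pointer for t_handler are all hypothetical,
the PostgreSQL types are replaced with simple stand-ins, and t_self and
t_tableOid are omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for PostgreSQL types, so the sketch compiles alone. */
typedef uint32_t TransactionId;
typedef char *Pointer;

struct StorageTupleData;

/* Assumed callback table; get_xmin is one of the methods named upthread. */
typedef struct StorageAmRoutine
{
    TransactionId (*get_xmin) (const struct StorageTupleData *tuple);
} StorageAmRoutine;

typedef struct StorageTupleData
{
    uint32_t    t_len;      /* length of *t_data */
    Pointer     t_data;     /* opaque; only the AM may interpret it */
    const StorageAmRoutine *t_handler;  /* one possible choice, an assumption */
} StorageTupleData;

/* A toy AM whose private tuple format starts with the xmin. */
typedef struct ToyTupleHeader
{
    TransactionId xmin;
} ToyTupleHeader;

static TransactionId
toy_get_xmin(const StorageTupleData *tuple)
{
    assert(tuple->t_len >= sizeof(ToyTupleHeader));
    return ((const ToyTupleHeader *) tuple->t_data)->xmin;
}

static const StorageAmRoutine toy_am = { toy_get_xmin };

/* Generic code never casts t_data; it dispatches through the handler. */
static TransactionId
storage_tuple_get_xmin(const StorageTupleData *tuple)
{
    assert(tuple->t_handler != NULL);
    return tuple->t_handler->get_xmin(tuple);
}
```

If the routine is always reachable through RelationData instead, the
t_handler field disappears and storage_tuple_get_xmin() would take the
relation as an extra argument.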

Do you see any weak points in this suggestion?
What design do you use in your prototype?

> RelationData gains ->rd_stamroutine which is a pointer to the
> StorageAmRoutine for the relation in question. Similarly,
> TupleTableSlot is augmented with a link to the StorageAmRoutine to
> handle the StorageTuple it contains (probably in most cases it's set at
> the same time as the tupdesc). This implies that routines such as
> ExecAssignScanType need to pass down the StorageAmRoutine from the
> relation to the slot.

If we already have this pointer in t_handler, as described above,
we don't need to pass it between functions and slots.

> The executor is modified so that instead of calling heap_insert etc
> directly, it uses rel->rd_stamroutine to call these methods. The
> executor is still in charge of dealing with indexes, constraints, and
> any other thing that's not the tuple storage itself (this is one major
> point in which this differs from FDWs). This all looks simple enough,
> with one exception and a few notes:

That is exactly what I tried to describe in the "Relation management"
chapter of my proposal. I'm sure you've already noticed that it will
require a major cleanup of the source code. I've read the sources
carefully and found "violators" of the abstraction in
src/backend/commands. The list is attached to the wiki page
https://wiki.postgresql.org/wiki/HeapamRefactoring.

Apart from these, there are some rather strange and unrelated functions
in src/backend/catalog.
I'm willing to fix them, but I'd like to synchronize our efforts.
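
For reference, this is roughly what I imagine a cleaned-up call site to
look like once it goes through rel->rd_stamroutine instead of calling
heap_insert directly. It is a standalone sketch under assumptions: the
types are simplified stand-ins (the real ItemPointerData has a different
layout), and toy_tuple_insert() and exec_insert() are hypothetical names.
It also returns the TID through a separate output argument, as proposed
upthread, rather than scribbling on the tuple.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in; the real ItemPointerData is laid out differently. */
typedef struct ItemPointerData
{
    uint32_t    block;
    uint16_t    offset;
} ItemPointerData;

typedef struct RelationData RelationData;

typedef struct StorageAmRoutine
{
    /* Writes the new tuple's TID to *tid; the input tuple is left alone. */
    void        (*tuple_insert) (RelationData *rel, const void *tuple,
                                 ItemPointerData *tid);
} StorageAmRoutine;

struct RelationData
{
    const StorageAmRoutine *rd_stamroutine;
    uint16_t    ntuples;    /* toy state: tuples stored so far */
};

/* Toy AM: everything lives in block 0, offsets count up from 1. */
static void
toy_tuple_insert(RelationData *rel, const void *tuple, ItemPointerData *tid)
{
    (void) tuple;           /* a real AM would copy the tuple into a page */
    tid->block = 0;
    tid->offset = ++rel->ntuples;
}

static const StorageAmRoutine toy_am = { toy_tuple_insert };

/* What an executor call site might look like after the cleanup. */
static ItemPointerData
exec_insert(RelationData *rel, const void *tuple)
{
    ItemPointerData tid;

    assert(rel->rd_stamroutine != NULL);
    rel->rd_stamroutine->tuple_insert(rel, tuple, &tid);
    /* index and constraint maintenance would follow here, using tid */
    return tid;
}
```

Every "violator" on the wiki list would need to be converted to this kind
of indirect call, or explicitly kept heap-only.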

> exception a) ExecMaterializeSlot needs special consideration. This is
> used in two different ways: a1) is the stated "make tuple independent
> from any underlying storage" point, which is handled by
> ExecMaterializeSlot itself and calling a method from the storage AM to
> do any byte copying as needed. ExecMaterializeSlot no longer returns a
> HeapTuple, because there might not be any. The second usage pattern a2)
> is to create a HeapTuple that's passed to other modules which only deal
> with HT and not slots (triggers are the main case I noticed, but I think
> there are others such as the executor itself wanting tuples as Datum for
> some reason). For the moment I'm handling this by having a new
> ExecHeapifyTuple which creates a HeapTuple from a slot, regardless of
> the original tuple format.

Yes, triggers are a very special case. Thank you for the explanation.

That still fits well with the tuple format I suggested: nothing special
is needed, we just substitute t_data with a proper HeapTupleHeader
representation. I think that's a job for the StorageAm. Let's say each
StorageAm must provide a stam_to_heaptuple() function and the opposite
function, stam_from_heaptuple().
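
A minimal sketch of that conversion pair, assuming a toy AM that packs a
two-column row into one 64-bit word. Only the names stam_to_heaptuple and
stam_from_heaptuple come from the text above; the signatures, the
ToyHeapTuple and ToyStorageTuple types, and the heapify() helper are
hypothetical and stand in for the real HeapTuple machinery.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy "heap" representation: a fixed two-column row. */
typedef struct ToyHeapTuple
{
    uint32_t    a;
    uint32_t    b;
} ToyHeapTuple;

/* Toy AM-private representation: both columns packed into one word. */
typedef struct ToyStorageTuple
{
    uint64_t    packed;
} ToyStorageTuple;

/* Assumed shape of the two conversion callbacks. */
typedef struct StorageAmRoutine
{
    ToyHeapTuple (*stam_to_heaptuple) (const ToyStorageTuple *st);
    ToyStorageTuple (*stam_from_heaptuple) (const ToyHeapTuple *ht);
} StorageAmRoutine;

static ToyHeapTuple
toy_to_heaptuple(const ToyStorageTuple *st)
{
    ToyHeapTuple ht;

    ht.a = (uint32_t) (st->packed >> 32);
    ht.b = (uint32_t) (st->packed & 0xFFFFFFFFu);
    return ht;
}

static ToyStorageTuple
toy_from_heaptuple(const ToyHeapTuple *ht)
{
    ToyStorageTuple st;

    st.packed = ((uint64_t) ht->a << 32) | ht->b;
    return st;
}

static const StorageAmRoutine toy_am = { toy_to_heaptuple, toy_from_heaptuple };

/* What heap-only code (e.g. triggers) might call to get heap format. */
static ToyHeapTuple
heapify(const StorageAmRoutine *am, const ToyStorageTuple *st)
{
    assert(am != NULL && am->stam_to_heaptuple != NULL);
    return am->stam_to_heaptuple(st);
}
```

For heapam itself both callbacks would presumably be trivial, since its
StorageTuple is already a HeapTuple.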

> note b) EvalPlanQual currently maintains an array of HeapTuple in
> EState->es_epqTuple. I think it works to replace that with an array of
> StorageTuples; EvalPlanQualFetch needs to call the StorageAmRoutine
> methods in order to interact with it. Other than those changes, it
> seems okay.
>
> note c) nodeSubplan has curTuple as a HeapTuple. It seems simple
> to replace this with an independent slot-based tuple.
>
> note d) grp_firstTuple in nodeAgg / nodeSetOp. These are less
> simple than the above, but replacing the HeapTuple with a slot-based
> tuple seems doable too.
>
> note e) nodeLockRows uses lr_curtuples to feed EvalPlanQual.
> TupleTableSlot also seems a good replacement. This has fallout in other
> users of EvalPlanQual, too.
>
> note f) More widespread, MinimalTuples currently use a tweaked HeapTuple
> format. In the long run, it may be possible to replace them with a
> separate storage module that's specifically designed to handle tuples
> meant for tuplestores etc. That may simplify TupleTableSlot and
> execTuples. For the moment we keep the tts_mintuple as it is. Whenever
> a tuple is not already in heap format, we heapify it in order to put in
> the store.

I wonder, do we really need MinimalTuples to support all formats?

> The current heapam.c routines need some changes. Currently, practice is
> that heap_insert, heap_multi_insert, heap_fetch, heap_update scribble on
> their input tuples to set the resulting ItemPointer in tuple->t_self.
> This is messy if we want StorageTuples to be abstract. I'm changing
> this so that the resulting ItemPointer is returned in a separate output
> argument; the tuple itself is left alone. This is somewhat messy in the
> case of heap_multi_insert because it returns several items; I think it's
> acceptable to return an array of ItemPointers in the same order as the
> input tuples. This works fine for the only caller, which is COPY in
> batch mode. For the other routines, they don't really care where the
> TID is returned AFAICS.
>
>
> Additional noteworthy items:
>
> i) Speculative insertion: the speculative insertion token is no longer
> installed directly in the heap tuple by the executor (of course).
> Instead, the token becomes part of the slot. When the tuple_insert
> method is called, the insertion routine is in charge of setting the
> token from the slot into the storage tuple. Executor is in charge of
> calling method->speculative_finish() / abort() once the insertion has
> been confirmed by the indexes.
>
> ii) execTuples has additional accessors for tuples-in-slot, such as
> ExecFetchSlotTuple and friends. I expect to have some of them to return
> abstract StorageTuples, others HeapTuple or MinimalTuples (possibly
> wrapped in Datum), depending on callers. We might be able to cut down
> on these later; my first cut will try to avoid API changes to keep
> fallout to a minimum.

I'd suggest replacing all occurrences of HeapTuple with StorageTuple.
Do you see any problems with that?

> iii) All tuples need to be identifiable by ItemPointers. Storages that
> have different requirements will need careful additional thought across
> the board.

For a start, we can simply disallow secondary indexes on such storages,
or require a function that converts the storage's internal tuple
identifier into an ItemPointer suitable for an index.
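
As a rough illustration of such a conversion function, assume a storage
that identifies tuples by a 64-bit row number. Everything here is
hypothetical: ToyItemPointer is a simplified stand-in for ItemPointerData,
and TOY_OFFSETS_PER_BLOCK is an arbitrary value chosen for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for ItemPointerData. */
typedef struct ToyItemPointer
{
    uint32_t    block;
    uint16_t    offset;     /* 0 is reserved as "invalid" */
} ToyItemPointer;

#define TOY_OFFSETS_PER_BLOCK 1000u     /* arbitrary value for the sketch */

/* Map a storage-internal row number to a synthetic TID for indexes. */
static bool
toy_rowid_to_tid(uint64_t rowid, ToyItemPointer *tid)
{
    uint64_t    block = rowid / TOY_OFFSETS_PER_BLOCK;

    assert(tid != NULL);
    if (block > UINT32_MAX)
        return false;       /* identifier does not fit into a TID */
    tid->block = (uint32_t) block;
    tid->offset = (uint16_t) (rowid % TOY_OFFSETS_PER_BLOCK + 1);
    return true;
}

/* Inverse mapping, used when an index scan hands a TID back to the AM. */
static uint64_t
toy_tid_to_rowid(const ToyItemPointer *tid)
{
    assert(tid->offset >= 1 && tid->offset <= TOY_OFFSETS_PER_BLOCK);
    return (uint64_t) tid->block * TOY_OFFSETS_PER_BLOCK + (tid->offset - 1);
}
```

A storage whose identifiers cannot be squeezed into a TID this way would
fall into the first option, with secondary indexes disallowed.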

> iv) System catalogs cannot use pluggable storage. We continue to use
> heap_open etc in the DDL code, in order not to make this more invasive
> than it already is. We may lift this restriction later for specific
> catalogs, as needed.

+1
>
> v) Currently, one Buffer may be associated with one HeapTuple living in a
> slot; when the slot is cleared, the buffer pin is released. My current
> patch moves the buffer pin to inside the heapam-based storage AM and the
> buffer is released by the ->slot_clear_tuple method. The rationale for
> doing this is that some storage AMs might want to keep several buffers
> pinned at once, for example, and must not release those pins
> individually but in batches as the scan moves forwards (say a batch of
> tuples in a columnar storage AM has column values spread across many
> buffers; they must all be kept pinned until the scan has moved past the
> whole set of tuples). But I'm not really sure that this is a great
> design.

Frankly, I doubt that it's realistic to implement columnar storage just
as a variant of pluggable storage. It requires a lot of changes in the
executor, the optimizer and so on, which are hardly compatible with the
existing tuple-oriented model. However, I'm not an expert in this area,
so if you feel that it's possible, go ahead.

> I welcome comments on these ideas. My patch for this is nowhere near
> completion yet; expect things to change for items that I've overlooked,
> but I hope I didn't overlook anything major. If things are handwavy, it is
> probably because I haven't fully figured them out yet.

Thank you again for starting this big project.
Looking forward to the prototype. I think it will make the discussion
more concrete and useful.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
