|From:||Andres Freund <andres(at)anarazel(dot)de>|
|To:||pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>|
|Cc:||Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>|
|Views:||Raw Message | Whole Thread | Download mbox | Resend email|
I've recently been discussing with Robert how to abstract
TupleTableSlots in the light of upcoming developments around abstracting
storage away. Besides that aspect I think we also need to make them
more abstract to later deal with vectorized excution, but I'm more fuzzy
on the details there.
I'd previously, in a response to an early version of the pluggable
storage patch, commented  how I think the abstraction should roughly
look like. I'd privately started to prototype that. After my discussion
with Robert I got that prototype in a state that can run many queries,
but doesn't pass the regression tests.
The pluggable storage patch has also been updated  to abstract slots
more (see patch 0005).
My position is that there's a number of weaknesses with the current
TupleTableSlot API in light of multiple independent, possibly out of
core, storage implementations:
- TupleTableSlots have to contain all the necessary information for each
type of slot. Some implementations might require a buffer pin to be
hold (our heap), others pointers to multiple underlying points of data
(columns store), some need additional metadata (our MinimalTuple).
Unifying the information into one struct makes it much harder to
provide differing implementations without modifying core postgres
and/or increasing overhead for every user of slots.
I think the consequence of that is that we need to allow each type of
slot to have their own TupleTableSlot embedding struct.
Haribabu's patch solved this by adding a tts_storage parameter that
contains additional data, but imo that's unconvincing due to the added
indirection in very performance critical codepaths.
- The way slots currently are implemented each new variant data is
stored in slots adds new branches to hot code paths like
ExecClearTuple(), and they're not extensible.
Therefore I think we need to move to callback based implementation. In
my tests, by moving the callback invocations into very thin inline
functions, the branch prediction accuracy actually goes sligthly
- Currently we frequently convert from one way of representing a row
into another, in the same slot. We do so fairly implicilty in
ExecMaterializeSlot(), ExecFetchSlotMinimalTuple(), ExecCopySlot()...
The more we abstract specific storage representation away, the worse I
think that is. I think we should make such conversions explicit by
requiring transfers between two slots if a specific type is required.
- We have a few codepaths that set fields on tuples that can't be
represented in virtual slots. E.g. postgres_fdw sets ctid, oid,
ExecProcessReturning() will set tableoid.
In my POC I turned TupleTableSlot into basically an abstract base class:
uint16 nvalid; /* # of valid values in tts_values */
const TupleTableSlotOps *const cb;
TupleDesc tupleDescriptor; /* slot's tuple descriptor */
Datum *values; /* current per-attribute values */
bool *nulls; /* current per-attribute null flags */
/* can we optimize away? */
MemoryContext mcxt; /* slot itself is in this context */
which then is inherited from for specific implementations of tuple
typedef struct HeapTupleTableSlot
HeapTuple tuple; /* physical tuple */
uint32 off; /* saved state for slot_deform_tuple */
typedef struct MinimalTupleTableSlot
HeapTuple tuple; /* tuple wrapper */
MinimalTuple mintuple; /* minimal tuple, or NULL if none */
HeapTupleData minhdr; /* workspace for minimal-tuple-only case */
uint32 off; /* saved state for slot_deform_tuple */
typedef struct TupleTableSlotOps
void (*init)(TupleTableSlot *slot);
void (*release)(TupleTableSlot *slot);
void (*clear)(TupleTableSlot *slot);
void (*getsomeattrs)(TupleTableSlot *slot, int natts);
void (*materialize)(TupleTableSlot *slot);
HeapTuple (*get_heap_tuple)(TupleTableSlot *slot);
MinimalTuple (*get_minimal_tuple)(TupleTableSlot *slot);
HeapTuple (*copy_heap_tuple)(TupleTableSlot *slot);
MinimalTuple (*copy_minimal_tuple)(TupleTableSlot *slot);
when creating a slot one has to specify the type of slot one wants to
typedef enum TupleTableSlotType
extern TupleTableSlot *MakeTupleTableSlot(TupleDesc desc, TupleTableSlotType st);
You might notice that I've renamed a few fields (tts_values -> values,
tts_isnull -> nulls, tts_nvalid -> nvalid) that haven't actually changed
in the above structs / the attached patch. I now think that's probably
not a good idea, unnecessarily breaking more code than necessary.
I generally like this scheme because it allows the TupleTableStruct to
be relatively lean, without knowing much about specific implementations
After this change functions like slot_getattr are thin inline wrappers:
+static inline Datum
+slot_getattr(TupleTableSlot *slot, int attnum, bool *isnull)
+ Assert(attnum > 0);
+ if (attnum > slot->nvalid)
+ slot->cb->getsomeattrs(slot, attnum);
+ Assert(attnum <= slot->nvalid);
+ *isnull = slot->nulls[attnum - 1];
+ return slot->values[attnum - 1];
Note that I've moved system attribute handling out of the slot_getattr
path - there were a few paths accessing them via slot_getattr, and it's
a lot of unnecessary branching.
The fact that slots of different types have different storage means that
we can't just change the type of a slot when needed. In nearly all
places that's not a problem. E.g. ExecHashTableInsert() converts the
"incoming" slot to one containing a minimal tuple with
ExecFetchSlotMinimalTuple(), but that's essentially transient as it's
just used to copy the tuple into the hashtable. We could even gain
quite some efficiency by directly copying into the allocated space
(saving one serialization/copy step for the common case where input is
either a virtual or a heap tuple).
The slightly bigger issue is that several parts of the code currently
require heaptuples they can scribble upon. E.g. nodeModifyTable.c
materializes the tuple - those can be solved by using a local slot to
store the tuple, rather than using the incoming one. In nearly all cases
we'd still be left with the same number of tuple copy steps, as
materializing the tuple with copy into the local storage anyway. A few
other cases, e.g. intorel_receive(), just can use ExecCopySlotTuple() or
such instead of materialize.
The biggest issue I don't have a good answer for, and where I don't like
Haribabu's solution, is how to deal with system columns like oid, ctid,
tableoid. Haribabu's patch adds storage of those to the slot, but that
seems like quite the cludge to me. It appears to me that the best thing
would be to require projections for all accesses to system columns, at
the original level of access. Then we don't need any sort of special
codepaths dealing with them. But that's not exactly a trivial change and
has some added complications too (junkfilter type things).
I made a reference to vectorized execution above. That's because that I
found, while prototyping vectorized execution, that it's a very common
need to interact with parts of the system that aren't vectorized, and
doing so has to be extremely cheap. With the above we can have a
TupleTableSlot type that's essentially just a cheap cursor into a batch
of tuples. Besides making it efficiently possible to hand of slots to
part of the system that can't deal with tuple batches, it allows to do
fun things like having the first slot_deform_tuple() call for one tuple
in a batch to deform a full page's worth of tuples, yielding *much*
better cache access patterns.
I haven't done that, but I think we should split ExecStoreTuple() into
multiple versions, that only work on specific types of tuple
slots. E.g. seqscan et al would call ExecStoreBufferHeapTuple(), other
code dealing with tuples would call ExecStoreHeapTuple(). The relevant
callers already need to know what kind of tuple they're dealing with,
therefore that's not a huge burden.
Please please note that the attached patch is *NOT* meant to be a full
proposal or even a to be reviewed patch, it's just an illustration of
the concepts I'm talking about.
Comments? Better ideas?
|Next Message||Peter Geoghegan||2018-02-20 23:06:50||Re: Hash Joins vs. Bloom Filters / take 2|
|Previous Message||Tom Lane||2018-02-20 22:09:48||Re: SHA-2 functions|