Quick Links

Re: Pluggable Storage - Andres's take

From:	Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Pluggable Storage - Andres's take
Date:	2018-07-05 12:25:25
Message-ID:	CAPpHfdvedH1bTicj24J7iGb8sHdrr62+DVPic2kJLasFDfk-=Q@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi!

On Tue, Jul 3, 2018 at 10:06 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> As I've previously mentioned I had planned to spend some time to polish
> Haribabu's version of the pluggable storage patch and rebase it on the
> vtable based slot approach from [1]. While doing so I found more and
> more things that I previously hadn't noticed. I started rewriting things
> into something closer to what I think we want architecturally.

Great, thank you for working on this patchset!

> The current state of my version of the patch is *NOT* ready for proper
> review (it doesn't even pass all tests, there's FIXME / elog()s). But I
> think it's getting close enough to it's eventual shape that more eyes,
> and potentially more hands on keyboards, can be useful.
>
> The most fundamental issues I had with Haribabu's last version from [2]
> are the following:
>
> - The use of TableTuple, a typedef from void *, is bad from multiple
> fronts. For one it reduces just about all type safety. There were
> numerous bugs in the patch where things were just cast from HeapTuple
> to TableTuple to HeapTuple (and even to TupleTableSlot). I think it's
> a really, really bad idea to introduce a vague type like this for
> development purposes alone, it makes it way too hard to refactor -
> essentially throwing the biggest benefit of type safe languages out of
> the window.
>
> Additionally I think it's also the wrong approach architecturally. We
> shouldn't assume that a tuple can efficiently be represented as a
> single palloc'ed chunk. In fact, we should move *away* from relying on
> that so much.
>
> I've thus removed the TableTuple type entirely.

+1, TableTuple was vague concept.

> - Previous verions of the patchset exposed Buffers in the tableam.h API,
> performed buffer locking / pinning / ExecStoreTuple() calls outside of
> it. That is wrong in my opinion, as various AMs will deal very
> differently with buffer pinning & locking. The relevant logic is
> largely moved within the AM. Bringing me to the next point:
>
>
> - tableam exposed various operations based on HeapTuple/TableTuple's
> (and their Buffers). This all need be slot based, as we can't
> represent the way each AM will deal with this. I've largely converted
> the API to be slot based. That has some fallout, but I think largely
> works. Lots of outdated comments.

Makes sense for me. I like passing TupleTableSlot to tableam api
function much more.

> - I think the move of the indexing from outside the table layer into the
> storage layer isn't a good idea. It lead to having to pass EState into
> the tableam, a callback API to perform index updates, etc. This seems
> to have at least partially been triggered by the speculative insertion
> codepaths. I've reverted this part of the changes. The speculative
> insertion / confirm codepaths are now exposed to tableam.h - I think
> that's the right thing because we'll likely want to have that
> functionality across more than a single tuple in the future.

I agree that passing EState into tableam doesn't look good. But I
believe that tableam needs way more control over indexes than it has
in your version patch. If even tableam-independent insertion into
indexes on tuple insert is more or less ok, but on update we need
something smarter rather than just insert index tuples depending
"update_indexes" flag. Tableam specific update method may decide to
update only some of indexes. For example, when zheap performs update
in-place then it inserts to only indexes, whose fields were updated.
And I think any undo-log based storage would have similar behavior.
Moreover, it might be required to do something with existing index
tuples (for instance, as I know, zheap sets "deleted" flag to index
tuples related to previous values of updated fields).

If we would like to move indexing outside of tableam, then we might
turn "update_indexes" from bool to more enum with values like: "don't
insert index tuples", "insert all index tuples", "insert index tuples
only for updated fields" and so on. But that looks more like set of
hardcoded cases for particular implementations, than proper API. So,
probably we shouldn't move indexing outside of tableam, but rather
provide better wrappers for doing that in tableam?

> - The visibility functions relied on the *caller* performing buffer
> locking. That's not a great idea, because generic code shouldn't know
> about the locking scheme a particular AM needs. I've changed the
> external visibility functions to instead take a slot, and perform the
> necessary locking inside.

Makes sense for me. But would it cause extra locking/unlocking and in
turn performance impact?

> - There were numerous tableam callback uses inside heapam.c - that makes
> no sense, we know what the storage is therein. The relevant

Ok.

> - The integration between index lookups and heap lookups based on the
> results on a index lookup was IMO too tight. The index code dealt
> with heap tuples, which isn't great. I've introduced a new concept, a
> 'IndexFetchTableData' scan. It's initialized when building an index
> scan, and provides the necessary state (say current heap buffer), to
> do table lookups from within a heap.

> - The am of relations required for bootstrapping was set to 0 - I don't
> think that's a good idea. I changed it so it's set to the heap AM as
> well.

> - HOT was encoded in the API in a bunch of places. That doesn't look
> right to me. I tried to improve a bit on that, but I'm not yet quite
> sure I like it. Needs written explanation & arguments...

Yes, HOT is heapam specific feature. Other tableams might not have
HOT. But it appears that we still expose hot_search_buffer() function
in tableam API. But that function have no usage, so it's just
redundant and can be removed.

> - the heap tableam did a heap_copytuple() nearly everywhere. Leading to
> a higher memory usage, because the resulting tuples weren't freed or
> anything. There might be a reason for doing such a change - we've
> certainly discussed that before - but I'm *vehemently* against doing
> that at the same time we introduce pluggable storage. Analyzing the
> performance effects will be hard enough without changes like this.

I think once we switched to slots, doing heap_copytuple() do
frequently is not required anymore.

> - I've for now backed out the heap rewrite changes, partially. Mostly
> because I didn't like the way the abstraction looks, but haven't quite
> figured out how it should look like.

Yeah, it's hard part, but we need to invent something in this area...

> - I did not like that speculative tokens were moved to slots. There's
> really no reason for them to live outside parameters to tableam.h
> functsions.

Good.

> My current working state is at [3] (urls to clone repo are at [4]).
> This is *HEAVILY WIP*. I plan to continue working on it over the next
> days, but I'll temporarily focus onto v11 work. If others want I could
> move repo to github and grant others write access.

Github would be more convinient for me.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

In response to

Pluggable Storage - Andres's take at 2018-07-03 07:06:45 from Andres Freund

Responses

Re: Pluggable Storage - Andres's take at 2018-07-05 13:23:40 from Alexander Korotkov
Re: Pluggable Storage - Andres's take at 2018-07-16 13:44:53 from Andres Freund

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Larry Rosenman	2018-07-05 12:47:39	Re: peripatus build failures....
Previous Message	David Rowley	2018-07-05 11:52:35	Re: Generating partitioning tuple conversion maps faster