Re: Sequence Access Method WIP

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Sequence Access Method WIP
Date: 2013-11-18 10:50:21
Message-ID: 5289F0ED.20708@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 18.11.2013 11:48, Andres Freund wrote:
> On 2013-11-18 10:54:42 +0200, Heikki Linnakangas wrote:
>> On 15.11.2013 20:21, Andres Freund wrote:
>>> Well, it exposes log_sequence_tuple() - together with the added "am
>>> private" column of pg_squence that allows to do quite a bit of different
>>> things. I think unless we really implement pluggable RMGRs - which I
>>> don't really see coming - we cannot be radically different.
>>
>> The proposed abstraction leaks like a sieve. The plugin should not need to
>> touch anything but the private amdata field. log_sequence_tuple() is way too
>> low-level.
>
> Why? It's *less* low level than a good part of other crash recovery code
> we have. I have quite some doubts about the layering, but it's imo
> pretty sensible to centralize the wal logging code that plugins couldn't
> write.

It doesn't go far enough, it's still too *low*-level. The sequence AM
implementation shouldn't need to have direct access to the buffer page
at all.

> If you look at what e.g the _alloc callback currently gets passed.
> a) Relation: Important for metadata like TupleDesc, name etc.
> b) SeqTable entry: Important to setup state for cur/lastval
> c) Buffer of the tuple: LSN etc.
> d) HeapTuple itself: a place to store the tuples changed state
>
> And if you then look at what gets modified in that callback:
> Form_pg_sequence->amdata
> ->is_called
> ->last_value
> ->log_cnt
> SeqTable ->last
> SeqTable ->cached
> SeqTable ->last_valid
>
> You need is_called, last_valid because of our current representation of
> sequences state (which we could change, to remove that need). The rest
> contains values that need to be set depending on how you want your
> sequence to behave:
> * Amdata is obvious.
> * You need log_cnt to influence/remember how big the chunks are you WAL log at
> once are. Which e.g. you need to control if want a sequence that
> doesn't leak values in crashes
> * cached is needed to control the per-backend caching.

I don't think the sequence AM should be in control of 'cached'. The
caching is done outside the AM. And log_cnt probably should be passed to
the _alloc function directly as an argument, ie. the server code asks
the AM to allocate N new values in one call.

I'm thinking that the alloc function should look something like this:

seqam_alloc(Relation seqrel, int nrequested, Datum am_private)

And it should return:

int64 value - the first value allocated.
int nvalues - the number of values allocated.
am_private - updated private data.

The backend code handles the caching and logging of values. When it has
exhausted all the cached values (or doesn't have any yet), it calls the
AM's alloc function to get a new batch. The AM returns the new batch,
and updates its private state as necessary. Then the backend code
updates the relation file with the new values and the AM's private data.
WAL-logging and checkpointing is the backend's responsibility.

>> Just as a thought-experiment, imagine that we decided to re-implement
>> sequences so that all the sequence values are stored in one big table, or
>> flat-file in the data directory, instead of the current implementation where
>> we have a one-block relation file for each sequence. If the sequence access
>> method API is any good, it should stay unchanged. That's clearly not the
>> case with the proposed API.
>
> I don't think we can entirely abstract away the reality that now they are
> based on relations and might not be at some later point. Otherwise we'd
> have to invent an API that copied all the data that's stored in struct
> RelationData. Like name, owner, ...
> Which non SQL accessible API we provide has a chance of providing that
> level of consistency in the face of fundamental refactoring? I'd say
> none.

I'm not thinking that we'd change sequences to not be relations. I'm
thinking that we might not want to store the state as a one-page file
anymore. In fact that was just discussed in the other thread titled
"init_sequence spill to hash table".

> b) It's not actually easy to get similar semantics in "userspace". How
> would you emulate the crash-safe but non-transactional semantics of
> sequences without copying most of sequence.c? Without writing
> XLogInsert()s which we cannot do from a out-of-tree code?

heap_inplace_update()

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2013-11-18 10:52:29 Re: REINDEX CONCURRENTLY 2.0
Previous Message Sameer Thakur 2013-11-18 09:54:16 Re: pg_stat_statements: calls under-estimation propagation