Re: Sequence Access Method WIP

From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Sequence Access Method WIP
Date: 2014-11-04 21:01:52
Message-ID: 54593EC0.4030009@2ndquadrant.com
Lists: pgsql-hackers

On 04/11/14 13:11, Heikki Linnakangas wrote:
> On 10/13/2014 01:01 PM, Petr Jelinek wrote:
>> Hi,
>>
>> I rewrote the patch with different API along the lines of what was
>> discussed.
>
> Thanks, that's better.
>
> It would be good to see an alternative seqam to implement this API, to
> see how it really works. The "local" one is too dummy to expose any
> possible issues.
>

Yeah, I'm not sure what that would be; it's hard for me to find
something that would serve purely demonstration purposes. I guess I
could port BDR sequences to this if it would help (once we have a bit
more solid agreement that the proposed API at least theoretically seems
OK, so that I don't have to rewrite it 10 times if at all possible).

>> ...
>> Only the alloc and reloptions methods are required (and implemented by
>> the local AM).
>>
>> The caching, xlog writing, updating the page, etc. are handled by the
>> backend; the AM does not see the tuple at all. I decided to not even
>> pass the struct around and just pass the relevant options, because I
>> think if we want to abstract the storage properly then the AM should
>> not care about what the pg_sequence tuple looks like at all, even if
>> it means that the sequence_alloc parameter list is a bit long.
>
> Hmm. The division of labour between the seqam and commands/sequence.c
> still feels a bit funny. sequence.c keeps track of how many values have
> been WAL-logged, and thus usable immediately, but we still call
> sequence_alloc even when using up those already WAL-logged values. If
> you think of using this for something like a centralized sequence server
> in a replication cluster, you certainly don't want to make a call to the
> remote server for every value - you'll want to cache them.
>
> With the "local" seqam, there are two levels of caching. Each backend
> caches some values (per the CACHE <value> option in CREATE SEQUENCE). In
> addition to that, the server WAL-logs 32 values at a time. If you have a
> remote seqam, it would most likely add a third cache, but it would
> interact in strange ways with the second cache.
>
> Considering a non-local seqam, the locking is also a bit strange. The
> server keeps the sequence page locked throughout nextval(). But if the
> actual state of the sequence is maintained elsewhere, there's no need to
> serialize the calls to the remote allocator, i.e. the sequence_alloc()
> calls.
>
> I'm not exactly sure what to do about that. One option is to completely
> move the maintenance of the "current" value, i.e. sequence.last_value,
> to the seqam. That makes sense from an abstraction point of view. For
> example with a remote server managing the sequence, storing the
> "current" value in the local catalog table makes no sense as it's always
> going to be out-of-date. The local seqam would store it as part of the
> am-private data. However, you would need to move the responsibility of
> locking and WAL-logging to the seqam. Maybe that's OK, but we'll need to
> provide an API that the seqam can call to do that. Perhaps just let the
> seqam call heap_inplace_update on the sequence relation.
>

My idea of how this works is: sequence_next handles the allocation and
the level-2 caching (the WAL-logged cache) via amdata if it supports it,
or returns a single value if it doesn't - in that case the WAL record
always covers just the one value and there is basically no level-2
cache. It is sequence_next that controls how much gets WAL-logged; what
the backend asks for is only a "suggestion".
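
To make that concrete, here is a rough C sketch of what I have in mind
for the alloc callback. The parameter list is invented here (not copied
from the patch), and treating amdata as a plain int64 counter is of
course a simplification, but it shows the AM deciding how many values
are actually reserved and WAL-logged, while the backend's nrequested is
only a hint:

#include "postgres.h"
#include "utils/rel.h"

/*
 * Toy alloc callback.  The signature is invented for illustration and
 * amdata is abused as a simple int64 counter; a real AM would keep a
 * proper varlena structure there (or (Datum) 0 for "no private state").
 */
static int64
toy_sequence_alloc(Relation seqrel,
                   int64 incby,
                   int64 nrequested,    /* backend's suggestion */
                   int64 *nreserved,    /* how many values we hand out */
                   Datum *amdata)
{
    int64   last = DatumGetInt64(*amdata);
    int64   first = last + incby;

    /*
     * A remote AM could fetch a whole batch from its allocation server
     * here; a "no level-2 cache" AM would instead force *nreserved = 1,
     * so every nextval() comes back to the AM and the backend WAL-logs
     * one value at a time.
     */
    *nreserved = nrequested;
    *amdata = Int64GetDatum(first + (*nreserved - 1) * incby);

    return first;
}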

The third level of caching, as you say, should be solved by the
sequence_request_update and sequence_update callbacks - those enable the
sequence AM to handle this kind of caching asynchronously, without
blocking sequence_next unnecessarily.
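
Something along these lines is what I imagine for the asynchronous path
- again the names and signatures below are only illustrative, not what
is in the patch:

#include "postgres.h"
#include "utils/rel.h"

/*
 * Illustrative only: the request side fires off work without waiting,
 * the update side merges the result in later.
 */
static void
toy_sequence_request_update(Relation seqrel, Datum amdata)
{
    /*
     * Called when sequence_next notices the locally cached range is
     * running low.  Kick a background worker or send an asynchronous
     * request to the remote allocator and return immediately, so
     * nextval() never waits on the network.
     */
}

static Datum
toy_sequence_update(Relation seqrel, Datum amdata, Datum newchunk)
{
    /*
     * Called once the new allocation has arrived.  Merge it into the
     * AM-private state; subsequent sequence_next calls then allocate
     * from it without any remote round trip.
     */
    return newchunk;
}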

That way it's possible to implement many different strategies, from no
cache at all to three levels of caching, with or without blocking calls
to sequence_next.

>> For the amdata handling (which is the AM's private data variable) the
>> API assumes that (Datum) 0 is NULL, this seems to work well for
>> reloptions so should work here also and it simplifies things a little
>> compared to passing pointers to pointers around and making sure
>> everything is allocated, etc.
>>
>> Sadly, the fact that amdata is not fixed size and can be NULL made
>> the page updates of the sequence relation quite a bit more complex
>> than they used to be.
>
> It would be nice if the seqam could define exactly the columns it needs,
> with any datatypes. There would be a set of common attributes:
> sequence_name, start_value, cache_value, increment_by, max_value,
> min_value, is_cycled. The local seqam would add "last_value", "log_cnt"
> and "is_called" to that. A remote seqam that calls out to some other
> server might store the remote server's hostname etc.
>
> There could be a seqam function that returns a TupleDesc with the
> required columns, for example.
>

Wouldn't that somewhat bloat the catalog if we had a new catalog table
for each sequence AM? It also does not really solve the "issue" of
amdata being dynamically sized. I think a dynamic field where the AM
stores whatever it wants is enough (i.e. the current solution); it's
just that the hackery the sequence storage implementation does is more
visible once there are non-fixed-size fields.
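
Just to illustrate what I mean by a dynamic field where the AM stores
whatever it wants - a rough sketch, with a made-up layout, of how an AM
could serialize its private state into amdata instead of needing its own
catalog columns:

#include "postgres.h"

/* Invented layout, just to show the idea. */
typedef struct ToySeqAmState
{
    int32   vl_len_;        /* varlena header, don't touch directly */
    int64   cached_first;   /* first locally cached value */
    int64   cached_last;    /* last locally cached value */
    char    server[64];     /* e.g. hostname of a remote allocator */
} ToySeqAmState;

static Datum
toy_build_amdata(int64 first, int64 last, const char *server)
{
    ToySeqAmState *state = palloc0(sizeof(ToySeqAmState));

    SET_VARSIZE(state, sizeof(ToySeqAmState));
    state->cached_first = first;
    state->cached_last = last;
    strlcpy(state->server, server, sizeof(state->server));

    /* (Datum) 0 still means "no private state", per the proposed API */
    return PointerGetDatum(state);
}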

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
