Re: sequences vs. synchronous replication

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: sequences vs. synchronous replication
Date: 2021-12-23 18:50:22
Message-ID: a58f1c38-10eb-7d7f-372c-cfb95007d54e@enterprisedb.com
Lists: pgsql-hackers

On 12/23/21 15:42, Fujii Masao wrote:
>
>
> On 2021/12/23 3:49, Tomas Vondra wrote:
>> Attached is a patch tweaking WAL logging - in wal_level=minimal we do
>> the same thing as now, in higher levels we log every sequence fetch.
>
> Thanks for the patch!
>
> With the patch, I found that the regression test for sequences failed.
>
> +            fetch = log = fetch;
>
> This should be "log = fetch"?
>
> On second thought, originally a sequence doesn't guarantee that the
> value already returned by nextval() will never be returned by subsequent
> nextval() after the server crash recovery. That is, nextval() may return
> the same value across crash recovery. Is this understanding right? For
> example, this case can happen if the server crashes after nextval()
> returned the value but before WAL for the sequence was flushed to the
> permanent storage.

I think the important step is COMMIT. We don't guarantee anything for
changes in uncommitted transactions. If you do nextval in a transaction
and the server crashes before the WAL gets flushed at COMMIT, then yes,
a later nextval may generate the same value again. But once the
transaction commits that is not OK - it must not happen.
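
To make this concrete (a sketch - the sequence name and the values are
made up):

    -- case 1: crash before COMMIT, duplicates are acceptable
    BEGIN;
    SELECT nextval('s');    -- returns, say, 42
    -- crash before the transaction commits: after recovery,
    -- nextval('s') may return 42 again, and that's fine, because
    -- the transaction never committed

    -- case 2: crash after COMMIT, duplicates are a bug
    BEGIN;
    SELECT nextval('s');    -- returns, say, 42
    COMMIT;                 -- success reported to the client
    -- after recovery, nextval('s') must never return 42 again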

> So it's not a bug that sync standby may return the same value as
> already returned in the primary because the corresponding WAL has not
> been replicated yet, isn't it?
>

No, I don't think so. Once the COMMIT happens (and gets confirmed by the
sync standby), it should be possible to fail over to the sync replica
without losing any data in committed transactions. Generating duplicate
values is a clear violation of that.

IMHO the fact that we allow a transaction to commit (even just locally)
without flushing all the WAL it depends on is clearly a data loss bug.
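
A sketch of the problematic case as I understand it (sequence name and
value made up):

    -- on the primary, with a sync standby configured
    BEGIN;
    SELECT nextval('s');    -- returns 42 without writing any WAL,
                            -- consuming a value pre-logged earlier
                            -- by some other backend
    COMMIT;                 -- the transaction wrote no WAL of its
                            -- own, so it does not wait for the
                            -- standby (or even a local WAL flush)
    -- if we fail over before the earlier sequence WAL record has
    -- reached the standby:
    SELECT nextval('s');    -- may return 42 again, after a commit
                            -- the client saw confirmed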

> BTW, if the returned value is stored in database, the same value is
> guaranteed not to be returned again after the server crash or by sync
> standby. Because in that case the WAL of the transaction storing that
> value is flushed and replicated.
>

True, assuming the table is WAL-logged etc. I agree the issue may
affect only a fairly small fraction of workloads, because most people
use sequences to generate data for inserts etc.
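
For example (a sketch, names made up):

    CREATE TABLE t (id bigint);
    BEGIN;
    INSERT INTO t VALUES (nextval('s'));  -- stores, say, 42
    COMMIT;
    -- the INSERT forces a commit WAL record, and WAL is flushed and
    -- replicated in LSN order, so confirming this commit also covers
    -- the earlier sequence record that pre-logged 42 - the value
    -- can't be handed out again after recovery or failover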

>> So I think this makes it acceptable / manageable. Of course, this
>> means the values are much less monotonous (across backends), but I
>> don't think we really promised that. And I doubt anyone is really
>> using sequences like this (just nextval) in performance critical use
>> cases.
>
> I think that this approach is not acceptable to some users. So, if we
> actually adopt WAL-logging every sequence fetch, also how about exposing
> SEQ_LOG_VALS as reloption for a sequence? If so, those who want to log
> every sequence fetch can set this SEQ_LOG_VALS reloption to 0. OTOH,
> those who prefer the current behavior in spite of the risk we're
> discussing at this thread can set the reloption to 32 like it is for
> now, for example.
>

I think it'd be worth explaining why you think it's not acceptable to
some users.

I've demonstrated that the impact on regular workloads (i.e. ones that
also write other stuff to WAL) is not measurable, and that enabling
sequence caching eliminates most of the overhead for the rare
corner-case workloads, if needed. It does generate a bit more WAL, but
the sequence WAL records are pretty tiny.
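
If you want to see the record volume yourself, something like this
should work (a sketch - the sequence name, LSNs and counts are
illustrative, and pg_waldump output differs a bit between versions):

    -- in psql:
    SELECT pg_current_wal_lsn();    -- note the LSN, e.g. 0/3000120
    SELECT nextval('s') FROM generate_series(1, 1000);
    SELECT pg_current_wal_lsn();    -- note it again, e.g. 0/3010A38

    # then from a shell, count sequence WAL records in that range:
    pg_waldump --rmgr=Sequence -s 0/3000120 -e 0/3010A38 \
        -p $PGDATA/pg_wal | wc -l

With the current code that should count roughly 1000/32 records (one
per 32 fetches), with the patch roughly 1000.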

I'm opposed to adding reloptions that affect correctness - it just
seems like a bad idea to me. Moreover, setting CACHE for a sequence
does almost the same thing - if you set CACHE 32, we only generate WAL
once every 32 increments. The only difference is that this cache is not
shared between backends, so one backend will generate 1,2,3,... and
another backend will generate 33,34,35,... etc. I don't think that's a
problem, because if you want strictly monotonic / gap-less sequences
you can't use our sequences anyway. Yes, with short-lived backends this
may consume sequence values faster, but well - short-lived backends are
expensive anyway and overflowing a bigserial is still unlikely.
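
To illustrate (a sketch, the sequence name is made up):

    CREATE SEQUENCE s CACHE 32;

    -- backend 1:
    SELECT nextval('s');    -- 1
    SELECT nextval('s');    -- 2  (served from this backend's cache,
                            --     no new WAL until it's exhausted)

    -- backend 2, concurrently:
    SELECT nextval('s');    -- 33 (grabs its own chunk of 32 values)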

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
