Fix longstanding crash-safety bug with newly-created-or-reset sequences.
If a crash occurred immediately after the first nextval() call for a serial
column, WAL replay would restore the sequence to a state in which it
appeared that no nextval() had been done, thus allowing the first sequence
value to be returned again by the next nextval() call; as reported in
bug #6748 from Xiangming Mei.
More generally, the problem would occur if an ALTER SEQUENCE was executed
on a freshly created or reset sequence. (The manifestation with serial
columns was introduced in 8.2 when we added an ALTER SEQUENCE OWNED BY step
to serial column creation.) The cause is that sequence creation attempted
to save one WAL entry by writing out a WAL record that made it appear that
the first nextval() had already happened (viz, with is_called = true),
while marking the sequence's in-database state with log_cnt = 1 to show
that the first nextval() need not emit a WAL record. However, ALTER
SEQUENCE would emit a new WAL entry reflecting the actual in-database state
(with is_called = false). Then, nextval would allocate the first sequence
value and set is_called = true, but it would trust the log_cnt value and
not emit any WAL record. A crash at this point would thus restore the
sequence to its post-ALTER state, causing the next nextval() call to return
the first sequence value again.
To fix, get rid of the idea of logging an is_called status different from
reality. This means that the first nextval-driven WAL record will happen
at the first nextval call not the second, but the marginal cost of that is
pretty negligible. In addition, make sure that ALTER SEQUENCE resets
log_cnt to zero in any case where it touches sequence parameters that
affect future nextval results. This will result in some user-visible
changes in the contents of a sequence's log_cnt column, as reflected in the
patch's regression test changes; but no application should be depending on
that anyway, since it was already true that log_cnt changes rather
unpredictably depending on checkpoint timing.
In addition, make some basically-cosmetic improvements to get rid of
sequence.c's undesirable intimacy with page layout details. It was always
really trying to WAL-log the contents of the sequence tuple, so we should
have it do that directly using a HeapTuple's t_data and t_len, rather than
backing into it with some magic assumptions about where the tuple would be
on the sequence's page.
Back-patch to all supported branches.
src/backend/commands/sequence.c | 146 ++++++++++++++++++------------
src/test/regress/expected/sequence.out | 4 +-
src/test/regress/expected/sequence_1.out | 4 +-
3 files changed, 93 insertions(+), 61 deletions(-)
pgsql-committers by date
|Next:||From: Bruce Momjian||Date: 2012-07-26 10:22:25|
|Subject: pgsql: Simplify pg_upgrade's handling when returning directorylistings|
|Previous:||From: Peter Eisentraut||Date: 2012-07-25 19:04:25|
|Subject: pgsql: Document that pg_basebackup will create its output directory|