Re: BUG #4565: nextval not updated during wal replication, leading to pk violations

From: Marc Schablewski <ms(at)clickware(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #4565: nextval not updated during wal replication, leading to pk violations
Date: 2008-12-09 13:02:59
Message-ID: 493E6C83.9060200@clickware.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Tom Lane wrote:
> "Marc Schablewski" <ms(at)clickware(dot)de> writes:
>
>> A job tried to insert some records into a few tables and complained about a
>> PK violation. PK are of type bigint. The column is populated by a sequence
>> and a "default nextval()" on the PK column. We found that the sequence's
>> currval was lower than the maximum value in the table and that this value
>> was already present. Further investigation showed us that the last records
>> in the table were inserted on the former master server while taking the
>> initial online backup for the replication. It seems that the records got
>> replicated but not the currval/nextval of the sequence. When running "select
>> nextval()" on the backup database it returned the PK value of the first
>> record inserted during that last run on the former master server.
>>
>
> How many sequence values "overlapped" here, exactly? If more than one,
> are you by any chance running that sequence with a CACHE setting larger
> than one? (See the sequence's cache_value column if you're unsure.)
>
> I can see what might be a race condition between nextval()'s WAL logging
> and checkpoint start, but it doesn't seem like it could account for more
> than the CACHE setting worth of overlap.
>
> regards, tom lane
>
>
Thanks for your reply.

cache_value is set to one for all sequences. As far as I can tell, they
were all created by a plain CREATE SEQUENCE seq_name and no other
settings changed. And as we found out later this "loss of information"
hit some indexes as well.

We took a fresh backup yesterday and again we saw that only the last WAL
generated during the backup run was restored on our backup system. I'm
sure that this is the real problem.

START WAL LOCATION: 4E/36F2C280 (file 000000020000004E00000036)
STOP WAL LOCATION: 4E/3944C988 (file 000000020000004E00000039)
CHECKPOINT LOCATION: 4E/36F2C280
START TIME: 2008-12-08 17:17:34 CET
LABEL: backup-20081208-171733
STOP TIME: 2008-12-08 18:22:25 CET

File 000000020000004E00000039 and all WAL that were generated afterwards
were restored. Files ...36 to ...38 weren't even requested. After
starting up the database, we had a table with records that were not
present it's PK index. But the were in the index on the former master
system. I ran an explicit pg_switch_xlog() on the master database before
switching to the backup to make sure all changes got transferred.

I'll check our scripts that do the backup/archiving/recovery again,
there still might be some kind of bug. But beside that, I'd be thankful
for any hints what else I could check.

Marc

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Dave Page 2008-12-09 13:06:25 Re: BUG #4573: download problem
Previous Message Pavan Deolasee 2008-12-09 12:47:33 Re: BUG #4573: download problem