Re: Corruption with duplicate primary key

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Alex Adriaanse <alex(at)oseberg(dot)io>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Corruption with duplicate primary key
Date: 2019-12-12 23:25:26
Message-ID: 20191212232526.zj3dqncjrc7pyckb@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Dec 11, 2019 at 11:46:40PM +0000, Alex Adriaanse wrote:
>On Thu., December 5, 2019 at 5:45 PM, Tomas Vondra wrote:
>> At first I thought maybe this might be due to collations changing and
>> breaking the index silently. What collation are you using?
>
>We're using en_US.utf8. We did not make any collation changes to my
>knowledge.
>

Well, the idea was more that glibc got updated and the collations
changed because of that (without PostgreSQL having a chance to even
notice that).

>> 1) When you do the queries, do they use index scan or sequential
>> scan? Perhaps it does sequential scan, and if you force index scan
>> (e.g. by rewriting the query) it'll only find one of those rows.
>
>By default it used an index scan. When I re-ran the query today (and
>confirmed that the query used an index only scan) I did not see any
>duplicates. If I force a sequential scan using "SET
>enable_index[only]scan = false" the duplicates reappear.
>

Hmmm, that's probably a sign of some sort of index corruption. Clearly,
when a row can't be found through an index, it's invisible to code
enforcing the unique constraint (relying on the index).

>However, using a backup from a week ago I see duplicates in both the
>query that uses an index only scan as well as the query that uses the
>sequential scan. So somehow over the past week the index got changed to
>eliminate duplicates.
>

Hmmm, that's interesting ... and confusing.

The good thing is that this is not an upgrade issue, because there was
no corruption right after the upgrade.

But then apparently the corruption appeared, and then disappeared for
some unknown reason, but only from the index. Puzzling.

>> 2) Can you check in backups if this data corruption was present in
>> the PG10 cluster, before running pg_upgrade?
>
>Sure. I just checked and did not see any corruption in the PG10
>pre-upgrade backup. I also re-upgraded that PG10 backup to PG12, and
>right after the upgrade I did not see any corruption either. I checked
>using both index scans and sequential scans.
>

OK, thanks. That's valuable piece of information.

How active is the system and can you do PITR? That is, can you try
restoring it into different points in time by replaying WAL? Then we
could check narrow-down when the corruption appeared and inspect the WAL
from that period.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Will Leinweber 2019-12-12 23:32:05 Errors "failed to construct the join relation" and "failed to build any 2-way joins"
Previous Message Andres Freund 2019-12-12 23:25:21 Re: tuplesort test coverage