unique index violation after pg_upgrade to PG10

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Chris Lexvold <lexvold(at)telsasoft(dot)com>
Subject: unique index violation after pg_upgrade to PG10
Date: 2017-10-24 18:14:53
Message-ID: 20171024181453.GI21735@telsasoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I upgrade another instance to PG10 yesterday and this AM found unique key
violations.

Our application is SELECTing FROM sites WHERE site_location=$1, and if it
doesn't find one, INSERTs one (I know that's racy and not ideal). We ended up
with duplicate sites, despite a unique index. We removed the duplicate rows
and reindexed fine. This is just a heads up with all the detail I can fit in a
mail (but there's more if needed).

ts=# \d sites
site_id | integer | | not null | nextval('sites_site_id_seq'::regclass)
site_office | text | | |
site_location | text | | |
[...]
Indexes:
"sites_pkey" PRIMARY KEY, btree (site_id)
"sites_idx" UNIQUE, btree (site_office, site_location)

ts=# SELECT site_office, site_location, count(*), min(site_id), max(site_id) FROM sites GROUP BY 1,2 HAVING count(*)>1 ORDER BY 1,2;
site_office | site_location | count | min | max
----------------+---------------+-------+-----+-----
CRCLMT-DOEMS0 | | 2 | 165 | 351
CRCLMT-DOEMS0 | 1101 | 2 | 123 | 343
CRCLMT-DOEMS0 | 1102 | 2 | 134 | 318
CRCLMT-DOEMS0 | 1103 | 2 | 145 | 322
CRCLMT-DOEMS0 | 1104 | 2 | 156 | 329

The duplicate site_ids mean this isn't an issue with row version/visibility due
to XIDs (right?).

ts=# SELECT 1 FROM sites WHERE site_office='CRCLMT-CEMS0' AND site_location='';
(0 rows)
ts=# SET enable_bitmapscan=off; SET enable_indexscan=off; SELECT 1 FROM sites WHERE site_office='CRCLMT-CEMS0' AND site_location='';
-[ RECORD 1 ]
?column? | 1

So there's an issue with indices failing to return matching rows (and thereby
allowing inserting duplicate violating rows).

That's the only table/index affected (and the only PG instance thus far
affected).

Note regarding my pg_upgrade: 3 years ago, this was our first and smallest
customer who I upgraded off PG8.4 (to PG9.2 if I recall), and I did it using
pg_dump |pg_restore. I believe, as a consequence, its postgres database was in
"C" locale and ASCII encoding. So the last few upgrades (9.3, .4 .5 and .6), I
believe I've manually used initdb --encoding followed by pg_upgrade (else it
fails due to new postgres/template DBs with different locale/encoding from
old). This upgrade, I finally renamed postgres DB (which has imported CSV logs
and one or two other things) and pg_dump|pg_restore into a new DB with UTF8
encoding, which allowed pg_upgrade to run without special initdb invocation.

I have an LVM snapshot and full CSV logs imported into a table. I also have a
backup from 22:00 which doesn't have duplicate sites. Those seem to have been
inserted by our application around 00:30:

These IDs which inserted duplicate rows:
postgres=# SELECT session_id, max(session_line) FROM postgres_log_2017_10_24_0000 WHERE message LIKE 'statement: SELECT site_id FROM sites WHERE%' GROUP BY 1 ;
session_id | max
---------------+------
59eedfb1.5cea | 714
59eedfb5.5cf1 | 1741
(2 rows)

postgres=# SELECT log_time, session_id, session_line, left(message,333) FROM postgres_log WHERE (session_id='59eedfb1.5cea' OR session_id='59eedfb5.5cf1') AND (session_line<6 OR message LIKE '%INSERT INTO site%') ORDER BY 1,2,3;
-[ RECORD 4 ]+----------------------------------------------------------------------------------------------------------------------------------------------
log_time | 2017-10-24 00:37:37.888-06
session_id | 59eedfb1.5cea
session_line | 4
left | statement: SELECT site_id FROM sites WHERE +
| site_office = 'CRCLMT-DOEMS0' AND site_location = '1203'
-[ RECORD 5 ]+----------------------------------------------------------------------------------------------------------------------------------------------
log_time | 2017-10-24 00:37:37.89-06
session_id | 59eedfb1.5cea
session_line | 5
left | statement: INSERT INTO sites (site_office,site_location,site_alias) +
| VALUES ('CRCLMT-DOEMS0', '1203', (SELECT site_id FROM sites +
| WHERE site_office = 'CRCLMT-CEMS0' AND site_location = '1203'))

Note:
I run a script which does various combinations of ANALYZE/VACUUM (FULL/ANALYZE)
following the upgrade, and a script runs nightly with REINDEX and pg_repack
(and a couple of CLUSTER), so you should assume that any combination of those
maintenance commands have been run.

In our reindex/repack log I found the first error due to duplicates:
Tue Oct 24 01:27:53 MDT 2017: sites: sites_idx(repack non-partitioned)...
WARNING: Error creating index "public"."index_61764": ERROR: could not create unique index "index_61764"
DETAIL: Key (site_office, site_location)=(CRCLMT-DOEMS0, 1120) is duplicated.
WARNING: Skipping index swapping for "sites", since no new indexes built
WARNING: repack failed for "sites_idx"
reindex: warning, dropping invalid/unswapped index: index_61764

postgres=# SELECT * FROM postgres_log WHERE session_id='59eeeb79.4bd5' AND error_severity='ERROR' ORDER BY 1 DESC LIMIT 33;
log_time | 2017-10-24 01:27:53.545-06
user_name | postgres
database | ts
pid | 19413
connection_from | [local]
session_id | 59eeeb79.4bd5
session_line | 10
command_tag | CREATE INDEX
session_start_time | 2017-10-24 01:27:53-06
error_severity | ERROR
message | could not create unique index "index_61764"
detail | Key (site_office, site_location)=(CRCLMT-DOEMS0, 1120) is duplicated.
query | CREATE UNIQUE INDEX CONCURRENTLY index_61764 ON sites USING btree (site_office, site_location) TABLESPACE pg_default
application_name | pg_repack

Disclosure: I see a storage error from 2 days ago (pre-upgrade). This is a
4-drive RAID5 with spare. I can't see that we're missing a spare, so I don't
know what else to do besides consider this message to be spurious..

Oct 22 21:28:16 midrivers-ubuntu kernel: [1429369.825577] hpsa 0000:05:00.0: scsi 4:0:0:0 Aborting command ffff8f84f2ba7c00Tag:0x00000000:000001f0 CDBLen: 6 CDB: 0x1201... SN: 0x0 BEING SENT
Oct 22 21:28:16 midrivers-ubuntu kernel: [1429369.825583] hpsa 0000:05:00.0: scsi 4:0:0:0: Aborting command RAID HP P410i controller SSDSmartPathCap- En- Exp=1
Oct 22 21:28:16 midrivers-ubuntu kernel: [1429369.898930] hpsa 0000:05:00.0: CDB 12018000600000000000000000000000 was aborted with status 0x0
Oct 22 21:28:17 midrivers-ubuntu kernel: [1429371.269963] hpsa 0000:05:00.0: invalid command: LUN:0000000000000000 CDB:00000000f00100000000000000000000
Oct 22 21:28:17 midrivers-ubuntu kernel: [1429371.269966] hpsa 0000:05:00.0: probably means device no longer present
Oct 22 21:28:17 midrivers-ubuntu kernel: [1429371.269968] hpsa 0000:05:00.0: scsi 4:0:0:0 Aborting command ffff8f84f2ba7c00Tag:0x00000000:000001f0 CDBLen: 6 CDB: 0x1201... SN: 0x0 SENT, FAILED
Oct 22 21:28:17 midrivers-ubuntu kernel: [1429371.269974] hpsa 0000:05:00.0: scsi 4:0:0:0: FAILED to abort command RAID HP P410i controller SSDSmartPathCap- En- Exp=1

sites is perhaps our most central table and I would expect issues there to be
quickly apparent (however this was at least initially a silent failure).
public | sites | table | telsasoft | 80 kB |

Justin

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2017-10-24 18:27:04 Re: Current int & float overflow checking is slow.
Previous Message Konstantin Knizhnik 2017-10-24 17:13:28 Deadlock in ALTER SUBSCRIPTION REFRESH PUBLICATION