BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: kudzia(at)gmail(dot)com
Subject: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
Date: 2020-12-27 17:18:36
Message-ID: 16792-b1913b6b4e098331@postgresql.org
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 16792
Logged by: pawel kudzia
Email address: kudzia(at)gmail(dot)com
PostgreSQL version: 11.10
Operating system: Debian Buster x86_64
Description:

I'd like to ask for advice on how to proceed with pinpointing the cause of the
silent GIN index corruption that I'm facing.

Relevant part of the table structure:

CREATE TABLE public.entity (
    entity_id bigint NOT NULL,
    attribute_value_ids integer[] NOT NULL,
    attribute_name_ids integer[] NOT NULL,
    [..]
);

ALTER TABLE ONLY public.entity ADD CONSTRAINT entity_pkey PRIMARY KEY (entity_id);
CREATE INDEX entity_attribute_name_ids_gin ON public.entity USING gin (attribute_name_ids public.gin__int_ops);
CREATE INDEX entity_attribute_value_ids_gin ON public.entity USING gin (attribute_value_ids public.gin__int_ops);

How does the issue manifest?

Queries that use the GIN index on an integer[] column occasionally return too
many rows, including rows that do not actually match the criteria expressed in
the WHERE clause. Queries like the one below should never return any rows,
yet occasionally they do:

data=> SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{4980}' )
AND NOT ( (attribute_name_ids||0) && '{4980}') ;
entity_id
-----------
31213924
31195117
31209184
(3 rows)

Query plan:
QUERY PLAN
-------------------------------------------------------------------------------------------------
Bitmap Heap Scan on entity (cost=50.39..3068.11 rows=26923 width=22)
Recheck Cond: (attribute_name_ids && '{4980}'::integer[])
Filter: (NOT ((attribute_name_ids || 0) && '{4980}'::integer[]))
-> Bitmap Index Scan on entity_attribute_name_ids_gin (cost=0.00..50.39
rows=27058 width=0)
Index Cond: (attribute_name_ids && '{4980}'::integer[])
(5 rows)

The query was intentionally crafted to make PostgreSQL use the GIN index
entity_attribute_name_ids_gin for the left side of the AND, and then filter
out rows whose actual values do not match that criterion.
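
As an illustration of the same reasoning, the index-driven result set can be
compared against a forced sequential scan for the plain containment condition;
a minimal sketch, assuming the session is free to create temporary tables and
toggle the standard planner settings:

-- rows matched via the GIN index (bitmap index scan expected)
CREATE TEMP TABLE via_index AS
SELECT entity_id FROM entity WHERE attribute_name_ids && '{4980}';

-- rows matched via a forced sequential scan, i.e. using heap values only
SET enable_bitmapscan = off;
SET enable_indexscan = off;
CREATE TEMP TABLE via_seqscan AS
SELECT entity_id FROM entity WHERE attribute_name_ids && '{4980}';
RESET enable_bitmapscan;
RESET enable_indexscan;

-- any rows left over here are false positives produced by the index
SELECT entity_id FROM via_index
EXCEPT
SELECT entity_id FROM via_seqscan;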

What have I done so far?

The problem was originally discovered while running PostgreSQL 11.7 from
Debian's repository. As a first step we upgraded to PostgreSQL 11.9, also from
Debian's repository, and ran:

reindex (verbose) table entity;
vacuum (verbose, full, analyze) entity;

After a few days of updates the problem returned.

We then upgraded to PostgreSQL 11.10 from the postgresql.org repository for
Debian and reindexed / vacuumed again. After a few weeks of updates the
problem returned once more.

Other information worth noting:
* the table in question has 38M rows and is the only table created after the
PostgreSQL installation
* the server in question is the master of a streaming replication setup; the
issue also occurs on the slave servers, which suggests corruption of the
on-disk GIN index data
* simply rewriting the rows falsely returned by the SELECT above - UPDATE
entity SET attribute_name_ids = '{....}' WHERE entity_id = 123 - fixes the
issue
* we've set up rudimentary consistency checks that catch some of the
corruptions within 1h of them occurring (a sketch of such a check follows
below this list); so far corruptions only affect rows that have been updated
since the previous check
* the server in question handles heavy read/write traffic
* the PostgreSQL instance in question runs in an LXC container with Debian
Buster, on a Debian Buster host running on a bare-metal server
* it's highly unlikely that a hardware malfunction is to blame - the same
physical server also hosts a few MySQL instances with hundreds of GB of data,
with heavy consistency checks cross-checking the content of MySQL against
PostgreSQL, unpacking gzip'ed blobs stored in the MySQL databases and
de-serializing the objects stored there. If there were a bit-flip or bit-rot
in memory that ECC did not detect or fix, these high-level consistency checks
would pick it up.
* despite numerous attempts I cannot reproduce the issue in a test
environment, possibly due to the much lower level of write traffic; it takes
days to weeks for the problem to re-occur on the production server.
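
For reference, a minimal sketch of what such a periodic check can look like;
checked_name_ids is a hypothetical helper table (not part of our actual setup)
listing the attribute name ids to verify:

-- any row returned here is a false positive coming from the GIN index:
-- the index claims the attribute name id is present, the heap value says
-- it is not (verify with EXPLAIN that the inner query is actually driven
-- by a bitmap index scan on entity_attribute_name_ids_gin)
SELECT c.name_id, e.entity_id
FROM checked_name_ids AS c
CROSS JOIN LATERAL (
    SELECT entity_id
    FROM entity
    WHERE attribute_name_ids && ARRAY[c.name_id]
      AND NOT ( (attribute_name_ids || 0) && ARRAY[c.name_id] )
) AS e;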

Thank you in advance for your suggestions on how to tackle this.

best regards,
Pawel
