Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From: Pawel Kudzia <kudzia(at)gmail(dot)com>
To: Pawel Kudzia <kudzia(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
Date: 2021-06-17 06:40:18
Message-ID: CAJYBUS_5Z+9aHyNQ7wkTEkZF+c_RMn7hMjPkTGU0Uc7OHMU3sQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Based on the suggestions we've received from IRC we've:
* Moved some of our postgresql instances to another
physical server,
* Enabled data_checksums on all instances,
* Upgraded upgraded our servers to 13.3
(Debian 13.3-1.pgdg100+1) with hope that
https://github.com/postgres/postgres/commit/0d779d22a290a89b6c892137a37280b9588ad0cc
addresses the issue [ although it's ts-vector
specific, so unlikely to be related to our issue ].

Yet soon after we've observe another inconsistency.

Recap of the problem - SELECTs return rows, based on GIN,
that do not actually meet criteria expressed in WHERE.

SELECT entity_id,attribute_name_ids FROM entity WHERE
( attribute_name_ids && '{7572}' ) AND NOT
( (attribute_name_ids||0) && '{7572}') LIMIT 100 ;

entity_id | attribute_name_ids
-----------+----------------------------------------------------------------------------------------
22327791 | {1,2,3,4,6,8,9,10,11,13,14,17,19,21,35,72,366,1659,2208,2211,3270,3273,3279,5787,7650}

(1 row)

EXPLAIN ANALYZE of the query above:

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=35.46..46.67 rows=100 width=132) (actual
time=307.221..335.842 rows=1 loops=1)
-> Bitmap Heap Scan on entity (cost=35.46..3705.32 rows=32724
width=132) (actual time=307.220..335.839 rows=1 loops=1)
Recheck Cond: (attribute_name_ids && '{7572}'::integer[])
Rows Removed by Index Recheck: 72012
Filter: (NOT ((attribute_name_ids || 0) && '{7572}'::integer[]))
Rows Removed by Filter: 21501
Heap Blocks: exact=8998 lossy=9257
-> Bitmap Index Scan on entity_attribute_name_ids_gin
(cost=0.00..35.46 rows=32889 width=0) (actual time=19.790..19.790
rows=115485 loops=1)
Index Cond: (attribute_name_ids && '{7572}'::integer[])
Planning Time: 0.068 ms
Execution Time: 335.879 ms
(11 rows)

Relevant part of the table structure:

CREATE TABLE public.entity (
entity_id bigint NOT NULL,
attribute_value_ids integer[] NOT NULL,
attribute_name_ids integer[] NOT NULL,
[..]
);

ALTER TABLE ONLY public.entity ADD CONSTRAINT entity_pkey PRIMARY KEY
(entity_id);
CREATE INDEX entity_attribute_name_ids_gin ON public.entity USING gin
(attribute_name_ids public.gin__int_ops);
CREATE INDEX entity_attribute_value_ids_gin ON public.entity USING gin
(attribute_value_ids public.gin__int_ops);

There are at leat tens of updates / inserts per second to the table all the
time. the issue above manifests very rarely, after few days if not weeks of
uptime. we did not find any deterministic way of reproducing it, but it's
a matter of time till it shows.

The issue persist is related to on-disk data, it replicates from
streaming replication
masters to slaves.

Thank you in advance for your suggestions how to tackle this.

--
regards,
Pawel Kudzia

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Pawel Kudzia 2021-06-17 08:00:10 Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
Previous Message Pantelis Theodosiou 2021-06-16 10:47:53 Re: BUG #17060: ERROR: column "rownum" does not exist