Re: BUG #17485: Records missing from Primary Key index when doing REINDEX INDEX CONCURRENTLY

From: Andres Freund <andres(at)anarazel(dot)de>
To: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Michael Paquier <michael(at)paquier(dot)xyz>, Петър Славов <pet(dot)slavov(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #17485: Records missing from Primary Key index when doing REINDEX INDEX CONCURRENTLY
Date: 2022-05-28 19:34:13
Message-ID: 20220528193413.cvhno3ky4tikuiqo@alap3.anarazel.de
Lists: pgsql-bugs

Hi,

On 2022-05-28 19:46:40 +0500, Andrey Borodin wrote:
> > On 28 May 2022, at 12:02, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> > I think you basically need to force some, but not all, of the modifying
> > transactions to be open for a bit longer, so that it's more likely that
> > there's a chance to prune vs CIC waiting. Might also be helpful to update rows
> > multiple times within an xact.
> Now I've got 2 different versions of test for master branch. Both fail in 50% of cases on my machine. Both take approximately 4 seconds of wallclock time and 1 second of CPU time.
>
> v3: wait with a fraction of waiting transactions.
> This test fails with
> 0 postgres 0x00000001049ec508 ExceptionalCondition + 124
> 1 postgres 0x00000001045ea284 heap_page_prune + 2992
> 2 postgres 0x00000001045e9670 heap_page_prune_opt + 424
> 3 postgres 0x00000001045e25c0 heapam_index_fetch_tuple + 140
> 4 postgres 0x0000000100272d60 index_fetch_heap + 104
> 5 postgres 0x0000000100272e18 index_getnext_slot + 88
> 6 postgres 0x00000001003bbf4c check_exclusion_or_unique_constraint + 440
> 7 postgres 0x00000001003bc360 ExecCheckIndexConstraints + 232
> 8 postgres 0x00000001003ea30c ExecInsert + 1024
> 9 postgres 0x00000001003e90cc ExecModifyTable + 1536
> 10 postgres 0x00000001003bd0cc standard_ExecutorRun + 268
> 11 postgres 0x0000000100542d94 ProcessQuery + 160
> 12 postgres 0x00000001005423c8 PortalRunMulti + 396
> 13 postgres 0x0000000100541cfc PortalRun + 476
>
> And reverting d9d0762 does not fix the issue. I'm not sure if I'm observing some other problem here.

I've not been able to reproduce this issue, even after increasing the number
of clients and transactions and running the test a number of times. With
d9d0762 reverted, the problem doesn't happen anymore for me.

Any chance you hit this with d9d0762 reverted? It's easy to e.g. revert and
run the tests without recreating the temp-install, to reduce cycle times.

Was there anything else running on the system? c98763bf51bf also needs to be
reverted, of course.

> v4 of the test does not use pg_sleep() and fails with a regular amcheck failure. Reverting d9d0762 fixes the test. Unless I execute the test for 1 million transactions, then it fails even with the revert...

What you're saying is that the revert might not actually fix the problem, or
that you're hitting a separate bug. Correct?

If it fails after 1 mio xacts, what are the symptoms?

It might be worth trying to repro the problem in 13 or such.

Greetings,

Andres Freund
