Re: BF mamba failure

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: BF mamba failure
Date: 2023-03-20 06:10:46
Message-ID: CAHut+PvVrjwJm_9ZqnXJk4x9k8dN0dYrV+T5_Rd30BSneDhv1A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Mar 19, 2023 at 2:00 AM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
>
> Hi,
>
> 18.03.2023 07:26, Tom Lane wrote:
>
> Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> writes:
>
> Peter Smith has recently reported a BF failure [1]. AFAICS, the call
> stack of failure [2] is as follows:
>
> Note the assertion report a few lines further up:
>
> TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0"), File: "pgstat_shmem.c", Line: 560, PID: 25004
>
>
> This assertion failure can be reproduced easily with the attached patch:
> ============== running regression test queries ==============
> test oldest_xmin ... ok 55 ms
> test oldest_xmin ... FAILED (test process exited with exit code 1) 107 ms
> test oldest_xmin ... FAILED (test process exited with exit code 1) 8 ms
> ============== shutting down postmaster ==============
>
> contrib/test_decoding/output_iso/log/postmaster.log contains:
> TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0"), File: "pgstat_shmem.c", Line: 561, PID: 456844
>
> With the sleep placed above Assert(entry_ref->shared_entry->dropped) this Assert fails too.
>
> Best regards,
> Alexander

I used a slightly modified* patch of Alexander's [1] applied to the
latest HEAD code (but with my "toptxn" patch reverted).
--- the patch was modified in that I injected 'sleep' both above and
below the Assert(entry_ref->shared_entry->dropped).

Using this I was also able to reproduce the problem. But test failures
were rare. The make check-world seemed OK, and indeed the
test_decoding tests would also appear to PASS around 14 out of 15
times.

============== running regression test queries ==============
test oldest_xmin ... ok 342 ms
test oldest_xmin ... ok 121 ms
test oldest_xmin ... ok 283 ms
============== shutting down postmaster ==============
============== removing temporary instance ==============

=====================
All 3 tests passed.
=====================

~~

Often (but not always) depite the test_decoding reported PASS all 3
tests as "ok", I still observed there was a TRAP in the logfile
(contrib/test_decoding/output_iso/log/postmaster.log).
TRAP: failed Assert("entry_ref->shared_entry->dropped")

~~

Occasionally (about 1 in 15 test runs) the test would fail the same
way as described by Alexander [1], with the accompanying TRAP.
TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount)
== 0"), File: "pgstat_shmem.c", Line: 562, PID: 32013

============== running regression test queries ==============
test oldest_xmin ... ok 331 ms
test oldest_xmin ... ok 91 ms
test oldest_xmin ... FAILED 702 ms
============== shutting down postmaster ==============

======================
1 of 3 tests failed.
======================

~~

FWIW, the "toptxn" patch. whose push coincided with the build-farm
error I first reported [2], turns out to be an innocent party in this
TRAP. We know this because all of the above results were running using
HEAD code but with that "toptxn" patch reverted.

------
[1] https://www.postgresql.org/message-id/1941b7e2-be7c-9c4c-8505-c0fd05910e9a%40gmail.com
[2] https://www.postgresql.org/message-id/CAHut%2BPsHdWFjU43VEX%2BR-8de6dFQ-_JWrsqs%3DvWek1hULexP4Q%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2023-03-20 06:17:51 Re: Privileges on PUBLICATION
Previous Message Masahiko Sawada 2023-03-20 05:24:38 Re: [PoC] Improve dead tuple storage for lazy vacuum