Re: pg_stat_statements: more test coverage

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter(at)eisentraut(dot)org>
Cc: Julien Rouhaud <rjuju123(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_stat_statements: more test coverage
Date: 2023-12-31 18:42:35
Message-ID: 3768231.1704048155@sss.pgh.pa.us
Lists: pgsql-hackers

Peter Eisentraut <peter(at)eisentraut(dot)org> writes:
> It looks like the failing configurations are exactly all the big-endian
> ones: s390x, sparc, powerpc. So it's possible that this is actually a
> bug? But unless someone can reproduce this locally and debug it, we
> should probably revert this for now.

The reason for the platform-dependent behavior is that we're dealing
with a bunch of entries with identical "usage", so it's just about
random which ones actually get deleted. I do not think our qsort()
has platform-dependent behavior --- but the initial contents of
entry_dealloc's array are filled in hash_seq_search order, and that
*is* platform-dependent.

Now, the test script realizes that hazard. The bug seems to be that
it's wrong about how many competing usage-count-1 entries there are.
Instrumenting the calls to entry_alloc (which'll call entry_dealloc
when we hit 100 entries), I see

2023-12-31 13:21:39.160 EST client backend[3764867] pg_regress/max LOG: calling entry_alloc for 'SELECT pg_stat_statements_reset() IS NOT', cur hash size 0
2023-12-31 13:21:39.160 EST client backend[3764867] pg_regress/max LOG: calling entry_alloc for '$1', cur hash size 1
2023-12-31 13:21:39.160 EST client backend[3764867] pg_regress/max CONTEXT: SQL expression "1"
PL/pgSQL function inline_code_block line 3 at FOR with integer loop variable
2023-12-31 13:21:39.160 EST client backend[3764867] pg_regress/max LOG: calling entry_alloc for 'format($3, lpad(i::text, $4, $5))', cur hash size 2
2023-12-31 13:21:39.160 EST client backend[3764867] pg_regress/max CONTEXT: SQL expression "format('select * from t%s', lpad(i::text, 3, '0'))"
PL/pgSQL function inline_code_block line 4 at EXECUTE
2023-12-31 13:21:39.160 EST client backend[3764867] pg_regress/max LOG: calling entry_alloc for 'select * from t001', cur hash size 3
2023-12-31 13:21:39.160 EST client backend[3764867] pg_regress/max CONTEXT: SQL statement "select * from t001"
PL/pgSQL function inline_code_block line 4 at EXECUTE
...
2023-12-31 13:21:39.165 EST client backend[3764867] pg_regress/max LOG: calling entry_alloc for 'select * from t098', cur hash size 100
2023-12-31 13:21:39.165 EST client backend[3764867] pg_regress/max CONTEXT: SQL statement "select * from t098"
PL/pgSQL function inline_code_block line 4 at EXECUTE
2023-12-31 13:21:39.165 EST client backend[3764867] pg_regress/max LOG: entry_dealloc: zapping 10 of 100 victims

So the dealloc happens sooner than the script expects, and it's pure
chance that the test appeared to work anyway.

The test case needs to be rewritten to allow for more competing
usage-count-1 entries than it currently does. Backing "98" down to
"95" might be enough, but I've not experimented (and I'd recommend
leaving more than the minimum amount of headroom, in case plpgsql
changes behavior about how many subexpressions get put into the
table).

Strongly recommend that while fixing the test, you stick in some
debugging elogs to verify when the dealloc calls actually happen
rather than assuming you predicted it correctly. I did it as
attached.

regards, tom lane

Attachment Content-Type Size
hack-instrument-entry-alloc-and-dealloc.patch text/x-diff 939 bytes
