Re: stats.sql might fail due to shared buffers also used by parallel tests

From: Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Melanie Plageman <melanieplageman(at)gmail(dot)com>
Subject: Re: stats.sql might fail due to shared buffers also used by parallel tests
Date: 2025-07-03 06:54:50
Message-ID: CANhcyEX4hH9POyTM3vh=58newEF0=qgK46xF5i-RDir2zAZ4og@mail.gmail.com
Lists: pgsql-hackers

On Mon, 6 Jan 2025 at 12:30, Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
>
> Hello hackers,
>
> I'd like to share my investigation of a mysterious stats.sql failure
> that occurred in December: [1].
> The diff of the failure is:
> --- C:/prog/bf/root/HEAD/pgsql/src/test/regress/expected/stats.out 2024-09-18 19:31:14.665516500 +0000
> +++ C:/prog/bf/root/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/stats.out 2024-12-17
> 09:57:08.944588500 +0000
> @@ -1291,7 +1291,7 @@
> SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
> ?column?
> ----------
> - t
> + f
> (1 row)
>
> 027_stream_regress_primary.log contains:
> 2024-12-17 09:57:06.778 UTC [8568:109] pg_regress/stats LOG: statement: SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
> FROM pg_stat_io
> WHERE object = 'relation'
> 2024-12-17 09:57:06.779 UTC [8568:110] pg_regress/stats LOG: statement: CREATE TABLE test_io_shared(a int);
> 2024-12-17 09:57:06.780 UTC [8568:111] pg_regress/stats LOG: statement: INSERT INTO test_io_shared SELECT i FROM
> generate_series(1,100)i;
> ...
> 2024-12-17 09:57:06.782 UTC [8568:115] pg_regress/stats LOG: statement: CHECKPOINT;
> 2024-12-17 09:57:06.794 UTC [3664:40] LOG: checkpoint starting: immediate force wait
> 2024-12-17 09:57:06.856 UTC [3664:41] LOG: checkpoint complete: wrote 0 buffers (0.0%), wrote 1 SLRU buffers; 0 WAL
> file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.062 s; sync files=0, longest=0.000 s,
> average=0.000 s; distance=1875 kB, estimate=52682 kB; lsn=0/14A2F410, redo lsn=0/14A2F3B8
> 2024-12-17 09:57:06.857 UTC [8568:116] pg_regress/stats LOG: statement: CHECKPOINT;
> 2024-12-17 09:57:06.857 UTC [3664:42] LOG: checkpoint starting: immediate force wait
> 2024-12-17 09:57:06.859 UTC [3664:43] LOG: checkpoint complete: wrote 0 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL
> file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync files=0, longest=0.000 s,
> average=0.000 s; distance=0 kB, estimate=47414 kB; lsn=0/14A2F4E0, redo lsn=0/14A2F488
> 2024-12-17 09:57:06.859 UTC [8568:117] pg_regress/stats LOG: statement: SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
> FROM pg_stat_io
> WHERE object = 'relation'
> 2024-12-17 09:57:06.860 UTC [8568:118] pg_regress/stats LOG: statement: SELECT 77693 > 77693;
>
> The corresponding test case (added by 10a082bf7, dated 2023-02-11) is:
> SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
> FROM pg_stat_io
> WHERE object = 'relation' \gset io_sum_shared_before_
> ...
> CREATE TABLE test_io_shared(a int);
> INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
> ...
> CHECKPOINT;
> CHECKPOINT;
> ...
> SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
> FROM pg_stat_get_backend_io(pg_backend_pid())
> WHERE object = 'relation' \gset my_io_sum_shared_after_
> SELECT :my_io_sum_shared_after_writes >= :my_io_sum_shared_before_writes;
>
> So, as we can see from the log above, both checkpoints wrote 0 buffers;
> moreover, no other process/backend wrote any buffers.
>
> After a series of experiments, I found that this situation is possible
> when another backend steals shared buffers from the checkpointer, without
> making new buffers dirty, and then delays flushing its statistics.
>
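A breakdown of pg_stat_io by backend type (just a diagnostic sketch, not
part of the test) can show who performed the relation writes once the
stats are eventually flushed; under the scenario above, the writes should
show up for a client backend rather than the checkpointer:

-- Diagnostic only: attribute relation writes to backend types.
SELECT backend_type, context, sum(writes) AS writes,
       sum(fsyncs) AS fsyncs
FROM pg_stat_io
WHERE object = 'relation'
GROUP BY backend_type, context
ORDER BY writes DESC NULLS LAST;
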
> Please find attached a reproducer for this issue: I've added two sleeps
> to stats.sql and reduced parallel_schedule, but I believe this is a
> legitimate change that doesn't materially affect the test case.
>
> With bufchurn.sql running concurrently, and TEMP_CONFIG containing
> shared_buffers = 1MB (as in the 027_stream_regress test in question), I can
> see exactly the same failure:
> $ echo 'shared_buffers = 1MB' >/tmp/extra.config; TEMP_CONFIG=/tmp/extra.config make -s check
> # +++ regress check in src/test/regress +++
> # initializing database system by copying initdb template
> # using temp instance on port 65312 with PID 129700
> ok 1 - test_setup 356 ms
> ok 2 - create_index 1002 ms
> # parallel group (2 tests): stats bufchurn
> not ok 3 + stats 21176 ms
> ...
>
> diff -U3 .../src/test/regress/expected/stats.out .../src/test/regress/results/stats.out
> --- .../src/test/regress/expected/stats.out 2025-01-06 08:03:07.720275980 +0200
> +++ .../src/test/regress/results/stats.out 2025-01-06 08:15:34.602170721 +0200
> @@ -1321,7 +1321,7 @@
> SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
> ?column?
> ----------
> - t
> + f
> (1 row)
> ...
>
> [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-12-17%2008%3A59%3A44
>

Hi Hackers,

I have also encountered a similar buildfarm failure [1].

| 1/1 + subscription 142 ms FAIL
1/1 postgresql:regress / regress/regress ERROR 284.85s exit status 1

diff --strip-trailing-cr -U3
c:/build-farm-local/buildroot/HEAD/pgsql/src/test/regress/expected/subscription.out
c:/build-farm-local/buildroot/HEAD/pgsql.build/testrun/regress/regress/results/subscription.out
--- c:/build-farm-local/buildroot/HEAD/pgsql/src/test/regress/expected/subscription.out
2025-06-28 20:13:02 +0900
+++ c:/build-farm-local/buildroot/HEAD/pgsql.build/testrun/regress/regress/results/subscription.out
2025-06-28 20:35:21 +0900
@@ -70,7 +70,7 @@
SELECT :'prev_stats_reset' < stats_reset FROM
pg_stat_subscription_stats WHERE subname = 'regress_testsub';
?column?
----------
- t
+ f
(1 row)

The logs around this statement do not show anything unexpected:
2025-06-28 20:35:20.857 JST client backend[6092]
pg_regress/subscription STATEMENT: CREATE SUBSCRIPTION
regress_testsub CONNECTION 'testconn' PUBLICATION testpub;
2025-06-28 20:35:20.857 JST client backend[6092]
pg_regress/subscription ERROR: publication name "foo" used more than
once
2025-06-28 20:35:20.857 JST client backend[6092]
pg_regress/subscription STATEMENT: CREATE SUBSCRIPTION
regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION
foo, testpub, foo WITH (connect = false);
2025-06-28 20:35:20.858 JST client backend[6092]
pg_regress/subscription WARNING: subscription was created, but is not
connected
2025-06-28 20:35:20.858 JST client backend[6092]
pg_regress/subscription HINT: To initiate replication, you must
manually create the replication slot, enable the subscription, and
refresh the subscription.
2025-06-28 20:35:20.861 JST client backend[6092]
pg_regress/subscription ERROR: subscription "regress_testsub" already
exists

I looked at the test in subscription.sql. The statements are:
-- Reset the stats again and check if the new reset_stats is updated.
SELECT stats_reset as prev_stats_reset FROM pg_stat_subscription_stats
WHERE subname = 'regress_testsub' \gset
SELECT pg_stat_reset_subscription_stats(oid) FROM pg_subscription
WHERE subname = 'regress_testsub';
SELECT :'prev_stats_reset' < stats_reset FROM
pg_stat_subscription_stats WHERE subname = 'regress_testsub';

I suspect the following possible causes:
1) The time difference between 'prev_stats_reset' and the current
'stats_reset' value is less than 1 microsecond.
'stats_reset' is of type 'timestamp with time zone' and holds values
like '2025-06-30 21:01:07.925253+05:30', i.e. microsecond resolution.
So if the reset happens less than 1 microsecond after
'prev_stats_reset' was captured, the query 'SELECT
:'prev_stats_reset' < stats_reset FROM pg_stat_subscription_stats
WHERE subname = 'regress_testsub'' might return 'false' instead of
'true'.
However, I was not able to reproduce such a scenario despite multiple
attempts; even on high-end machines the reset takes at least a few
microseconds. There are also several similar timestamp comparisons in
'stats.sql', and I didn't find any other failure of this kind, so I
don't think this is the cause.
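
For reference, a minimal sketch of the kind of check I mean, using
clock_timestamp() as a stand-in for the actual reset timestamp (this
is only an illustration, not the test itself):

-- Capture two timestamps back-to-back and compare them the way the
-- test compares prev_stats_reset with the new stats_reset; if both
-- fell within the same microsecond, the '<' comparison returns false.
SELECT clock_timestamp() AS t1 \gset
SELECT clock_timestamp() AS t2 \gset
SELECT :'t1'::timestamptz < :'t2'::timestamptz AS strictly_later;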

2) The pg_stat_reset_subscription_stats(oid) function did not reset the stats.
We have a shared hash, 'pgStatLocal.shared_hash'. If the entry for the
subscription is not found there while executing
'pg_stat_reset_subscription_stats(oid)', the function cannot reset the
stats. Maybe this shared hash entry is somehow getting dropped.
It could also be failing for the same reason Alexander reproduced in
another case [2].
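
If that is what happens, the reset silently leaves stats_reset
unchanged. A diagnostic sketch along these lines (it cannot by itself
distinguish cause 1 from cause 2, since both leave the value
identical, but it shows whether stats_reset advanced at all):

-- Diagnostic only: capture stats_reset, reset, then check whether the
-- value actually changed rather than merely whether it increased.
SELECT stats_reset FROM pg_stat_subscription_stats
  WHERE subname = 'regress_testsub' \gset before_
SELECT pg_stat_reset_subscription_stats(oid) FROM pg_subscription
  WHERE subname = 'regress_testsub';
SELECT stats_reset IS DISTINCT FROM :'before_stats_reset'::timestamptz
    AS reset_changed_value
  FROM pg_stat_subscription_stats
  WHERE subname = 'regress_testsub';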

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamerkop&dt=2025-06-28%2011%3A02%3A30
[2]: https://www.postgresql.org/message-id/fe0391a8-dfa9-41c3-bf1c-7ea058e40f30%40gmail.com

Thanks and Regards,
Shlok Kyal
