From: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
---|---|
To: | Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Melanie Plageman <melanieplageman(at)gmail(dot)com> |
Subject: | Re: stats.sql might fail due to shared buffers also used by parallel tests |
Date: | 2025-07-04 20:00:00 |
Message-ID: | e9d78260-d401-406b-9e3a-9f939f27359b@gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hello Shlok,
03.07.2025 09:54, Shlok Kyal wrote:
> I have also encountered a similar buildfarm failure [1].
>
> | 1/1 + subscription 142 ms FAIL
> 1/1 postgresql:regress / regress/regress ERROR 284.85s exit status 1
>
> diff --strip-trailing-cr -U3
> c:/build-farm-local/buildroot/HEAD/pgsql/src/test/regress/expected/subscription.out
> c:/build-farm-local/buildroot/HEAD/pgsql.build/testrun/regress/regress/results/subscription.out
> --- c:/build-farm-local/buildroot/HEAD/pgsql/src/test/regress/expected/subscription.out
> 2025-06-28 20:13:02 +0900
> +++ c:/build-farm-local/buildroot/HEAD/pgsql.build/testrun/regress/regress/results/subscription.out
> 2025-06-28 20:35:21 +0900
> @@ -70,7 +70,7 @@
> SELECT :'prev_stats_reset' < stats_reset FROM
> pg_stat_subscription_stats WHERE subname = 'regress_testsub';
> ?column?
> ----------
> - t
> + f
> (1 row)
Thank you for your attention to this! Yes, I also noticed that failure,
but could not reproduce it yet.
> So according to me, I suspect the following causes
> 1) The time difference between 'prev_stats_reset' and current
> 'stats_reset' value is less than 1 microseconds.
> 'stats_reset' is of type 'timestamp with time zone' and the content of
> it is like: '2025-06-30 21:01:07.925253+05:30'. So if the time
> difference between 'prev_stats_reset' and current 'stats_reset' is
> less than 1 microseconds. The query 'SELECT :'prev_stats_reset' <
> stats_reset FROM pg_stat_subscription_stats WHERE subname =
> 'regress_testsub'' might return 'false' instead of 'true'.
> But I was not able to reproduce such a scenario after multiple
> testing. Even in high end machines, it takes at least a few
> microseconds. Also there are multiple cases where we did similar
> timestamp comparison in 'stats.sql' as well. And, I didn't find any
> other failure related to such case. So, I feel this is not possible.
Did you try that on Windows (hamerkop is a Windows animal)? IIUC,
GetCurrentTimestamp() -> gettimeofday() implemented on Windows via
GetSystemTimePreciseAsFileTime(), and it has 100ns resolution,
according to [1].
I tried it on a Windows VM and saw the following numbers:
SELECT :'prev_stats_reset' < stats_reset, :'prev_stats_reset', stats_reset FROM pg_stat_subscription_stats WHERE subname
= 'regress_testsub';
?column? | ?column? | stats_reset
----------+-------------------------------------+-------------------------------------
t | Fri Jul 04 11:46:10.224323 2025 PDT | Fri Jul 04 11:46:10.224975 2025 PDT
(1 row)
that is, the difference is about 600 ns — if I could make the VM 6x faster
or run the test on a bare metal, maybe I'd get the test failed.
Although the overall duration of the subscription doesn't differ much:
My VM:
# parallel group (2 tests): subscription publication
ok 164 + publication 806 ms
ok 165 + subscription 210 ms
vs
hamerkop:
# parallel group (2 tests): subscription publication
ok 164 + publication 2383 ms
not ok 165 + subscription 142 ms
So probably the failure is caused by something else.
> 2) pg_stat_reset_subscription_stats(oid) function did not reset the stats.
> We have a shared hash 'pgStatLocal.shared_hash'. If the entry
> reference (for the subscription) is not found while executing
> 'pg_stat_reset_subscription_stats(oid)'. It may not be able to reset
> the stats. Maybe somehow this shared hash is getting dropped..
> Also, it could be failing due to the same reason as Alexander has
I don't think 2) is relevant here, because shared buffers shouldn't affect
subscription's statistics.
[1] https://stackoverflow.com/questions/4444090/measuring-time-with-a-resolution-of-microseconds-in-c
Best regards.
Alexander
From | Date | Subject | |
---|---|---|---|
Next Message | Noah Misch | 2025-07-05 00:16:28 | walwriter can set XLP_BKP_REMOVABLE wrongly: race w/ backup start |
Previous Message | Tomas Vondra | 2025-07-04 18:12:01 | Re: Adding basic NUMA awareness |