Re: GNU/Hurd portability patches

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Michael Banck <mbanck(at)gmx(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: GNU/Hurd portability patches
Date: 2025-11-10 19:00:01
Message-ID: fa85e679-9d13-43ae-8882-3f50c709f446@gmail.com
Lists: pgsql-hackers

Hello Thomas and Michael!

Sorry for the delay. I've finally completed a new round of experiments and
discovered the following:

12.10.2025 03:42, Thomas Munro wrote:
> Hmm. We only install the handler for real signal numbers, and it
> clearly managed to find the handler, so then how did it corrupt signo
> before calling the function? I wonder if there could be concurrency bugs
> reached by our perhaps unusually large amount of signaling (we have
> found bugs in the signal implementations of several other OSes...).
> This might be the code:
>
> https://github.com/bminor/glibc/blob/master/hurd/hurdsig.c#L639
>
> It appears to suspend the thread selected to handle the signal, mess
> with its stack/context and then resume it, just like traditional
> monokernels, it's just done in user space by code running in a helper
> thread that communicates over Mach ports. So it looks like I
> misunderstood that comment in the docs, it's not the handler itself
> that runs in a different thread, unless I'm looking at the wrong code
> (?).
>
> Some random thoughts after skim-reading that and
> glibc/sysdeps/mach/hurd/x86/trampoline.c:
> * I wonder if setting up sigaltstack() and then using SA_ONSTACK in
> pqsignal() would behave differently, though SysV AMD64 calling
> conventions (used by Hurd IIUC) have the first argument in %rdi, not
> the stack, so I don't really expect that to be relevant...
> * I wonder about the special code paths for handlers that were already
> running and happened to be in sigreturn(), or something like that,
> which I didn't study at all, but it occurred to me that our pqsignal
> will only block the signal itself while running a handler (since it
> doesn't specify SA_NODEFER)... so what happens if you block all
> signals while running each handler by changing
> sigemptyset(&act.sa_mask) to sigfillset(&act.sa_mask)?

Thank you for the suggestion!

With this modification:
@@ -137,7 +140,7 @@ pqsignal(int signo, pqsigfunc func)

 #if !(defined(WIN32) && defined(FRONTEND))
        act.sa_handler = func;
-       sigemptyset(&act.sa_mask);
+       sigfillset(&act.sa_mask);
        act.sa_flags = SA_RESTART;

I got 100 iterations to pass (12 of them hung) without that Assert being
triggered.
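
I haven't tried the sigaltstack()/SA_ONSTACK variant from your first point;
for reference, my understanding of that setup is sketched below (a minimal
standalone program with made-up names, not tested on Hurd):

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

/* Deliberately trivial handler; real code would only use async-signal-safe calls. */
static void
demo_handler(int signo)
{
	(void) signo;
}

int
main(void)
{
	stack_t		ss;
	struct sigaction act;

	/* Register a dedicated stack for signal handlers. */
	ss.ss_sp = malloc(SIGSTKSZ);
	ss.ss_size = SIGSTKSZ;
	ss.ss_flags = 0;
	if (ss.ss_sp == NULL || sigaltstack(&ss, NULL) < 0)
	{
		perror("sigaltstack");
		return 1;
	}

	/* SA_ONSTACK asks for delivery on the alternate stack. */
	act.sa_handler = demo_handler;
	sigemptyset(&act.sa_mask);
	act.sa_flags = SA_RESTART | SA_ONSTACK;
	if (sigaction(SIGUSR1, &act, NULL) < 0)
	{
		perror("sigaction");
		return 1;
	}

	raise(SIGUSR1);
	puts("signal handled");
	return 0;
}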

> * I see special code paths for SIGIO and SIGURG that I didn't try to
> understand, but I wonder what would happen if we s/SIGURG/SIGXCPU/

With sed 's/SIGURG/SIGXCPU/' -i src/backend/storage/ipc/waiteventset.c, I
still got:
!!!wrapper_handler[8401]| postgres_signal_arg: 28565808, PG_NSIG: 33
TRAP: failed Assert("postgres_signal_arg < PG_NSIG"), File: "pqsignal.c", Line: 94, PID: 8401
...
2025-11-09 12:51:24.095 GMT postmaster[7282] LOG:  client backend (PID 8401) was terminated by signal 6: Aborted
2025-11-09 12:51:24.095 GMT postmaster[7282] DETAIL:  Failed process was running: UPDATE PKTABLE set ptest2=5 where
ptest2=2;
---

!!!wrapper_handler[21000]| postgres_signal_arg: 28545040, PG_NSIG: 33
TRAP: failed Assert("postgres_signal_arg < PG_NSIG"), File: "pqsignal.c", Line: 94, PID: 21000
...
2025-11-09 13:06:59.458 GMT postmaster[20669] LOG:  client backend (PID 21000) was terminated by signal 6: Aborted
2025-11-09 13:06:59.458 GMT postmaster[20669] DETAIL:  Failed process was running: UPDATE pvactst SET i = i WHERE i < 1000;
---
!!!wrapper_handler[21973]| postgres_signal_arg: 28562608, PG_NSIG: 33
TRAP: failed Assert("postgres_signal_arg < PG_NSIG"), File: "pqsignal.c", Line: 94, PID: 21973

2025-11-09 14:56:23.955 GMT postmaster[20665] LOG:  client backend (PID 21973) was terminated by signal 6: Aborted
2025-11-09 14:56:23.955 GMT postmaster[20665] DETAIL:  Failed process was running: INSERT INTO pagg_tab_m SELECT i % 30,
i % 40, i % 50 FROM generate_series(0, 2999) i;

The failure rate is approximately 1 per 30 runs.
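
(For clarity: the "!!!wrapper_handler" lines come from a debugging fprintf()
I put just before the failing Assert in wrapper_handler() in
src/port/pqsignal.c, roughly as below; reconstructed from memory, so the
exact code may differ:

	/* fprintf() is not async-signal-safe, but good enough for a throwaway diagnostic */
	fprintf(stderr, "!!!wrapper_handler[%d]| postgres_signal_arg: %d, PG_NSIG: %d\n",
			(int) getpid(), postgres_signal_arg, (int) PG_NSIG);
	Assert(postgres_signal_arg < PG_NSIG);

A value like 28565808 therefore means the signal number argument itself
arrives corrupted in the handler.)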

Besides that Assert and the hangs, I also observed:
--- /home/demo/postgresql/src/test/regress/expected/xml.out 2025-10-11 10:04:43.000000000 +0100
+++ /home/demo/postgresql/src/test/regress/results/xml.out 2025-11-10 07:20:56.000000000 +0000
@@ -1788,10 +1788,14 @@
                                          proargtypes text))
    SELECT * FROM z
    EXCEPT SELECT * FROM x;
- proname | proowner | procost | pronargs | proargnames | proargtypes
----------+----------+---------+----------+-------------+-------------
-(0 rows)
-
+ERROR:  could not parse XML document
+DETAIL:  line 1: Input is not proper UTF-8, indicate encoding !
+Bytes: 0x92 0x11 0x69 0x3C
+<data>X~R^Qi<proc><proname>pg_get_replication_slots</proname><proowner>10</proowne
+       ^
+line 1: PCDATA invalid Char value 17
+<data>X~R^Qi<proc><proname>pg_get_replication_slots</proname><proowner>10</proowne
+

TRAP: failed Assert("AllocBlockIsValid(block)"), File: "aset.c", Line: 1536, PID: 16354
...
2025-11-09 10:21:16.249 GMT postmaster[15242] LOG:  client backend (PID 16354) was terminated by signal 6: Aborted
2025-11-09 10:21:16.249 GMT postmaster[15242] DETAIL:  Failed process was running: CREATE INDEX i_bmtest_a ON bmscantest(a);
2025-11-09 10:21:16.249 GMT postmaster[15242] LOG:  terminating any other active server processes

TRAP: failed Assert("npages == tbm->npages"), File: "tidbitmap.c", Line: 825, PID: 4641
...
2025-10-14 12:09:00.555 BST postmaster[3818] LOG:  client backend (PID 4641) was terminated by signal 6: Aborted
2025-10-14 12:09:00.555 BST postmaster[3818] DETAIL:  Failed process was running: select count(*) from tenk1, tenk2
where tenk1.hundred > 1 and tenk2.thousand=0;

--- /home/demo/postgresql/src/test/regress/expected/join_hash.out 2025-10-11 10:04:34.000000000 +0100
+++ /home/demo/postgresql/src/test/regress/results/join_hash.out 2025-10-14 11:30:16.000000000 +0100
@@ -485,20 +485,12 @@
 (8 rows)

 select count(*) from simple r join extremely_skewed s using (id);
- count
--------
- 20000
-(1 row)
-
+ERROR:  could not read from temporary file: read only 411688 of 47854847 bytes

--- /home/demo/postgresql/src/test/regress/expected/bitmapops.out 2025-10-11 10:04:29.000000000 +0100
+++ /home/demo/postgresql/src/test/regress/results/bitmapops.out 2025-10-14 11:08:58.000000000 +0100
@@ -13,6 +13,10 @@
   SELECT (r%53), (r%59),
'foooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo'
   FROM generate_series(1,70000) r;
 CREATE INDEX i_bmtest_a ON bmscantest(a);
+ERROR:  index row size 6736 exceeds btree version 4 maximum 2704 for index "i_bmtest_a"

30.10.2025 17:30, Michael Banck wrote:
> I checked this, if I just run the following excerpt of
> entry_timestamp.sql in a tight loop, I get a few (<10) occurrances out
> of 10000 iterations where min/max plan time is 0 (or rather
> minmax_plan_zero is non-zero):
>
> SELECT pg_stat_statements_reset();
> SET pg_stat_statements.track_planning = TRUE;
> SELECT 1 AS "STMTTS1";
> SELECT
> count(*) as total,
> count(*) FILTER (
> WHERE min_plan_time + max_plan_time = 0
> ) as minmax_plan_zero
> FROM pg_stat_statements
> WHERE query LIKE '%STMTTS%';
>
> On the assumption that this isn't a general bug, but just a timing issue
> (planning 'SELECT 1' isn't complicated), I see two possibilities:
>
> 1. Ignore the plan times, and replace SELECT 1 with SELECT
> pg_sleep(1e-6), similar to e849bd551. I guess this would reduce test
> coverage, so it would likely not be great?
>
> 2. Make the query a bit more complicated so that the plan time is likely
> to be non-negligible. I actually had to go quite a way to make it pretty
> failsafe; the attached made it fail fewer than 5 times out of 50000
> iterations. Not sure whether that is acceptable or still considered
> flaky?

What concerns me is that subscription.sql, and perhaps other tests, also
expect at least 1000 ns (far from infinite) timer resolution. It would
probably make sense to define which timer resolution we consider acceptable
for tests and then check whether Hurd can provide it.
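
A quick way to get a first impression of what the platform delivers is a
check along the lines of the sketch below (names and structure are mine, not
from any patch); if I remember correctly, instr_time.h uses clock_gettime()
with a monotonic clock on non-Windows platforms, so that is the clock worth
measuring:

#include <stdio.h>
#include <time.h>

int
main(void)
{
	struct timespec res, a, b;
	long		min_delta = -1;

	/* Advertised resolution of the monotonic clock. */
	if (clock_getres(CLOCK_MONOTONIC, &res) == 0)
		printf("clock_getres(CLOCK_MONOTONIC): %ld.%09ld s\n",
			   (long) res.tv_sec, res.tv_nsec);

	/* Smallest non-zero difference between two consecutive readings. */
	for (int i = 0; i < 1000000; i++)
	{
		clock_gettime(CLOCK_MONOTONIC, &a);
		clock_gettime(CLOCK_MONOTONIC, &b);

		long		delta = (b.tv_sec - a.tv_sec) * 1000000000L
			+ (b.tv_nsec - a.tv_nsec);

		if (delta > 0 && (min_delta < 0 || delta < min_delta))
			min_delta = delta;
	}
	printf("smallest observed non-zero delta: %ld ns\n", min_delta);
	return 0;
}

If the smallest observed delta is well above 1000 ns, tests that expect two
nearby timestamps to differ (the pg_stat_statements checks above,
subscription.sql) are bound to be flaky on such a platform.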

Best regards,
Alexander
