From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Michael Banck <mbanck(at)gmx(dot)net>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: GNU/Hurd portability patches
Date: 2025-10-12 13:00:00
Message-ID: b87a0112-6235-4d87-886b-d1c79c0e0543@gmail.com
Lists: pgsql-hackers
Hi Michael,
12.10.2025 11:31, Michael Banck wrote:
>
> Any way to easily reproduce this? It happened only once on fruitcrow so
> far.
I'd say it happens pretty often on the runs where `make check` doesn't hang
(so it takes me an hour or two to reproduce).
Though now that you've mentioned MAX_CONNECTIONS => '3', I also tried:
EXTRA_REGRESS_OPTS="--max-connections=3" make -s check
and it passed 6 iterations for me. Iteration 7 failed with:
not ok 213 + partition_aggregate 1027 ms
--- /home/demo/postgresql/src/test/regress/expected/partition_aggregate.out 2025-10-11 10:04:36.000000000 +0100
+++ /home/demo/postgresql/src/test/regress/results/partition_aggregate.out 2025-10-12 13:02:05.000000000 +0100
@@ -1476,14 +1476,14 @@
 (15 rows)
 
 SELECT x, sum(y), avg(y), sum(x+y), count(*) FROM pagg_tab_para GROUP BY x HAVING avg(y) < 7 ORDER BY 1, 2, 3;
- x  | sum  |        avg         |  sum  | count 
-----+------+--------------------+-------+-------
-  0 | 5000 | 5.0000000000000000 |  5000 |  1000
-  1 | 6000 | 6.0000000000000000 |  7000 |  1000
- 10 | 5000 | 5.0000000000000000 | 15000 |  1000
- 11 | 6000 | 6.0000000000000000 | 17000 |  1000
- 20 | 5000 | 5.0000000000000000 | 25000 |  1000
- 21 | 6000 | 6.0000000000000000 | 27000 |  1000
+ x  | sum  |            avg             |  sum  | count 
+----+------+----------------------------+-------+-------
+  0 | 5000 |         5.0000000000000000 |  5000 |  1000
+  1 | 6000 |         6.0000000000000000 |  7000 |  1000
+ 10 | 5000 | 0.000000052757140846001326 | 15000 |  1000
+ 11 | 6000 |         6.0000000000000000 | 17000 |  1000
+ 20 | 5000 |         5.0000000000000000 | 25000 |  1000
+ 21 | 6000 |         6.0000000000000000 | 27000 |  1000
 (6 rows)
Then another 6 iterations passed, and the seventh one hung. Then 10 iterations
passed.
With EXTRA_REGRESS_OPTS="--max-connections=10" make -s check, I got:
2025-10-12 13:52:58.559 BST client backend[15475] pg_regress/constraints STATEMENT: ALTER TABLE notnull_tbl2 ALTER a DROP NOT NULL;
!!!wrapper_handler[15479]| postgres_signal_arg: 30, PG_NSIG: 33
!!!wrapper_handler[15476]| postgres_signal_arg: 30, PG_NSIG: 33
!!!wrapper_handler[15476]| postgres_signal_arg: 28481392, PG_NSIG: 33
TRAP: failed Assert("postgres_signal_arg < PG_NSIG"), File: "pqsignal.c", Line: 94, PID: 15476
postgres(ExceptionalCondition+0x5a) [0x1006af78a]
postgres(+0x70f59a) [0x10070f59a]
/lib/x86_64-gnu/libc.so.0.3(+0x39fee) [0x102b89fee]
/lib/x86_64-gnu/libc.so.0.3(+0x39fdd) [0x102b89fdd]
on iteration 5.
So we can conclude that the issue with signals reproduces more readily with
higher concurrency.
28481392 (0x1b29770) is pretty close to 28476608 (0x1b284c0), which I
showed before (they differ by only 0x12b0), so the numbers are apparently
not random.
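
For reference, the check that fires is the range assertion near the top of
wrapper_handler() in src/port/pqsignal.c. Below is a rough sketch of that
function together with the debug print behind the "!!!wrapper_handler" lines
above; the fprintf is my instrumentation (reconstructed here, not a verbatim
patch), and the dispatch details are paraphrased rather than copied:

/*
 * Paraphrase of wrapper_handler() from src/port/pqsignal.c, with the debug
 * print that produced the "!!!wrapper_handler" lines above added in.
 * (Sitting inside pqsignal.c, this gets stdio/getpid/Assert via the usual
 * headers.)  SIGNAL_ARGS expands to "int postgres_signal_arg": the signal
 * number that the kernel -- or, on Hurd, glibc's signal emulation -- hands
 * to the handler.
 */
static void
wrapper_handler(SIGNAL_ARGS)
{
    int         save_errno = errno;

    /*
     * Debug instrumentation (my addition; fprintf is not async-signal-safe,
     * but it is good enough for this kind of poking around).
     */
    fprintf(stderr, "!!!wrapper_handler[%d]| postgres_signal_arg: %d, PG_NSIG: %d\n",
            (int) getpid(), postgres_signal_arg, PG_NSIG);

    /*
     * This is the assertion that fails: instead of a valid signal number
     * (0 < sig < PG_NSIG == 33 here), the handler sometimes receives a huge
     * garbage value such as 28481392.
     */
    Assert(postgres_signal_arg > 0);
    Assert(postgres_signal_arg < PG_NSIG);

    /* Dispatch to the handler previously registered via pqsignal(). */
    (*pqsignal_handlers[postgres_signal_arg]) (postgres_signal_arg);

    errno = save_errno;
}

That the garbage values look more like addresses than like small integers is
what makes me suspect the handler argument itself arrives clobbered, rather
than PG_NSIG being miscomputed.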
> I had to reboot fruitcrow last night because it had crashed, but that
> was the first time in literally weeks. I tend to reboot it once a week,
> but otherwise it ran pretty stable.
Today I also tried to test my machine with stress-ng:
stress-ng -v --class os --sequential 20 --timeout 120s
It hung or crashed at the access, brk, close, and enosys tests and never
reached the end... Some tests may pass after a restart, some fail consistently...
For example:
Fatal glibc error: ../sysdeps/mach/hurd/mig-reply.c:73 (__mig_dealloc_reply_port): assertion failed: port == arg
stress-ng: info: [9395] stressor terminated with unexpected signal 6 'SIGABRT'
backtrace:
stress-ng-enosys [run](+0xace81) [0x1000ace81]
stress-ng-enosys [run](+0x927b6c) [0x100927b6c]
/lib/x86_64-gnu/libc.so.0.3(+0x39fee) [0x1029c8fee]
/lib/x86_64-gnu/libc.so.0.3(+0x21aec) [0x1029b0aec]
> It took me a while to get there, though, before I applied for it to be a
> buildfarm animal; here is what I did:
>
> 1) (buildfarm client specific): removed "HEAD => ['debug_parallel_query =
> regress']," and set "MAX_CONNECTIONS => '3'," in build-farm.conf, to
> reduce concurrency.
Thank you for the info! I didn't specify debug_parallel_query for
`make check`, but the max-connections setting really makes the difference.
> 2. Gave the VM 4G of memory via KVM. Also set -M q35, but I guess
> you are already doing that, as it does not boot properly otherwise IME.
Mine has 4GB too.
> 3. Removed swap (this is already the case for the x86-64 2025 Debian
> image, but it was not the case for the earlier 2023 i386 image when I
> started this project). Paging to disk has been problematic and prone to
> issues (critical parts getting paged out accidentally), but this has been
> fixed over the summer so in principle running a current gnumach/hurd
> package combination from unstable should be fine again.
Yes, I have no swap enabled.
> 4. Removed tmpfs translators (so that the default-pager is not used
> anywhere, in conjunction with not setting swap, see above), by setting
> RAMLOCK=no and RAMTMP=no in /etc/default/tmpfs, as well as commenting
> out 'mount_run mount_noupdate'/'mount_tmp mount_noupdate' in
> /etc/init.d/mountall.sh and 'mount_run "$MNTMODE"' in
> /etc/init.d/mountkernfs.sh (maybe there is a more minimal change, but
> that is what I have right now).
I have RAMLOCK=no and RAMTMP=no in my /etc/default/tmpfs and can't see any
tmpfs mounts.
Thank you for your help!
Best regards,
Alexander