From: | Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com> |
---|---|
To: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
Cc: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Subject: | Re: Non-reproducible AIO failure |
Date: | 2025-06-05 22:13:52 |
Message-ID: | CAEze2WhTcBjgzOPzD0xYdw-F-HjYzGNKEyQB9=FP9XneFZWESg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Thu, 5 Jun 2025 at 21:00, Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
>
> Hello Thomas and Andres,
>
> 04.06.2025 23:32, Thomas Munro wrote:
> > On Thu, Jun 5, 2025 at 8:02 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> >> On 2025-06-03 08:00:01 +0300, Alexander Lakhin wrote:
> >>> 2025-06-03 00:19:09.282 EDT [25175:1] LOG: !!!pgaio_io_before_start| ioh:
> >>> 0x104c3e1a0, ioh->op: 1, ioh->state: 1, ioh->result: 0, ioh->num_callbacks:
> >>> 2, ioh->generation: 21694
> >> But here it somehow turned to 1 (PGAIO_OP_READV), despite there not being any
> >> "normal" reason for that. We know that the IO wasn't actually started as
> >> otherwise pgaio_io_start_readv() would have logged that fact.
> > A cheap/easy thing to try, maybe: give that enumeration more
> > distinctive values like 0xaaaa, 0xbbbb, or whatever, to see if we get
> > a wild 1 here?
I have a very wild guess that's probably wrong in a weird way, but
here goes anyway:
Did anyone test if interleaving the enum-typed bitfield fields of
PgAioHandle with the uint8 fields might solve the issue? I.e. {state,
flags, target, num_callbacks, op} instead of {state, target, op,
flags, num_callbacks}?
I've looked at some decompiled code generated for
pgaio_io_before_start and pgaio_io_start_readv. pgaio_io_start_readv
looks rather normal, but in pgaio_io_before_start it's clear that
pgaio_io_before_start does some weird things as a response to the
bitfields of PgAioHandle:
(manually annotated)
_pgaio_io_before_start:
00000001002eaba8 stp x20, x19, [sp, #-0x20]! /* save registers */
00000001002eabac stp x29, x30, [sp, #0x10] /* save registers */
00000001002eabb0 add x29, sp, #0x10 /* x29 := stack+10 */
00000001002eabb4 ldrb w8, [x0] /* w8 := ioh->state */
00000001002eabb8 cmp w8, #0x1 /* w8 ??
PGAIO_HS_HANDED_OUT */
00000001002eabbc b.ne 0x1002eac38
w8 here will contain the value 0x1.
00000001002eabc0 mov x19, x0 /* x19 := ioh */
00000001002eabc4 adrp x8, 1461 ; 0x10089f000 /* x8 := 0x10089f000 */
00000001002eabc8 add x8, x8, #0xd40 /* x8 += 0xd40
(0x10089fd40) */
00000001002eabcc ldr x8, [x8] /* x8 :=
pgaio_my_backend */
00000001002eabd0 ldr x8, [x8, #0x20] /* x8 :=
pgaio_my_backend->handed_out_io */
00000001002eabd4 cmp x8, x0 /* x8
(pgaio_my_backend->handed_out_io) ?? x19 (ioh) */
00000001002eabd8 b.ne 0x1002eac50
00000001002eabdc mov x0, x19 /* x0 := ioh */
00000001002eabe0 bl _pgaio_io_has_target
00000001002eabe4 tbz w0, #0x0, 0x1002eac68 /*
_pgaio_io_has_target(ioh)? */
00000001002eabe8 ldrb w8, [x19, #0x2] /* w8 := ioh->op */
w8 now contains ioh->op, which is presumed to be 0x0.
00000001002eabec ldrh w9, [x19] /* w9 :=
ioh->[target | state] */
00000001002eabf0 orr w8, w9, w8, lsl #16 /* w8 := w9 |
(w8 << 16) // [ op | [target | state] ]*/
But here more operations are applied, pushing in the preceding
bitfields of PgAioHandle, mixing in various values into w8.
00000001002eabf4 cmp w8, #0x10, lsl #12 /* w8 ??
0x1'00'00 // op'state'target ?? 0x1'00'00 */
And finally the full set of bitfields is compared against a target
value, rather than just ioh->op. This means we do 2 memory accesses
for this field.
00000001002eabf8 b.hs 0x1002eac80 /* >= ?? ERROR : continue */
And finally we branch into the setup for _ExceptionalCondition.
This shows that b->op is not exactly a stand-alone field, and might be
influenced by the other sibling bitfield fields. If we changed the
order of the fields to not have continuous bitfields, that might be
able to convince the compiler that there are no data dependencies
between state/target/op, and remove the two-phase load operation.
(checks more decompiled code)
Oh, the bitfields are not just read with multiple accesses, but also
*written* using multiple accesses. See pgaio_io_set_target:
_pgaio_io_set_target:
00000001002eafdc stp x29, x30, [sp, #-0x10]!
00000001002eafe0 mov x29, sp
00000001002eafe4 ldrb w8, [x0, #0x2]
00000001002eafe8 ldrh w9, [x0]
00000001002eafec orr w8, w9, w8, lsl #16
00000001002eaff0 and w9, w8, #0xff
00000001002eaff4 cmp w9, #0x1
00000001002eaff8 b.ne 0x1002eb01c
00000001002eaffc tst w8, #0xff00
00000001002eb000 b.ne 0x1002eb034
00000001002eb004 lsr w9, w8, #16
00000001002eb008 bfi w8, w1, #8, #24
00000001002eb00c strb w9, [x0, #0x2] /* ioh->op = w9 */
00000001002eb010 strh w8, [x0] /* ioh->[state, target] = w8 */
... and pgaio_io_update_state:
_pgaio_io_update_state:
00000001002e7f08 sub sp, sp, #0x70
00000001002e7f0c stp x24, x23, [sp, #0x30]
00000001002e7f10 stp x22, x21, [sp, #0x40]
00000001002e7f14 stp x20, x19, [sp, #0x50]
00000001002e7f18 stp x29, x30, [sp, #0x60]
00000001002e7f1c add x29, sp, #0x60
00000001002e7f20 mov x19, x1
00000001002e7f24 mov x20, x0
00000001002e7f28 mov w0, #0xa
00000001002e7f2c mov x1, #0x0
00000001002e7f30 bl _errstart
00000001002e7f34 cbz w0, 0x1002e8010
[...] /* skip pgaio_debug_io log handling */
00000001002e8010 dmb ish
00000001002e8014 ldurh w8, [x20, #0x1]
00000001002e8018 bfi w19, w8, #8, #16
00000001002e801c lsr w8, w8, #8
00000001002e8020 strb w8, [x20, #0x2] /* store ioh->op */
00000001002e8024 strh w19, [x20] /* store ioh->{state, target} */
I'm sending this now without looking much deeper into this due to
$err_date_changed and other time-related constraints, but might this
be a candidate for further debugging research?
Kind regards,
Matthias van de Meent
Databricks Inc.
From | Date | Subject | |
---|---|---|---|
Next Message | Cary Huang | 2025-06-05 22:31:44 | Re: sslinfo extension - add notbefore and notafter timestamps |
Previous Message | Cary Huang | 2025-06-05 21:24:25 | Re: Support tid range scan in parallel? |