Re: Non-reproducible AIO failure

From: Konstantin Knizhnik <knizhnik(at)garret(dot)ru>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Non-reproducible AIO failure
Date: 2025-06-12 05:03:22
Message-ID: 1fea555c-0345-46dc-8da5-5e667cad436a@garret.ru
Lists: pgsql-hackers

I tried to catch the moment when the memory is changed, using mprotect.
I aligned PgAioHandle on a page boundary (16 kB on macOS) and disabled
writes in `pgaio_io_reclaim`:
```
static void
pgaio_io_reclaim(PgAioHandle *ioh)
{
    RESUME_INTERRUPTS();
    rc = mprotect(ioh, sizeof(*ioh), PROT_READ);
    Assert(rc == 0);
    fprintf(stderr, "!!!pgaio_io_reclaim [%d]| ioh: %p, ioh->op: %d, ioh->generation: %llu\n",
            getpid(), ioh, ioh->op, ioh->generation);
}

```

and re-enabled writes in `pgaio_io_before_start` and `pgaio_io_acquire_nb`:

```

static void
pgaio_io_before_start(PgAioHandle *ioh)
{
    int rc = mprotect(ioh, sizeof(*ioh), PROT_READ|PROT_WRITE);
    Assert(rc == 0);

```

and

```
PgAioHandle *
pgaio_io_acquire_nb(struct ResourceOwnerData *resowner, PgAioReturn *ret)
{
     ...

        ioh = dclist_container(PgAioHandle, node, ion);

        Assert(ioh->state == PGAIO_HS_IDLE);
        Assert(ioh->owner_procno == MyProcNumber);

        rc = mprotect(ioh, sizeof(*ioh), PROT_READ|PROT_WRITE);
        Assert(rc == 0);
}

```

The error is reproduced after 133 iterations:
```
!!!pgaio_io_reclaim [20376]| ioh: 0x1019bc000, ioh->op: 0, ioh->generation: 19346
!!!AsyncReadBuffers [20376] (1)| blocknum: 21, ioh: 0x1019bc000, ioh->op: 1, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 19346
2025-06-12 01:05:31.865 EEST [20376:918] pg_regress/psql LOG: !!!pgaio_io_before_start| ioh: 0x1019bc000, ioh->op: 1, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 19346
```

But no write-protection violation happened.
I do not know how to interpret this. Were the changes made by the kernel?
Or was `pgaio_io_acquire_nb` called between `pgaio_io_reclaim` and
`pgaio_io_before_start`?

I am now going to add tracing to `pgaio_io_acquire_nb`.
