Re: Non-reproducible AIO failure

From: Konstantin Knizhnik <knizhnik(at)garret(dot)ru>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Non-reproducible AIO failure
Date: 2025-06-18 07:32:08
Message-ID: d8871a00-415f-4d31-a5ae-f0c075046d76@garret.ru
Lists: pgsql-hackers


On 17/06/2025 6:08 pm, Andres Freund wrote:
>
> I don't think it can - this must be an independent bug from the one that Tom
> and I were encountering.

I see... It's a pity.

By the way, I have a question about the use of interrupts in AIO.
The comments say:

pgaio_io_release(PgAioHandle *ioh)
                /*
                 * Note that no interrupts are processed between
                 * pgaio_io_was_recycled() and this check - that's important
                 * as otherwise an interrupt could have already reclaimed the
                 * handle.
                 */

pgaio_io_update_state(PgAioHandle *ioh, PgAioHandleState new_state)
    /*
     * All callers need to have held interrupts in some form, otherwise
     * interrupt processing could wait for the IO to complete, while in an
     * intermediary state.
     */
...

But I fail to understand how a handle can be reclaimed by an interrupt,
or how any other AIO processing can happen in interrupt handlers:
`IoWorkerMain` does not register any IO-specific interrupts. Could you
please explain how interrupts can affect AIO? I ask because I suspect
that interrupts may be the only possible explanation for such behavior.

I also tried to write a small test that reproduces the AIO data flow:

#include <assert.h>
#include <pthread.h>

#define read_barrier() __atomic_thread_fence(__ATOMIC_ACQUIRE)
#define write_barrier() __atomic_thread_fence(__ATOMIC_RELEASE)

typedef struct {
    int state:8;
    int target:8;
    int op:8;
    int result;
} Handle;

enum State { IDLE, GO, DONE };
enum Operation { NOP, READ };

void* io_thread_proc(void* arg)
{
    Handle* h = (Handle*)arg;
    while (1)
    {
        if (h->state == GO)
        {
            assert(h->op == READ);
            h->result += 1;
            write_barrier();
            h->state = DONE;
        }
    }
    return  0;
}

void* client_thread_proc(void* arg)
{
    Handle* h = (Handle*)arg;
    int expected_result = 0;
    while (1)
    {
        assert(h->op == NOP);
        assert(h->state == IDLE);
        h->op = READ;
        write_barrier();
        h->state = GO;
        while (h->state != DONE);
        read_barrier();
        h->op = NOP;
        expected_result += 1;
        assert(h->result == expected_result);
        write_barrier();
        h->state = IDLE;
    }
    return  0;
}

int main() {
    void* res;
    pthread_t client_thread, io_thread;
    Handle h = {IDLE, 0, NOP, 0};
    pthread_create(&client_thread, NULL, client_thread_proc, &h);
    pthread_create(&io_thread, NULL, io_thread_proc, &h);
    pthread_join(client_thread, &res);
    pthread_join(io_thread, &res);
    return 0;
}

It certainly works without any problems (well, I have not run it for
hours, but I do not think that is needed).
Do you think this test does something similar to Postgres AIO, or should
something be changed? (Certainly AIO does not busy-loop like this test,
but that is unlikely to matter for reproducing the problem.)
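
For comparison, here is a minimal sketch of the same handshake with the
state flag as a C11 atomic, so the compiler cannot hoist the loads out of
the spin loops. It is not how Postgres AIO is implemented, and it does not
exercise the bit-field layout of the test above, because C does not allow
atomic bit-fields. The names AtomicHandle and run_handshake and the extra
STOP state are assumptions made only for this illustration. The loop is
bounded so that it terminates, and sched_yield() is called while spinning.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* STOP is an extra state (not in the original test) used to end the IO thread. */
enum State { IDLE, GO, DONE, STOP };
enum Operation { NOP, READ };

/* Hypothetical handle: state is atomic, the other fields are published through it. */
typedef struct {
    _Atomic int state;
    int op;
    int result;
} AtomicHandle;

static void *io_thread_proc(void *arg)
{
    AtomicHandle *h = arg;

    for (;;)
    {
        int s = atomic_load_explicit(&h->state, memory_order_acquire);

        if (s == STOP)
            break;
        if (s == GO)
        {
            /* The acquire load above makes the client's write to op visible. */
            assert(h->op == READ);
            h->result += 1;
            /* The release store publishes result before the client sees DONE. */
            atomic_store_explicit(&h->state, DONE, memory_order_release);
        }
        else
            sched_yield();
    }
    return NULL;
}

/* Runs the given number of round trips; returns the final counter, or -1 on error. */
int run_handshake(int iterations)
{
    AtomicHandle h;
    pthread_t io_thread;

    atomic_init(&h.state, IDLE);
    h.op = NOP;
    h.result = 0;

    if (pthread_create(&io_thread, NULL, io_thread_proc, &h) != 0)
        return -1;

    for (int i = 0; i < iterations; i++)
    {
        h.op = READ;
        atomic_store_explicit(&h.state, GO, memory_order_release);
        while (atomic_load_explicit(&h.state, memory_order_acquire) != DONE)
            sched_yield();
        h.op = NOP;
        assert(h.result == i + 1);
        atomic_store_explicit(&h.state, IDLE, memory_order_release);
    }

    /* Tell the IO thread to exit, then wait for it. */
    atomic_store_explicit(&h.state, STOP, memory_order_release);
    pthread_join(io_thread, NULL);
    return h.result;
}
```

Calling run_handshake(1000000) from main() and building with -pthread
should return the iteration count if every round trip observed the
expected values. If this variant never fails while the bit-field version
does, that would point at how the adjacent bit-fields are loaded and
stored rather than at the barriers.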
