Re: postgres_fdw test timeouts

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: postgres_fdw test timeouts
Date: 2023-12-07 09:00:01
Message-ID: 8051993b-e7ae-42fc-4b57-ef4912d50035@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello Thomas,

03.12.2023 09:00, Thomas Munro wrote:
> On Sun, Dec 3, 2023 at 6:00 PM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
>> I've managed to reproduce the failure locally when running postgres_fdw_x/
>> regress in parallel (--num-processes 10). It reproduced for me on
>> on 04a09ee94 (iterations 1, 2, 4), but not on 04a09ee94~1 (30 iterations
>> passed).
>>
>> I'm going to investigate this case within days. Maybe we could find a
>> better fix for the issue.
> Thanks. One thing I can recommend to anyone trying to understand the
> change is that you view it with:

I think, I've found out what's going on here.
The culprit is WSAEnumNetworkEvents() assisted by non-trivial logic of
ExecAppendAsyncEventWait().
For the case noccurred > 1, ExecAppendAsyncEventWait() performs a loop,
where ExecAsyncNotify() is called for the first AsyncRequest, but the
second one also processed inside, through a recursive call to
ExecAppendAsyncEventWait():
 -> ExecAsyncNotify -> produce_tuple_asynchronously
-> ExecScan -> ExecInterpExpr -> ExecSetParamPlan -> ExecProcNodeFirst
-> ExecAgg -> agg_retrieve_direct -> ExecProcNodeInstr -> ExecAppend
-> ExecAppendAsyncEventWait.
Here we get into the first loop and call ExecAsyncConfigureWait() for the
second AsyncRequest (because we haven't reset it's callback_pending yet),
and it leads to creating another WaitEvent for the second socket inside
postgresForeignAsyncConfigureWait():
    AddWaitEventToSet(set, WL_SOCKET_READABLE, PQsocket(fsstate->conn), ...

This WaitEvent seemingly misses an event that we should get for that socket.
It's not that important to get noccured > 1 in
ExecAppendAsyncEventWait() to see the failure, it's enough to call
WSAEnumNetworkEvents() inside WaitEventSetWaitBlock() for the second socket
(I tried to exit from the WaitEventSetWaitBlock's new loop prematurely,
without touching occurred_events, returned_events on a second iteration of
the loop).

So it looks like we have the same issue with multiple event handles
associated with a single socket here.
And v2-0003-Redesign-Windows-socket-event-management.patch from [1]
"surprisingly" helps in this case as well (I could not see a failure for
100 iterations of 10 tests in parallel).

[1] https://www.postgresql.org/message-id/CA%2BhUKGL0bikWSC2XW-zUgFWNVEpD_gEWXndi2PE5tWqmApkpZQ%40mail.gmail.com

Best regards,
Alexander

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message shveta malik 2023-12-07 09:07:26 Re: Synchronizing slots from primary to standby
Previous Message Junwang Zhao 2023-12-07 08:46:53 Re: Make COPY format extendable: Extract COPY TO format implementations