| From: | Noah Misch <noah(at)leadboat(dot)com> |
|---|---|
| To: | Andres Freund <andres(at)anarazel(dot)de> |
| Cc: | Alexander Lakhin <exclusion(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, Antonin Houska <ah(at)cybertec(dot)at> |
| Subject: | Re: AIO v2.5 |
| Date: | 2025-05-09 02:22:27 |
| Message-ID: | 20250509022227.5b.nmisch@google.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Thu, May 08, 2025 at 09:06:18PM -0400, Andres Freund wrote:
> On 2025-05-02 20:05:11 -0700, Noah Misch wrote:
> > On Wed, Apr 30, 2025 at 04:00:35PM -0400, Andres Freund wrote:
> We do need to hold interrupts in a few other places, I think - with some debug
> infrastructure (things like calling ProcessBarrierSmgrRelease() whenever
> interrupts could be processed and calling CFI() in errstart() in its return
> false case) it's possible to find state confusions which trigger
> assertions. The issue is that pgaio_io_update_state() contains a
> pgaio_debug_io() and executing pgaio_closing_fd() in places that call
> pgaio_io_update_state() doesn't end well. There's a similar danger with the
> debug message in pgaio_io_reclaim().
>
> In the attached patch I added an assertion to pgaio_io_update_state()
> verifying that interrupts are held and added code to hold interrupts in the
> relevant places.
Works for me.
> > For the "no free IOs despite no in-flight IOs" case, I'd replace the
> > ereport(ERROR) with "return;" since we now know interrupt processing reclaimed
> > an IO.
>
> Hm - it seems better to me to check if there are now free handles and return
> if that's the case, but to keep the error check in case there actually is no
> free IO? That seems like a not implausible bug...
Works for me.
> > Then decide what protection if any, we need against bugs causing an
> > infinite loop in caller pgaio_io_acquire(). What's the case motivating the
> > unbounded loop in pgaio_io_acquire(), as opposed to capping at two
> > pgaio_io_acquire_nb() calls? If the theory is that pgaio_io_acquire() could
> > be reentrant, what scenario would reach that reentrancy?
>
> I do not remember why I wrote this as an endless loop. If you prefer I could
> change that as part of this patch.
I asked because it would be sad to remove the ereport(ERROR) like I proposed
and then have a bug cause a real infinite loop. Removing the loop was one way
to prove that can't happen. As you say, another way would be keeping the
ereport(ERROR) and guarding it with a free-handles check, like in your patch
today. I don't have a strong preference between those.
> It does seem rather dangerous that errstart() processes interrupts for debug
> messages, but only if the debug message is actually logged. That's really a
> recipe for hard to find bugs. I wonder if we should, at least in assertion
> mode, process interrupts even if not emitting the message.
Yes, that sounds excellent to have.
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Richard Guo | 2025-05-09 02:47:19 | Re: PG 18 release notes draft committed |
| Previous Message | Zhijie Hou (Fujitsu) | 2025-05-09 02:13:01 | RE: Fix slot synchronization with two_phase decoding enabled |