Re: libpq copy error handling busted

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: libpq copy error handling busted
Date: 2020-06-04 01:53:59
Message-ID: CA+hUKG+haDNc6u0Y_QiS_5Obq0CMQVb8EBn8NQZSR=s2orLvsQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jun 4, 2020 at 1:35 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I wrote:
> > * pqSendSome() is responsible not only for pushing out data, but for
> > calling pqReadData in any situation where it can't get rid of the data
> > promptly. 1f39a1c06 overlooked that requirement, and the upshot is
> > that we don't necessarily notice that the connection is broken (it's
> > pqReadData's job to detect that). Putting a pqReadData call into
> > the early-exit path helps, but doesn't fix things completely.
>
> Ah, it's better if I put the pqReadData call into *both* the paths
> where 1f39a1c06 made pqSendSome give up. The attached patch seems
> to fix the issue for the "pgbench -i" scenario, with either fast-
> or immediate-mode server stop. I tried it with and without SSL too,
> just to see. Still, it's not clear to me whether this might worsen
> any of the situations we discussed in the lead-up to 1f39a1c06 [1].
> Thomas, are you in a position to redo any of that testing?

Yes, sure. The testing consisted of running on a system with OpenSSL
1.1.1a (older versions didn't have the problem). It originally showed
up on eelpout, a very underpowered build farm machine running Linux on
ARM64, but then later we worked out we could make it happen on a Mac
or any other Linux system if we had bad enough luck or if we added a
sleep in a particular spot. We could do it with psql running in a
loop using a bad certificate from the testing setup stuff, as shown
here:

https://www.postgresql.org/message-id/CA%2BhUKGJafyTgpsYBgQGt1EX0O8UnL4VGHSc7J0KZyMH4_jPGBw%40mail.gmail.com

I don't have access to eelpout from where I am right now, but I'll try
that test now on the Debian 10 amd64 system I have here. OpenSSL has
since moved on to 1.1.1d-0+deb10u3, but that should be fine, the
triggering change was the move to TLS1.3 so let me see what happens if
I do that with your patch applied...

> > * The more longstanding problem is that the PQputCopyData code path
> > doesn't have any mechanism for consuming an 'E' (error) message
> > once pqReadData has collected it.
>
> At least with pgbench's approach (die immediately on PQputline failure)
> this isn't very relevant once we apply the attached. Perhaps we should
> revisit this behavior anyway, but I'd be afraid to back-patch a change
> of that nature.
>
> > * As for control-C not getting out of it: there is
> > if (CancelRequested)
> > break;
> > in pgbench's loop, but this does nothing in this scenario because
> > fe-utils/cancel.c only sets that flag when it successfully sends a
> > Cancel ... which it certainly cannot if the postmaster is gone.
>
> I'll send a patch for this later.
>
> regards, tom lane
>
> [1] https://www.postgresql.org/message-id/flat/CAEepm%3D2n6Nv%2B5tFfe8YnkUm1fXgvxR0Mm1FoD%2BQKG-vLNGLyKg%40mail.gmail.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-06-04 01:57:33 Re: elog(DEBUG2 in SpinLocked section.
Previous Message Michael Paquier 2020-06-04 01:43:56 Re: elog(DEBUG2 in SpinLocked section.