Re: BUG #15449: file_fdw using program cause exit code error when using LIMIT

From: Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, eric(dot)cyr(at)gmail(dot)com, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #15449: file_fdw using program cause exit code error when using LIMIT
Date: 2018-11-07 03:44:33
Message-ID: 5BE25FA1.5070308@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

(2018/11/06 19:50), Thomas Munro wrote:
> On Wed, Oct 24, 2018 at 1:21 AM Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I wrote:
>>> =?utf-8?q?PG_Bug_reporting_form?=<noreply(at)postgresql(dot)org> writes:
>>>> SELECT * FROM test_file_fdw_program_limit LIMIT 0;
>>>> /*
>>>> [38000] ERROR: program "echo "test"" failed Detail: child process exited
>>>> with exit code 1
>>>> */
>>
>>> Yeah, I can reproduce this on macOS as well as Linux. Capturing stderr
>>> shows something pretty unsurprising:
>>> sh: line 1: echo: write error: Broken pipe
>>> So the called program is behaving in a somewhat reasonable way: it's
>>> detecting EPIPE on its stdout (after we close the pipe), reporting that,
>>> and doing exit(1).
>>> Unfortunately, it's not clear what we could do about that, short of
>>> always reading the whole program output, which is likely to be a
>>> cure worse than the disease. If the program were failing thanks to
>>> SIGPIPE, we could recognize that as a case we can ignore ... but with
>>> behavior like this, I don't see a reliable way to tell it apart from
>>> a generic program failure, which surely we'd better report.
>>
>> After a bit of thought, the problem here is blindingly obvious:
>> we generally run the backend with SIGPIPE handing set to SIG_IGN,
>> and evidently popen() allows the called program to inherit that,
>> at least on these platforms.
>>
>> So what we need to do is make sure the called program inherits SIG_DFL
>> handling for SIGPIPE, and then special-case that result as not being
>> a failure. The attached POC patch does that and successfully makes
>> the file_fdw problem go away for me.
>>
>> It's just a POC because there are some things that need more thought
>> than I've given them:
>>
>> 1. Is it OK to revert SIGPIPE to default processing for *all* programs
>> launched through OpenPipeStream? If not, what infrastructure do we
>> need to add to control that? In particular, is it sane to revert
>> SIGPIPE for a pipe program that we will write to not read from?
>> (I thought probably yes, because that is the normal Unix behavior,
>> but it could be debated.)
>>
>> 2. Likewise, is it always OK for ClosePipeToProgram to ignore a
>> SIGPIPE failure? (For ordinary COPY, maybe it shouldn't, since
>> we don't intend to terminate that early.)
>
> I'm not sure about that. It might in theory be telling you about some
> other pipe. If you're OK with that false positive, why not ignore all
> errors after you've read enough successful input and decided to close
> the pipe?

It's unfortunate to have that false positive, but in my opinion I think
we had better to error out if there is something wrong with the called
program, because in that case I think the data that we read from the
pipe might not be reliable. IMO I think it would be the responsibility
of the called program to handle/ignore SIGPIPE properly if necessary.

>> 3. Maybe this should be implemented at some higher level?
>
> It won't work for some programs that ignore or catch the signal, so in
> theory you might want to give users the power/responsibility to say
> "ignore errors that occur after I decide to hang up". Here are three
> different behaviours I found in popular software, showing termination
> by signal, custom error handling that we can't distinguish, and a
> bonehead strategy:

Interesting!

> $ seq 1 1000000 | head -5
> 1
> 2
> 3
> 4
> 5
> ... exit code indicates killed by signal
>
> $ python -c "for i in range(1000000): print i" | head -5
> 0
> 1
> 2
> 3
> 4
> Traceback (most recent call last):
> File "<string>", line 1, in<module>
> IOError: [Errno 32] Broken pipe
> ... exit code 1

That's sad.

> $ cat Test.java
> public class Test {
> public static void main(String[] args) {
> for (int i = 0; i< 1000000; ++i) {
> System.out.println(Integer.toString(i));
> }
> }
> }
> $ javac Test.java
> $ java Test | head -5
> 0
> 1
> 2
> 3
> 4
> ... wait a really long time with no output, exit code 0
>
> (Explanation: JVMs ignore SIGPIPE and usually convert EPIPE into an IO
> exception, except for PrintStreams like System.out which just eat data
> after an error...)

I agree that that is a bonehead strategy, but that seems not that bad to me.

>> 4. Are there any other signals we ought to be reverting to default
>> behavior before launching a COPY TO/FROM PROGRAM?
>
> On my FreeBSD system, I compared the output of procstat -i (= show
> signal disposition) for two "sleep 60" processes, one invoked from the
> shell and the other from COPY ... FROM PROGRAM. The differences were:
> PIPE, TTIN, TTOU and USR2. For the first and last of those, the
> default action would be to terminate the process, but the COPY PROGRAM
> child ignored them; for TTIN and TTOU, the default action would be to
> stop the process, but again they are ignored. Why do bgwriter.c,
> startup.c, ... set SIGTTIN and SIGTTOU back to SIG_DFL, but not
> regular backends?

So, we should revert SIGUSR2 as well to default processing?

Thanks!

Best regards,
Etsuro Fujita

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message PG Bug reporting form 2018-11-07 05:01:56 BUG #15489: Segfault on DELETE
Previous Message Amit Langote 2018-11-07 01:45:57 Re: BUG #15212: Default values in partition tables don't work as expected and allow NOT NULL violation

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2018-11-07 03:44:53 Re: speeding up planning with partitions
Previous Message Thomas Munro 2018-11-07 03:39:59 Re: DNS SRV support for LDAP authentication