Re: BUG #15449: file_fdw using program cause exit code error when using LIMIT

From: Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Eric Cyr <eric(dot)cyr(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #15449: file_fdw using program cause exit code error when using LIMIT
Date: 2018-11-08 12:52:31
Message-ID: 5BE4318F.4040002@lab.ntt.co.jp
Lists: pgsql-bugs pgsql-hackers

(2018/11/08 10:50), Thomas Munro wrote:
> On Wed, Nov 7, 2018 at 11:03 PM Etsuro Fujita
> <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> (2018/11/07 9:22), Kyotaro HORIGUCHI wrote:
>>> At Tue, 06 Nov 2018 21:07:37 +0900, Etsuro Fujita<fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp> wrote in<5BE18409(dot)2070004(at)lab(dot)ntt(dot)co(dot)jp>
>>>> (2018/11/06 12:53), Kyotaro HORIGUCHI wrote:
>>>>> even if it comes from another pipe, we can assume that the
>>>>> SIGPIPE immediately stopped the program before returning any
>>>>> garbage to us.
>>>>
>>>> Sorry, I don't understand this.
>>>
>>> Mmm. It looks confusing, sorry. In other words:
>>>
>>> We don't care about the reason for the program's ending if we
>>> received enough data from it and we actively (= before receiving
>>> EOF) stopped reading. On the other hand, if SIGPIPE (on another
>>> pipe) or something fatal happened to the program before we finished
>>> reading, we receive EOF just after that, and then we are interested
>>> in the cause of the program's death.
>>>
>>> # Does the above make sense?
>>
>> Yeah, thanks for the explanation!
>
> +1
>
> I take back what I said earlier about false positives from other
> pipes. I think it's only traditional Unix programs designed for use
> in pipelines and naive programs that let SIGPIPE terminate the
> process. The accepted answer here gives a good way to think about
> it:
>
> https://stackoverflow.com/questions/8369506/why-does-sigpipe-exist

Thanks for the information!
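
For the archives, here is a toy example I put together (not from any
patch) showing the behavior that answer describes: a naive writer that
leaves SIGPIPE at SIG_DFL is killed by signal 13 as soon as its reader
goes away.

/*
 * Toy example only: the child keeps writing with SIGPIPE at the
 * default disposition; once the parent stops reading and closes the
 * pipe, the child's next write() raises SIGPIPE and terminates it.
 */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(void)
{
    int     fd[2];
    pid_t   pid;

    if (pipe(fd) != 0 || (pid = fork()) < 0)
        exit(1);

    if (pid == 0)
    {
        /* Child: keep writing, with SIGPIPE left at the default. */
        close(fd[0]);
        for (;;)
        {
            if (write(fd[1], "test\n", 5) < 0)
                _exit(1);       /* not reached: SIGPIPE kills us first */
        }
    }
    else
    {
        char    buf[64];
        int     status;

        /* Parent: read a little, then stop reading. */
        close(fd[1]);
        read(fd[0], buf, sizeof(buf));
        close(fd[0]);

        waitpid(pid, &status, 0);
        if (WIFSIGNALED(status) && WTERMSIG(status) == SIGPIPE)
            printf("child was terminated by signal 13: Broken pipe\n");
    }
    return 0;
}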

> A program sophisticated enough to be writing to other pipes is no
> longer in that category and should be setting up signal dispositions
> itself, so I agree that we should enable the default disposition and
> ignore WTERMSIG(exit_code) == SIGPIPE, as proposed. That is pretty
> close to the intended purpose of that signal AFAICS.

Great!

>>> In the sense of "we don't care about the reason", the negligible
>>> reasons are not necessarily restricted to SIGPIPE; even SIGSEGV
>>> could theoretically be ignored safely. "Theoretically" here means
>>> that it is a separate issue whether we rely on the output from a
>>> program that died of SIGSEGV (or of any cause other than the
>>> SIGPIPE that we ourselves caused).
>>
>> For the SIGSEGV case, I think it would be better not to rely on the
>> output data, because the program might have generated that data
>> incorrectly or unexpectedly.
>
> +1
>
> I don't think we should ignore termination by signals other than
> SIGPIPE: that could hide serious problems from users. I want to know
> if my program is crashing with SIGBUS, SIGTERM, SIGFPE etc, even if it
> happens after we read enough data; there is a major problem that a
> human needs to investigate!

I think so too.

>>>>> Finally, in the attached PoC, I set cstate->eof_reached on failure
>>>>> of NextCopyFromRawFields; then, if eof_reached is set, we don't
>>>>> ignore SIGPIPE. For a program like
>>>>> "puts("test1");fflush(stdout);kill(getpid(), SIGPIPE);", I got
>>>>> the following behavior. (I got an error without the fflush(), but
>>>>> that is inevitable.)
>>>>>
>>>>> =# select * from ft2 limit 1;
>>>>> a
>>>>> -------
>>>>> test1
>>>>>
>>>>> =# select * from ft2 limit 2;
>>>>> ERROR: program "/home/horiguti/work/exprog" failed
>>>>> DETAIL: child process was terminated by signal 13: Broken pipe
>>>>>
>>>>> For the original case:
>>>>>
>>>>> =# select * from ft1 limit 1;
>>>>> a
>>>>> ------
>>>>> test
>>>>> =# select * from ft1 limit 2;
>>>>> a
>>>>> ------
>>>>> test
>>>>> (1 row)
>>>>>
>>>>>
>>>>> I haven't confirmed whether it is complete.
>>>>
>>>> Sorry, I don't understand this fully, but the reason to add the
>>>> eof_reached stuff is to avoid ignoring the SIGPIPE failure in normal
>>>> COPY FROM PROGRAM cases?
>>>
>>> Yes, if we have received EOF, the program ended before we finished
>>> reading from the pipe, so any error should be reported. Otherwise we
>>> don't care about SIGPIPE, at least. In the attached patch, all kinds
>>> of errors are ignored when the pipe is closed from the reading end
>>> by the backend.
>>
>> OK, I understand your idea. Thanks for the patch!
>
> +1
>
>>> As a result, it doesn't report an error for SELECT * FROM ft2
>>> LIMIT 1 on "main(void){puts("test1"); return 1;}".
>>>
>>> =# select * from ft limit 1;
>>> a
>>> -------
>>> test1
>>> (1 row)
>>>
>>> limit 2 reports the error.
>>>
>>> =# select * from ft limit 2;
>>> ERROR: program "/home/horiguti/work/exprog" failed
>>> DETAIL: child process exited with exit code 1
>>
>> I think this would be contrary to users' expectations: if the SELECT
>> command works for limit 1, they would expect it to work for limit 2
>> as well. So I think it would be better to error out that command for
>> limit 1 as well, as is done currently.
>
> I think it's correct that LIMIT 1 gives no error but LIMIT 2 gives an
> error. For LIMIT 1, we got all the rows we wanted, and then we closed
> the pipe. If we got a non-zero non-signal exit code, or a signal exit
> code and it was SIGPIPE (not any other signal!), then we should
> consider that to be expected.

Maybe I'm missing something, but a non-zero non-signal exit code means
that something went wrong in the called program, so I think a human had
better investigate that as well, even if it turns out to be only a minor
problem. Is that too restrictive?
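
Just so we are looking at the same thing, here is roughly how I read the
combined rule, as a sketch with a made-up helper name
(should_report_failure; eof_reached stands for the flag from the PoC),
not the actual patch:

#include <signal.h>
#include <stdbool.h>
#include <sys/wait.h>

/*
 * Sketch only, not the actual patch: given the raw status returned by
 * pclose()/waitpid(), decide whether to raise an error.  SIGPIPE is
 * forgiven only when we stopped reading before EOF (i.e. we closed the
 * pipe ourselves); any other signal, or a nonzero exit code, is still
 * reported.
 */
static bool
should_report_failure(int exit_code, bool eof_reached)
{
    if (exit_code == 0)
        return false;           /* clean exit */

    if (!eof_reached &&
        WIFSIGNALED(exit_code) &&
        WTERMSIG(exit_code) == SIGPIPE)
        return false;           /* our closing the pipe caused this */

    return true;                /* nonzero exit or other signal */
}

Under that rule, the limit 1 case raises no error only when the program
was killed by the SIGPIPE we caused, while a plain "return 1" would
still be reported.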

> I tried to think of a scenario where the already-received output is
> truly invalidated by a later error that happens after we close the
> pipe. It could be something involving a program that uses something
> optimistic like serializable snapshot isolation that can later decide
> that whatever it told you earlier is not valid after all. Suppose the
> program is clever enough to expect EPIPE and not consider that to be
> an error, but wants to tell us about serialization failure with a
> non-zero exit code. To handle that, you'd need a way to provide an
> option to file_fdw to tell it not to ignore non-zero exit codes after
> close. This seems so exotic and contrived that it's not worth
> bothering with for now, but could always be added.

Interesting! I agree that such an option could add more flexibility in
handling the non-zero-exit-code case.

> BTW just for curiosity:
>
> perl -e 'for (my $i=0; $i< 1000000; $i++) { print "$i\n"; }' | head -5
> Exit code: terminated by SIGPIPE, like seq

Good to know!

> ruby -e 'for i in 1..1000000 do puts i; end' | head -5
> Exit code: 1, like Python

Sad. Anyway, thanks a lot for these experiments in addition to the
previous ones!

> On Wed, Nov 7, 2018 at 4:44 PM Etsuro Fujita
> <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> (2018/11/06 19:50), Thomas Munro wrote:
>>> On my FreeBSD system, I compared the output of procstat -i (= show
>>> signal disposition) for two "sleep 60" processes, one invoked from the
>>> shell and the other from COPY ... FROM PROGRAM. The differences were:
>>> PIPE, TTIN, TTOU and USR2. For the first and last of those, the
>>> default action would be to terminate the process, but the COPY PROGRAM
>>> child ignored them; for TTIN and TTOU, the default action would be to
>>> stop the process, but again they are ignored. Why do bgwriter.c,
>>> startup.c, ... set SIGTTIN and SIGTTOU back to SIG_DFL, but not
>>> regular backends?
>>
>> So, should we revert SIGUSR2 to the default disposition as well?
>
> I don't think it matters in practice, but it might be nice to restore
> that just for consistency.

Agreed.
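
For what it's worth, in plain POSIX terms the kind of change we are
discussing would look roughly like the sketch below (the real code path
goes through popen(), so the details differ, and
run_program_with_default_signals is a made-up name). The point is that
ignored signals, unlike caught ones, stay ignored across exec(), so the
child has to reset them explicitly.

/*
 * Plain POSIX sketch, not the actual code path: a parent that ignores
 * SIGPIPE/SIGUSR2/SIGTTIN/SIGTTOU resets them to SIG_DFL in the child
 * between fork() and exec(), so the invoked program starts with the
 * conventional dispositions.
 */
#include <signal.h>
#include <unistd.h>

static void
run_program_with_default_signals(const char *path, char *const argv[])
{
    pid_t   pid = fork();

    if (pid == 0)
    {
        /* Child: undo the parent's SIG_IGN settings before exec'ing. */
        signal(SIGPIPE, SIG_DFL);
        signal(SIGUSR2, SIG_DFL);
        signal(SIGTTIN, SIG_DFL);
        signal(SIGTTOU, SIG_DFL);

        execv(path, argv);
        _exit(127);             /* exec failed */
    }
    /* Parent: caller is expected to waitpid() for the child. */
}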

> I'm not sure what to think about the TTIN,
> TTOU stuff; I don't understand job control well right now but I don't
> think it really applies to programs run by a PostgreSQL backend, so if
> we restore those it'd probably again be only for consistency. Then
> again, there may be a reason someone decided to ignore those in the
> postmaster + regular backends but not the various auxiliary processes.
> Anyone?

I don't have any idea about that.

Best regards,
Etsuro Fujita
