Re: BUG #15449: file_fdw using program cause exit code error when using LIMIT

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Eric Cyr <eric(dot)cyr(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #15449: file_fdw using program cause exit code error when using LIMIT
Date: 2018-11-08 01:50:35
Message-ID: CAEepm=0fBPiRkSiJ3v4ynm+aP-A-dhuHjTFBAxwo59EkE2-E5Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

On Wed, Nov 7, 2018 at 11:03 PM Etsuro Fujita
<fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> (2018/11/07 9:22), Kyotaro HORIGUCHI wrote:
> > At Tue, 06 Nov 2018 21:07:37 +0900, Etsuro Fujita<fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp> wrote in<5BE18409(dot)2070004(at)lab(dot)ntt(dot)co(dot)jp>
> >> (2018/11/06 12:53), Kyotaro HORIGUCHI wrote:
> >>> even if it comes from another pipe, we can assume that the
> >>> SIGPIPE immediately stopped the program before returning any
> >>> garbage to us.
> >>
> >> Sorry, I don't understand this.
> >
> > Mmm. It looks confusing, sorry. In other words:
> >
> > We don't care the reason for program's ending if we received
> > enough data from the program and we actively (= before receiving
> > EOF) stop reading. On the other hand if SIGPIPE (on another pipe)
> > or something fatal happened on the program before we end reading,
> > we receive EOF just after that and we are interested in the cause
> > of the program's death.
> >
> > # Does the above make sense?
>
> Yeah, thanks for the explanation!

+1

I take back what I said earlier about false positives from other
pipes. I think it's only traditional Unix programs designed for use
in pipelines and naive programs that let SIGPIPE terminate the
process. The accepted answer here gives a good way to think about
it:

https://stackoverflow.com/questions/8369506/why-does-sigpipe-exist

A program sophisticated enough to be writing to other pipes is no
longer in that category and should be setting up signal dispositions
itself, so I agree that we should enable the default disposition and
ignore WTERMSIG(exit_code) == SIGPIPE, as proposed. That is pretty
close to the intended purpose of that signal AFAICS.

> > In the sense of "We don't care the reason", negligible reasons
> > are necessariry restricted to SIGPIPE, evan SIGSEGV could be
> > theoretically ignored safely. "theoretically" here means it is
> > another issue whether we rely on the output from a program which
> > causes SEGV (or any reason other than SIGPIPE, which we caused).
>
> For the SIGSEGV case, I think it would be better that we don't rely on
> the output data, IMO, because I think there might be a possibility that
> the program have generated that data incorrectly/unexpectedly.

+1

I don't think we should ignore termination by signals other than
SIGPIPE: that could hide serious problems from users. I want to know
if my program is crashing with SIGBUS, SIGTERM, SIGFPE etc, even if it
happens after we read enough data; there is a major problem that a
human needs to investigate!

> >>> Finally in the attached PoC, I set cstate->eof_reached on failure
> >>> of NextCopyFromRawFields then if eof_reached we don't ingore
> >>> SIGPIPE. For a program like
> >>> "puts("test1");fflush(stdout);kill(getpid(), SIGPIPE);", I had
> >>> the following behavior. (I got an error without the fflush() but
> >>> it is inevitable).
> >>>
> >>> =# select * from ft2 limit 1;
> >>> a
> >>> -------
> >>> test1
> >>>
> >>> =# select * from ft2 limit 2;
> >>> ERROR: program "/home/horiguti/work/exprog" failed
> >>> DETAIL: child process was terminated by signal 13: Broken pipe
> >>>
> >>> For the original case:
> >>>
> >>> =# select * from ft1 limit 1;
> >>> a
> >>> ------
> >>> test
> >>> =# select * from ft1 limit 2;
> >>> a
> >>> ------
> >>> test
> >>> (1 row)
> >>>
> >>>
> >>> I didn't confirmed whether it is complete.
> >>
> >> Sorry, I don't understand this fully, but the reason to add the
> >> eof_reached stuff is to avoid ignoring the SIGPIPE failure in normal
> >> COPY FROM PROGRAM cases?
> >
> > Yes, if we have received EOF, the program ended before we
> > finished reading from the pipe thus any error should be
> > reported. Elsewise we don't care at least SIGPIPE. In the
> > attached patch all kind of errors are ignored when pipe is closed
> > from reading-end on backends.
>
> OK, I understand your idea. Thanks for the patch!

+1

> > As the result it doesn't report an error for SELECT * FROM ft2
> > LIMIT 1 on "main(void){puts("test1"); return 1;}".
> >
> > =# select * from ft limit 1;
> > a
> > -------
> > test1
> > (1 row)
> >
> > limit 2 reports the error.
> >
> > =# select * from ft limit 2;
> > ERROR: program "/home/horiguti/work/exprog" failed
> > DETAIL: child process exited with exit code 1
>
> I think this would be contrary to users expectations: if the SELECT
> command works for limit 1, they would expect that the command would work
> for limit 2 as well. So, I think it would be better to error out that
> command for limit 1 as well, as-is.

I think it's correct that LIMIT 1 gives no error but LIMIT 2 gives an
error. For LIMIT 1, we got all the rows we wanted, and then we closed
the pipe. If we got a non-zero non-signal exit code, or a signal exit
code and it was SIGPIPE (not any other signal!), then we should
consider that to be expected.

I tried to think of a scenario where the already-received output is
truly invalidated by a later error that happens after we close the
pipe. It could be something involving a program that uses something
optimistic like serializable snapshot isolation that can later decide
that whatever it told you earlier is not valid after all. Suppose the
program is clever enough to expect EPIPE and not consider that to be
an error, but wants to tell us about serialization failure with a
non-zero exit code. To handle that, you'd need a way to provide an
option to file_fdw to tell it not to ignore non-zero exit codes after
close. This seems so exotic and contrived that it's not worth
bothering with for now, but could always be added.

BTW just for curiosity:

perl -e 'for (my $i=0; $i < 1000000; $i++) { print "$i\n"; }' | head -5
Exit code: terminated by SIGPIPE, like seq

ruby -e 'for i in 1..1000000 do puts i; end' | head -5
Exit code: 1, like Python

On Wed, Nov 7, 2018 at 4:44 PM Etsuro Fujita
<fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> (2018/11/06 19:50), Thomas Munro wrote:
> > On my FreeBSD system, I compared the output of procstat -i (= show
> > signal disposition) for two "sleep 60" processes, one invoked from the
> > shell and the other from COPY ... FROM PROGRAM. The differences were:
> > PIPE, TTIN, TTOU and USR2. For the first and last of those, the
> > default action would be to terminate the process, but the COPY PROGRAM
> > child ignored them; for TTIN and TTOU, the default action would be to
> > stop the process, but again they are ignored. Why do bgwriter.c,
> > startup.c, ... set SIGTTIN and SIGTTOU back to SIG_DFL, but not
> > regular backends?
>
> So, we should revert SIGUSR2 as well to default processing?

I don't think it matters in practice, but it might be nice to restore
that just for consistency. I'm not sure what to think about the TTIN,
TTOU stuff; I don't understand job control well right now but I don't
think it really applies to programs run by a PostgreSQL backend, so if
we restore those it'd probably again be only for consistency. Then
again, there may be a reason someone decided to ignore those in the
postmaster + regular backends but not the various auxiliary processes.
Anyone?

--
Thomas Munro
http://www.enterprisedb.com

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message PG Bug reporting form 2018-11-08 04:30:05 BUG #15492: pg_cancel_backend(pg_backend_pid()) returns true sporadically
Previous Message Amit Langote 2018-11-08 01:33:39 Re: BUG #15489: Segfault on DELETE

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2018-11-08 01:52:09 Re: jsonpath
Previous Message Peter Geoghegan 2018-11-08 01:49:31 Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock