Re: Refactoring backend fork+exec code

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>, Tristan Partin <tristan(at)neon(dot)tech>, Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Refactoring backend fork+exec code
Date: 2023-12-03 14:41:47
Message-ID: 70e5dada-3cd4-4d7e-80eb-e841a6f35df8@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 02/12/2023 05:00, Alexander Lakhin wrote:
> What bothered me additionally, is an error detected after server start. I
> couldn't see it without the patches applied. I mean, on HEAD I now see
> `make check` passing, but with the patches it fails:
> ...
> # parallel group (20 tests):  interval date numerology polygon box macaddr8 macaddr multirangetypes line timestamp
> timetz timestamptz time circle strings lseg inet md5 path point
> not ok 22    + strings                                  1048 ms
> # (test process exited with exit code 2)
> not ok 23    + md5                                      1052 ms
> # (test process exited with exit code 2)
> ...
> src/test/regress/log/postmaster.log contains:
> ==00:00:00:30.730 1713480== Syscall param write(buf) points to uninitialised byte(s)
> ==00:00:00:30.730 1713480==    at 0x5245A37: write (write.c:26)
> ==00:00:00:30.730 1713480==    by 0x51BBF6C: _IO_file_write@@GLIBC_2.2.5 (fileops.c:1180)
> ==00:00:00:30.730 1713480==    by 0x51BC84F: new_do_write (fileops.c:448)
> ==00:00:00:30.730 1713480==    by 0x51BC84F: _IO_new_file_xsputn (fileops.c:1254)
> ==00:00:00:30.730 1713480==    by 0x51BC84F: _IO_file_xsputn@@GLIBC_2.2.5 (fileops.c:1196)
> ==00:00:00:30.730 1713480==    by 0x51B1056: fwrite (iofwrite.c:39)
> ==00:00:00:30.730 1713480==    by 0x5540CF: internal_forkexec (postmaster.c:4526)
> ==00:00:00:30.730 1713480==    by 0x5543C0: bgworker_forkexec (postmaster.c:5624)
> ==00:00:00:30.730 1713480==    by 0x555477: do_start_bgworker (postmaster.c:5665)
> ==00:00:00:30.730 1713480==    by 0x555738: maybe_start_bgworkers (postmaster.c:5928)
> ==00:00:00:30.730 1713480==    by 0x556072: process_pm_pmsignal (postmaster.c:5080)
> ==00:00:00:30.730 1713480==    by 0x556610: ServerLoop (postmaster.c:1761)
> ==00:00:00:30.730 1713480==    by 0x557BE2: PostmasterMain (postmaster.c:1469)
> ==00:00:00:30.730 1713480==    by 0x47216B: main (main.c:198)
> ==00:00:00:30.730 1713480==  Address 0x1ffeffd8c0 is on thread 1's stack
> ==00:00:00:30.730 1713480==  in frame #4, created by internal_forkexec (postmaster.c:4482)
> ==00:00:00:30.730 1713480==
> ...

Ack, I see this too. I fixed it by adding MCXT_ALLOC_ZERO to the
allocation of the BackendWorker struct. That's a little heavy-handed,
like with the previous failures the uninitialized padding bytes are
written to the file and read back, but not accessed after that. But it
seems like the simplest fix. This isn't performance critical after all.

I also renamed the 'has_worker' field to 'has_bgworker', per Tristan's
suggestion. Pushed with those changes.

Thanks for the reviews!

> 2023-12-02 05:14:30.751 MSK client backend[1713740] pg_regress/rangetypes FATAL:  terminating connection due to
> unexpected postmaster exit
> 2023-12-02 05:14:31.033 MSK client backend[1713734] pg_regress/numeric FATAL:  postmaster exited during a parallel
> transaction
> TRAP: failed Assert("!IsTransactionOrTransactionBlock()"), File: "pgstat.c", Line: 591, PID: 1713734
>
> I haven't looked deeper yet, but it seems that we see two issues here (and
> Assert is not directly caused by the patches set.)

I have not been able to reproduce this one.

--
Heikki Linnakangas
Neon (https://neon.tech)

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Joe Conway 2023-12-03 14:53:49 Re: Emitting JSON to file using COPY TO
Previous Message Dmitry Dolgov 2023-12-03 14:27:47 Re: [RFC] Clang plugin for catching suspicious typedef casting