Proposals for making it easier to write correct bgworkers

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Proposals for making it easier to write correct bgworkers
Date: 2020-09-10 03:02:07
Message-ID: CAMsr+YFYyhtxLJCg05R4EJk3JEpw77yT9E_L9awi2p4c7JD+eQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi all

As I've gained experience working on background workers, it's become
increasingly clear that they're a bit too different to normal backends for
many nontrivial uses.

I thought I'd take a moment to note some of it here, along with some
proposals for things we could potentially do to make it much easier to use
bgworkers correctly especially when using them to run queries.

This is NOT A PATCH SET. It's a set of discussion proposals and it's also
intended as a bit of a helper for people just getting started on bgworkers.
There are a lot of subtle differences in the runtime environment a basic
bgworker provides vs the runtime environment extension authors will be used
to when writing fmgr-callable C functions.

(It looks like pg12 and pg13 have some improvements, so some of the issues
I was going to mention with error cleanup paths and locking aren't relevant
anymore.)

DIFFERENCES WHEN CODING FOR BGWORKERS
----

Some of the most important differences vs normal backends that I've noticed
are:

* There's no supported way to recover from an error and continue execution.
Each bgworker has to roll its own logic based on PostgresMain()'s
sigsetjmp() handler and maintain it.
* The bgworker infrastructure doesn't set up the the application_name in
PGPROC
* Example bgworkers roll their own signal handlers rather than using die,
procsignal_sigusr1_handler, and PostgresSigHupHandler. This makes them seem
to work OK, but fail to respect CHECK_FOR_INTERRUPTS() and other expected
behaviour when they call into the executor etc.
* Each bgworker has to roll its own mainloop with ConfigReloadPending /
ProcessConfigFile() etc, and this isn't illustrated at all in the examples
* Example workers don't interact with the pgstat subsystem

Managing bgworker lifecycle is also quite difficult. The postmaster doesn't
offer any help and we don't expose much help from core. Any extension that
spawns workers needs to do its own handling of worker startup registration
in shmem, worker exit reaping, handling workers that exit before
registering, etc. This can be seen in the logical replication code, which
rolls its own management per src/backend/replication/logical/launcher.c .

PROPOSED CHANGES IN EXAMPLE BGWORKERS/CONTRIBS
-----

So I'd like to make these changes to example bgworkers and to contrib
extensions:

* Change example bgworkers in contrib to set up normal default signal
handlers, not define their own
* Add ConfigReloadPending and ProcessConfigFile() to all example bgworkers
* Show pgstat use in example bgworkers

PROPOSED BGW SETUP AND MAINLOOP HELPERS
----

and I'm wondering if anyone thinks it's a good idea to add some bgworker
mainloop helper API, where one call is made at the start of the bgw, before
BackgroundWorkerInitializeConnection(), to:

* set application_name
* set process title
* set up default signal handlers and unblock signals
* creates and assigns a BGWMemoryContext named after the bgw
* set state to idle in pgstat
* sets up MyProcPort (or teach InitPostgres to do so for a bgw)
* takes a function pointer for a before_shmem_exit cleanup handler
to encourage ext authors to use this approach

then a mainloop helper that:

* Checks that CurrentMemoryContext == BGWMemoryContext
* Runs CHECK_FOR_INTERRUPTS()
* Checks for postmaster death and exits
* Checks for and performs config file reload

PROPOSED ERROR HANDLING HELPER
----

It'd also be beneficial to have a helper for bgw mainloops with error
recovery, generalizing and extracting the pattern I see in all these
routines:

src/backend/postmaster/autovacuum.c=AutoVacLauncherMain
src/backend/postmaster/autovacuum.c=AutoVacWorkerMain
src/backend/postmaster/bgworker.c=StartBackgroundWorker
src/backend/postmaster/bgwriter.c=BackgroundWriterMain
src/backend/postmaster/checkpointer.c=CheckpointerMain
src/backend/postmaster/walwriter.c=WalWriterMain
src/backend/tcop/postgres.c=PostgresMain

which all do their own error reset loops. There's way too much copy/paste
going on there IMO, even before we consider bgworkers, and it's nearly
impossible for anyone who isn't quite an experienced Pg developer to write
a bgw that can do anything with errors except promote ERROR to FATAL and
die.

This would be modeled on PosgresMain(). The worker would only need to add
its own logic to the sigsetjmp() error jump path, not duplicate all the
core cleanup work.

PROPOSED GENERALISED WORKER MANAGEMENT
----

Finally I'm wondering if there's any interest in generalizing the logical
rep worker management for other bgworkers. I've done a ton of work with
worker management and it's something I'm sure I could take on but I don't
want to write it without knowing there's some chance of acceptance.

The general idea is to provide a way for bgworkers to start up managers for
pools / sets of workers. They launch them and have a function they can call
in their mainloop that watches their child worker states, invoking
callbacks when they fail to launch, launch successfully, exit cleanly after
finishing their work, or die with an error. Workers are tracked in a shmem
seg where the start of the seg must be a key struct (akin to how the hash
API works). We would provide calls to look up a worker shmem struct by key,
signal a worker by key, wait for a worker to exit (up to timeout), etc.
Like in the logical rep code, access to the worker registration shmem would
be controlled by LWLock. The extension code can put whatever it wants in
the worker shmem entries after the key, including various unions or
whatever - the worker management API won't care.

This abstracts all the low level mess away from bgworker implementations
and lets them focus on writing the code they want to run.

I'd probably suggest doing so by extracting the logical rep worker
management, and making the logical rep code use the generalized worker
management. So it'd be proven, and have in core users.

PROPOSED XACT API WRAPPER
----

Additionally there are some more general postgres API patterns that the
team working on pglogical and BDR has landed up writing wrappers for or
compensating for. One of the most important is transaction management. In
many places in pglogical where we start lightweight transactions in
pglogical we use a wrapper that starts the txn if there isn't already one
open and remembers whether one was started. Then the end helper closes it
only if the start helper opened it. Finally, the memory context is always
reset to the memory context before txn start once the txn ends, instead of
being left at TopMemoryContext, since we've learned that that's an
extremely common source of bugs.

I propose a wrapper for lightweight txn management, like the one we use in
many places in pglogical. This is not a copy of that code, it's just a
scratched out example of roughly how we do it:

LocalTransactionState txn;
StartTransactionIfNotStarted(&txn, LXACT_WANT_SNAPSHOT);
...
CommitTransactionCommandIfStarted(&txn);

where

static inline void
StartTransactionIfNotStarted(LocalTransactionState *txn, int flags)
{
txn->started = !TransactionInProgress();
txn->save_mctx = CurrentMemoryContext;
if (txn->started)
{
txn->pushed_snapshot = false;
StartTransactionCommand();
if (flags & LXACT_WANT_SNAPSHOT)
{
txn->pushed_snapshot = true;
PushActiveSnapshot(GetTransactionSnapshot());
}
}
}

static inline void
CommitTransactionIfNotStarted(LocalTransactionState *txn)
{
if (txn->started)
{
Assert(TransactionInProgress());
Assert(CurrentMemoryContext == TopTransactionContext);
if (txn->pushed_snapshot)
PopActiveSnapshot();
CommitTransactionCommand();
}
else
{
/* Caller didn't change context between start and end */
Assert(CurrentMemoryContext == txn->save_mctx);
}
(void) MemoryContextSwitchTo(txn->save_mctx);
}

--
Craig Ringer http://www.2ndQuadrant.com/
2ndQuadrant - PostgreSQL Solutions for the Enterprise

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2020-09-10 04:00:01 Re: Resetting spilled txn statistics in pg_stat_replication
Previous Message tsunakawa.takay@fujitsu.com 2020-09-10 02:56:37 RE: SIGQUIT handling, redux