Re: src/test/subscription/t/002_types.pl hanging on particular environment

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: src/test/subscription/t/002_types.pl hanging on particular environment
Date: 2017-09-20 04:52:17
Message-ID: CAMsr+YF2TkbifMwxCE4DxMtTN=B5M_uHBLehBW6C4o_xu9yWKQ@mail.gmail.com
Lists: pgsql-hackers

On 20 September 2017 at 12:06, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> On Wed, Sep 20, 2017 at 9:23 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Craig Ringer <craig(at)2ndquadrant(dot)com> writes:
> >> On 19 September 2017 at 18:04, Petr Jelinek <
> petr(dot)jelinek(at)2ndquadrant(dot)com>
> >> wrote:
> >>> If you are asking why they are not identified by the
> >>> BackgroundWorkerHandle, then it's because it's private struct and can't
> >>> be shared with other processes so there is no way to link the logical
> >>> worker info with bgworker directly.
> >
> >> I really want BackgroundWorkerHandle to be public, strong +1 from me.
> >
> > I'm confused about what you think that would accomplish. AFAICS, the
> > point of BackgroundWorkerHandle is to allow the creator/requestor of
> > a bgworker to verify whether or not the slot in question is still
> > "owned" by its request.
> >
>
> Right, but can we avoid maintaining additional information (in_use,
> generation,..) in LogicalRepWorker which is similar to bgworker worker
> machinery (which in turn can also avoid all the housekeeping for those
> variables) if we have access to BackgroundWorkerHandle?
>

As far as I can see, probably not.

Because struct BackgroundWorker only has a single by-value Datum argument,
you can't pass much more information to a bgworker than "here's a slot
number, go look up what you need to configure yourself". That can be a ToC
position for a shm_mq, or an index into some kind of shmem array, but you
need something; you can't pass a pointer because of ASLR and EXEC_BACKEND.
And if the launcher lives significantly longer than its workers, so that a
worker dying isn't fatal to the launcher, it needs to be able to re-use
slots in the fixed-size shmem array the workers use to configure
themselves. That means some kind of generation counter, like the one in
BackgroundWorkerHandle, plus an in_use flag.
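
To make that concrete, here's a rough sketch of the pattern I mean, with
hypothetical names (not the actual launcher.c/worker.c code, just the shape
of it): the Datum carries nothing but a slot index, and everything else,
including the bookkeeping that makes slot re-use detectable, lives in the
fixed-size shmem array.

#include "postgres.h"

#include "miscadmin.h"
#include "postmaster/bgworker.h"
#include "storage/proc.h"

/* Hypothetical slot layout; the real LogicalRepWorker carries more fields. */
typedef struct MyWorkerSlot
{
    bool        in_use;         /* is this slot currently assigned? */
    uint16      generation;     /* bumped every time the slot is re-used */
    PGPROC     *proc;           /* set by the worker once it has attached */
    Oid         dbid;           /* whatever the worker needs to set itself up */
} MyWorkerSlot;

extern MyWorkerSlot *MySlotArray;   /* fixed-size array in shared memory */

/* Launcher side: the only thing we can pass by value is the slot index. */
static bool
launch_my_worker(int slotno, BackgroundWorkerHandle **handle)
{
    BackgroundWorker bgw;

    memset(&bgw, 0, sizeof(bgw));
    bgw.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
    bgw.bgw_start_time = BgWorkerStart_RecoveryFinished;
    snprintf(bgw.bgw_library_name, BGW_MAXLEN, "postgres");
    snprintf(bgw.bgw_function_name, BGW_MAXLEN, "my_worker_main");
    snprintf(bgw.bgw_name, BGW_MAXLEN, "my worker for slot %d", slotno);
    bgw.bgw_notify_pid = MyProcPid;
    bgw.bgw_main_arg = Int32GetDatum(slotno);   /* no pointers: ASLR/EXEC_BACKEND */

    return RegisterDynamicBackgroundWorker(&bgw, handle);
}

/* Worker side: recover everything else from shmem using that index. */
void
my_worker_main(Datum main_arg)
{
    int         slotno = DatumGetInt32(main_arg);
    MyWorkerSlot *slot = &MySlotArray[slotno];

    /*
     * Check slot->in_use and slot->generation, publish slot->proc = MyProc,
     * then configure from the remaining fields.
     */
}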

Knowing that a logical rep worker's bgworker has started isn't enough.
WaitForReplicationWorkerAttach needs to know it has come up successfully as
a logical replication worker. To do that, it needs to know that the
LogicalRepWorker entry it's looking at isn't left over from some prior
logical replication worker that happened to get the same LogicalRepWorker
slot, though not necessarily the same BackgroundWorker slot. There's no 1:1
mapping of LogicalRepWorker slot to BackgroundWorker slot to rely on.
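
Roughly the kind of check that's needed, reusing the hypothetical slot from
the sketch above (illustrative only, not the real
WaitForReplicationWorkerAttach):

/*
 * Wait for the worker launched for this slot to attach, or give up if the
 * slot got recycled for someone else or the process died before attaching.
 * "generation" is the value the launcher stamped on the slot at launch
 * time. A real version would hold the appropriate LWLock while reading the
 * slot and wait on a latch instead of polling.
 */
static bool
wait_for_my_worker_attach(MyWorkerSlot *slot, uint16 generation,
                          BackgroundWorkerHandle *handle)
{
    for (;;)
    {
        BgwHandleStatus status;
        pid_t       pid;

        /* Slot freed or re-used for a different worker: ours is gone. */
        if (!slot->in_use || slot->generation != generation)
            return false;

        /* Our worker has attached and published its PGPROC: success. */
        if (slot->proc != NULL)
            return true;

        /* The bgworker machinery says the process stopped before attaching. */
        status = GetBackgroundWorkerPid(handle, &pid);
        if (status == BGWH_STOPPED || status == BGWH_POSTMASTER_DIED)
            return false;

        pg_usleep(10 * 1000L);  /* 10ms between checks */
    }
}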

So unless we're willing to extend struct BackgroundWorker with some kind of
union that can store extra info for logical rep workers and whatever else
we might later want, I don't see LogicalRepWorker going away. And if we did
that, it'd make extending the shmem state for logical replication workers
harder, since any growth affects the entry for every background worker, not
just the logical replication ones.
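
For the sake of argument, that extension would look something like this
(purely hypothetical; it also shows the cost, since every bgworker slot
pays for the largest union member whether it uses it or not):

/* Purely hypothetical: per-kind payload folded into struct BackgroundWorker. */
typedef union BackgroundWorkerKindData
{
    struct
    {
        Oid         subid;      /* subscription this worker serves */
        Oid         relid;      /* table being synced, or InvalidOid */
    }           logicalrep;

    char        reserved[128];  /* room for whatever else we later want */
} BackgroundWorkerKindData;

/*
 * struct BackgroundWorker would then grow a member of this type, and with
 * it every entry in the postmaster's shared bgworker slot array.
 */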

Parallel query gets away without most of this because, as far as I can see,
parallel workers don't need to enumerate each other or look up each other's
state, which logical rep needs to do for things like initial table sync.
Nor does it appear to re-launch workers unless it has a clean slate and can
clobber the entire parallel DSM context. So it doesn't need to worry about
whether the worker it's talking to is the one it just launched, or some old
worker that hasn't gone away yet. The DSM segment acts as a generation
counter of sorts, with all workers in the same generation.

If we wanted to do away with LogicalRepWorker shmem slots, I suspect we'd
have to broker all inter-worker communication via shm_mq pairs, where
worker A asks the launcher for the status of worker B, or asks it to block
until worker B does X, or whatever: do everything over shm_mq messaging
rather than shmem variables. That's a pretty major change, and it makes the
launcher responsible for all synchronisation and IPC. It also wouldn't work
properly unless we nuked the DSM segment containing all the queues whenever
any worker died and restarted the whole lot, which would be terribly costly
in terms of restarting transaction replays etc. Or we'd have to keep some
kind of per-shm_mq-pair "in use" flag so we knew when to nuke and reset the
queue pair for a new worker to connect, at which point we're halfway back
to where we started...
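
To give an idea of what brokering everything through the launcher would
look like, here's a minimal sketch of one such request over a pair of
shm_mq queues (assuming the queues have already been created in a DSM
segment and attached on both sides; the message format and names are
invented):

#include "postgres.h"

#include "storage/shm_mq.h"

/* Invented wire format for a worker-to-launcher request. */
typedef struct WorkerStatusRequest
{
    char        kind;           /* 'S' = "tell me about worker N" */
    int         target;         /* which worker the sender is asking about */
} WorkerStatusRequest;

/* Worker A: ask the launcher about worker "target" and wait for the reply. */
static bool
ask_launcher_about(shm_mq_handle *outq, shm_mq_handle *inq, int target)
{
    WorkerStatusRequest req;
    shm_mq_result res;
    Size        nbytes;
    void       *data;

    req.kind = 'S';
    req.target = target;

    /* Blocking send; returns SHM_MQ_DETACHED if the launcher went away. */
    res = shm_mq_send(outq, sizeof(req), &req, false);
    if (res != SHM_MQ_SUCCESS)
        return false;

    /* Blocking receive of the launcher's reply on the companion queue. */
    res = shm_mq_receive(inq, &nbytes, &data, false);
    if (res != SHM_MQ_SUCCESS)
        return false;

    /* ... decode and act on the reply ... */
    return true;
}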

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
