Re: [HACKERS] logical decoding of two-phase transactions

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com>, David Steele <david(at)pgmasters(dot)net>, Andres Freund <andres(at)anarazel(dot)de>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Sokolov Yura <y(dot)sokolov(at)postgrespro(dot)ru>, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] logical decoding of two-phase transactions
Date: 2018-07-17 18:10:32
Message-ID: CA+Tgmoa8yDZH0i_58=o6WBESbyqij8m5hKYycV2sWzo2rX9fGg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 16, 2018 at 3:25 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> Oh, right, I forgot the patch also adds the leader into the group, for
> some reason (I agree it's unclear why that would be necessary, as you
> pointed out later).
>
> But all this is happening while holding the partition lock (in exclusive
> mode). And the decoding backends do synchronize on the lock correctly
> (although, man, the rechecks are confusing ...).
>
> But now I see ProcKill accesses decodeGroupLeader in multiple places,
> and only the first one is protected by the lock, for some reason
> (interestingly enough the one in lockGroupLeader block). Is that what
> you mean?

I haven't traced out the control flow completely, but it sure looks to
me like there are places where decodeGroupLeader is checked without
holding any LWLock at all. Also, it looks to me like in some places
(like where we're trying to find a PGPROC by XID) we use ProcArrayLock,
and in others -- I guess where we're checking the decodeGroupBlah
stuff -- we use the lock manager locks. I don't know how safe
that is, and there are not a lot of comments justifying it. I also
wonder why we're using the lock manager locks to protect the
decodeGroup stuff rather than backendLock.
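
For what it's worth, the invariant I'd want to see is simple to state. A
toy sketch (hypothetical names, pthreads standing in for LWLocks, nothing
like the actual patch): every reader and writer of decodeGroupLeader goes
through one agreed-upon lock, because guarding some accesses with one lock
and other accesses with a different lock provides no mutual exclusion at
all:

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical sketch, not the patch's actual code: a PGPROC-like
 * struct whose decodeGroupLeader field is guarded by exactly one lock.
 * The point is that *every* reader and writer must agree on which lock
 * guards the field; mixing ProcArrayLock for some accesses and the
 * lock-manager partition locks for others gives no exclusion. */
typedef struct FakeProc
{
    pthread_mutex_t groupLock;              /* the one agreed-upon lock */
    struct FakeProc *decodeGroupLeader;
} FakeProc;

static void
fakeproc_init(FakeProc *proc)
{
    pthread_mutex_init(&proc->groupLock, NULL);
    proc->decodeGroupLeader = NULL;
}

/* Writer: always under groupLock. */
static void
set_decode_leader(FakeProc *proc, FakeProc *leader)
{
    pthread_mutex_lock(&proc->groupLock);
    proc->decodeGroupLeader = leader;
    pthread_mutex_unlock(&proc->groupLock);
}

/* Reader: must use the same lock, or the read is a data race. */
static FakeProc *
get_decode_leader(FakeProc *proc)
{
    FakeProc  *leader;

    pthread_mutex_lock(&proc->groupLock);
    leader = proc->decodeGroupLeader;
    pthread_mutex_unlock(&proc->groupLock);
    return leader;
}
```

A comment on each field saying which lock protects it, in the style of the
existing PGPROC comments, would go a long way here.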

> FWIW I suspect the ProcKill part is borked due to incorrectly resolved
> merge conflict or something, per my initial response from today.

Yeah, I wasn't seeing the code the way I thought you were describing it
in that response, but I'm dumb this week, so maybe I just
misunderstood.

>> I think that's probably not going to work out, but of course it's up
>> to you how you want to spend your time!
>
> Well, yeah. I'm sure I could think of more fun things to do, but OTOH I
> also have patches that require the capability to decode in-progress
> transactions.

It's not a matter of fun; it's a matter of whether it can be made to
work. Don't get me wrong -- I want the ability to decode in-progress
transactions. I complained about that aspect of the design to Andres
when I was reviewing and committing logical slots & logical decoding,
and I complained about it probably more than I complained about any
other aspect of it, largely because it instantaneously generates a
large lag when a bulk load commits. But not liking something about
the way things are is not the same as knowing how to make them better.
I believe there is a way to make it work because I believe there's a
way to make anything work. But I suspect that it's at least one order
of magnitude more complex than this patch currently is, and likely an
altogether different design.

> But the way I understand it, it pretty much *is* a list of waiters,
> along with a couple of flags to allow the processes to notify the other
> side about lock/unlock/abort. It does resemble the lock groups, but I
> don't think it has the same goals.

So the parts that aren't relevant shouldn't be copied over.

>> That having been said, I still don't see how that's really going to
>> work. Just to take one example, suppose that the leader is trying to
>> ERROR out, and the decoding workers are blocked waiting for a lock
>> held by the leader. The system has no way of detecting this deadlock
>> and resolving it automatically, which certainly seems unacceptable.
>> The only way that's going to work is if the leader waits for the
>> worker by trying to acquire a lock held by the worker. Then the
>> deadlock detector would know to abort some transaction. But that
>> doesn't really work either - the deadlock was created by the
>> foreground process trying to abort, and if the deadlock detector
>> chooses that process as its victim, what then? We're already trying
>> to abort, and the abort code isn't supposed to throw further errors,
>> or fail in any way, lest we break all kinds of other things. Not to
>> mention the fact that running the deadlock detector in the abort path
>> isn't really safe to begin with, again because we can't throw errors
>> when we're already in an abort path.
>
> Fair point, not sure. I'll leave this up to Nikhil.
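
(To make the quoted scenario concrete, here's a toy wait-for-graph sketch
-- plain C with invented names; the real logic lives in deadlock.c and is
far more involved. The detector only finds a cycle when *both* waits show
up as edges in the graph, which is exactly what doesn't happen when the
leader's wait for the decoding worker isn't a lock acquisition:

```c
#include <stdbool.h>

/* Hypothetical illustration of the deadlock-visibility problem: the
 * deadlock detector works on a wait-for graph built from lock waits.
 * If the decoding worker waits on the leader (edge worker -> leader)
 * but the leader waits on the worker through some mechanism that is
 * not a lock wait, the reverse edge never enters the graph and the
 * cycle is invisible to the detector. */
#define MAX_PROCS 8

static bool waits_for[MAX_PROCS][MAX_PROCS];

/* Depth-first search: is there a path from 'from' back to 'target'? */
static bool
reaches(int from, int target, bool *visited)
{
    if (from == target)
        return true;
    if (visited[from])
        return false;
    visited[from] = true;
    for (int next = 0; next < MAX_PROCS; next++)
        if (waits_for[from][next] && reaches(next, target, visited))
            return true;
    return false;
}

/* A deadlock exists for 'proc' if following wait-for edges from the
 * processes it waits on leads back to 'proc' itself. */
static bool
deadlock_detected(int proc)
{
    bool        visited[MAX_PROCS] = {false};

    for (int next = 0; next < MAX_PROCS; next++)
        if (waits_for[proc][next] && reaches(next, proc, visited))
            return true;
    return false;
}
```

With only the worker -> leader edge present, deadlock_detected() returns
false for everyone, and both processes wait forever.)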

That's fine, but please understand that I think there's a basic design
flaw here that just can't be overcome with any amount of hacking on
the details here. I think we need a much higher-level consideration
of the problem here and probably a lot of new infrastructure to
support it. One idea might be to initially support decoding of
in-progress transactions only if they don't modify any catalog state.
That would leave out a bunch of cases we'd probably like to support,
such as CREATE TABLE + COPY in the same transaction, but it would
likely dodge a lot of really hard problems, too, and we could improve
things later.

One approach to the problem of catalog changes would be
to prevent catalog tuples from being removed even after transaction
abort until such time as there's no decoding in progress that might
care about them. That is not by itself sufficient, because a
transaction can abort after inserting a heap tuple but before
inserting the corresponding index tuple, and we can't look at the
catalog when it's in an inconsistent state like that and expect
reasonable results. But it
helps: for example, if you are decoding a transaction that has
inserted a WAL record with a cmin or cmax value of 4, and you know
that none of the catalog records created by that transaction can have
been pruned, then it should be safe to use a snapshot with CID 3 or
smaller to decode the catalogs. So consider a case like:

BEGIN;
CREATE TABLE blah ... -- command ID 0
COPY blah FROM '/tmp/blah' ... -- command ID 1

Once we see the COPY show up in the WAL, it should be safe to decode
the CREATE TABLE command and figure out what a snapshot with command
ID 0 can see (again, assuming we've suppressed pruning in the catalogs
in a sufficiently-well-considered way). Then, as long as the COPY
command doesn't do any DML via a trigger or a datatype input function
(!) or whatever, we should be able to use that snapshot to decode the
data inserted by COPY. I'm not quite sure what happens if the COPY
does do some DML or something like that -- we might have to stop
decoding until the following command begins in the live transaction,
or something like that. Or maybe we don't have to do that. I'm not
totally sure how the command counter is managed for catalog snapshots.

However it works in detail, we will get into trouble if we ever use a
catalog snapshot that can see a change that the live transaction is
still in the midst of making. Even with pruning prevented, we can
only count on the catalogs to be in a consistent state once the live
transaction has finished the command -- otherwise, for example, it
might have increased pg_class.relnatts but not yet added the
pg_attribute entry at the time it aborts, or something like that.

I'm blathering a little bit, but hopefully you get the point: I think the
way forward is for somebody to think carefully through how and under
what circumstances using a catalog snapshot can be made safe even if
an abort has occurred afterwards -- not trying to postpone the abort,
which I think is never going to be right.
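
To spell out the CID rule from the example above in pseudo-real code (a
sketch with invented names; the real visibility logic for decoding is
HeapTupleSatisfiesHistoricMVCC and is considerably subtler):

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy sketch with invented names; not PostgreSQL's actual visibility
 * code.  Assumption from above: catalog rows touched by the decoded
 * transaction are never pruned while decoding is in progress. */

typedef uint32_t CommandId;

#define InvalidCommandId ((CommandId) 0xFFFFFFFF)

/* A catalog row created (cmin) and possibly deleted (cmax) by the
 * transaction being decoded; InvalidCommandId means "not deleted". */
typedef struct CatalogRow
{
    CommandId   cmin;
    CommandId   cmax;
} CatalogRow;

/* Seeing a WAL record tagged with command id N proves that commands
 * 0 .. N-1 of the live transaction have finished, so a snapshot may
 * safely see their effects -- and nothing later. */
static CommandId
max_safe_cid(CommandId observed_cid)
{
    return observed_cid - 1;
}

/* Within the decoded transaction, a row is visible if it was created
 * by a command we may see and not deleted by one we may see. */
static bool
row_visible(const CatalogRow *row, CommandId max_safe)
{
    if (row->cmin > max_safe)
        return false;           /* created by a later, unfinished command */
    if (row->cmax != InvalidCommandId && row->cmax <= max_safe)
        return false;           /* already deleted */
    return true;
}
```

So in the BEGIN / CREATE TABLE / COPY example, once the first COPY record
(command id 1) appears in the WAL, max_safe_cid(1) is 0, and the pg_class
row created at command id 0 becomes safely decodable -- while anything the
still-running command has half-done stays invisible.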

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
