Re: logical replication empty transactions

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Euler Taveira <euler(at)timbira(dot)com(dot)br>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: logical replication empty transactions
Date: 2020-03-13 06:39:43
Message-ID: CAMsr+YE3o8Dt890Q8wTooY2MpN0JvdHqUAHYL-LNhBryXOPaKg@mail.gmail.com
Lists: pgsql-hackers

On Tue, 10 Mar 2020 at 02:30, Andres Freund <andres(at)anarazel(dot)de> wrote:

> Hi,
>
> On 2020-03-06 13:53:02 +0800, Craig Ringer wrote:
> > On Mon, 2 Mar 2020 at 19:26, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > > One thing that is not clear to me is how will we advance restart_lsn
> > > if we don't send any empty xact in a system where there are many such
> > > xacts?
> >
> > Same way we already do it for writes that are not replicated over
> > logical replication, like vacuum work etc. The upstream sends feedback
> > with reply-requested. The downstream replies. The upstream advances
> > confirmed_flush_lsn, and that lazily updates restart_lsn.
>
> It'll still delay it a bit.
>

Right, but we don't generally care because there's no sync rep txn waiting
for confirmation. If we lose progress due to a crash it doesn't matter. It
does delay removal of old WAL a little, but it hardly matters.

> Somewhat independent from the issue at hand: It'd be really good if we
> could evolve the syncrep framework to support per-database waiting... It
> shouldn't be that hard, and the current situation sucks quite a bit (and
> yes, I'm to blame).
>

Hardly, you just didn't get the chance to fix that on top of the umpteen
other things you had to change to make all the logical stuff work. You
didn't break it, just didn't implement every single possible enhancement
all at once. Shocking, I tell you.

> I'm not quite sure what you mean by "poke the walsender"? Kinda sounds
> like sending a signal, but decoding happens inside after the walsender,
> so there's no need for that. Do you just mean somehow requesting that
> walsender sends a feedback message?
>

Right. I had in mind something like sending a ProcSignal via our funky
multiplexed signal mechanism to ask the walsender to immediately generate a
keepalive message with a reply-requested flag, then set the walsender's
latch so we wake it promptly.
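
To make that concrete, here is a rough standalone model of the idea, not
walsender code. Every name in it (keepalive_requested, latch_is_set,
model_request_keepalive, model_walsender_pass) is invented for illustration;
the only point is that the handler just records the request and wakes the
loop, and the loop turns any number of pending requests into a single
reply-requested keepalive.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flags; a real walsender would keep these next to its other
 * signal-driven flags and use its process latch rather than a plain int. */
static volatile sig_atomic_t keepalive_requested = 0;
static volatile sig_atomic_t latch_is_set = 0;

/* Stand-in for the signal handler: record the request, wake the main loop. */
void model_request_keepalive(void)
{
    keepalive_requested = 1;
    latch_is_set = 1;
}

/*
 * One pass of a modelled walsender main loop. Returns true and bumps
 * *keepalives_sent if a reply-requested keepalive would be sent on this pass.
 */
bool model_walsender_pass(int *keepalives_sent)
{
    assert(keepalives_sent != NULL);

    latch_is_set = 0;               /* analogous to resetting the latch */

    if (keepalive_requested)
    {
        keepalive_requested = 0;    /* multiple requests coalesce into one */
        (*keepalives_sent)++;
        return true;
    }
    return false;
}
```

Since the flag is only a boolean, a burst of requests between two loop passes
costs one keepalive, which seems like the behaviour we'd want anyway.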

> To address the volume we could:
>
> 1a) Introduce a pgoutput message type to indicate that the LSN has
> advanced, without needing separate BEGIN/COMMIT. Right now BEGIN is
> 21 bytes, COMMIT is 26. But we really don't need that much here. A
> single message should do the trick.
>

It would. Is it worth caring though? Especially since it seems rather
unlikely that the actual network data volume of begin/commit msgs will be
much of a concern. It's not like we're PITRing logical streams, and if we
did, we could just filter out empty commits on the receiver side.

That message pretty much already exists in the form of a walsender
keepalive anyway so we might as well re-use that and not upset the protocol.
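
For a rough sense of scale, the arithmetic can be sketched as below. This is a
standalone model, not PostgreSQL code: the field widths follow the documented
pgoutput Begin/Commit and primary keepalive message layouts, while the enum
and function names are made up for illustration. It deliberately leaves out
the XLogData header that wraps each pgoutput message and the CopyData framing
around everything, so the real per-txn cost of BEGIN/COMMIT is somewhat higher
than shown.

```c
#include <assert.h>
#include <stdint.h>

/* Payload sizes in bytes; names are hypothetical, widths are from the
 * documented message formats. */
enum
{
    MODEL_BEGIN_SIZE  = 1   /* 'B' */
                      + 8   /* final LSN */
                      + 8   /* commit timestamp */
                      + 4,  /* xid */
    MODEL_COMMIT_SIZE = 1   /* 'C' */
                      + 1   /* flags */
                      + 8   /* commit LSN */
                      + 8   /* end LSN */
                      + 8,  /* commit timestamp */
    MODEL_KEEPALIVE_SIZE = 1   /* 'k' */
                         + 8   /* current end of WAL */
                         + 8   /* server clock */
                         + 1   /* reply requested */
};

/* These match the 21 and 26 bytes quoted above. */
static_assert(MODEL_BEGIN_SIZE == 21, "BEGIN payload is 21 bytes");
static_assert(MODEL_COMMIT_SIZE == 26, "COMMIT payload is 26 bytes");

/* Payload bytes for n empty transactions sent as BEGIN/COMMIT pairs. */
uint64_t model_empty_xact_bytes(uint64_t n)
{
    return n * (uint64_t) (MODEL_BEGIN_SIZE + MODEL_COMMIT_SIZE);
}

/* Payload bytes if each of those n transactions became one keepalive. */
uint64_t model_keepalive_bytes(uint64_t n)
{
    return n * (uint64_t) MODEL_KEEPALIVE_SIZE;
}
```

So the payload is 47 bytes per empty txn versus 18 per keepalive, and in
practice far fewer keepalives than empty txns would need to be sent.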

> 1b) Add a LogicalOutputPluginWriterUpdateProgress parameter (and
> possibly rename) that indicates that we are intentionally "ignoring"
> WAL. For walsender that callback then could check if it could just
> forward the position of the client (if it was entirely caught up
> before), or if it should send a feedback request (if syncrep is
> enabled, or distance is big).
>

I can see something like that being very useful, because at present only
the output plugin knows if a txn is "empty" as far as that particular slot
and output plugin is concerned. The reorder buffering mechanism cannot do
relation-level filtering before it sends the changes to the output plugin
during ReorderBufferCommit, since it only knows about relfilenodes, not
relation oids. And the output plugin might be doing finer-grained filtering
using row-filter expressions or who knows what else.

But as described above, that will only help for txns done in DBs other than
the one the logical slot is for, or txns known to have an empty
ReorderBuffer when the commit is seen.

If there's a txn in the slot's db with a non-empty reorderbuffer, the
output plugin won't know if the txn is empty or not until it finishes
processing all callbacks and sees the commit for the txn. So it will
generally have emitted the Begin message on the wire by the time it knows
it has nothing useful to say. And Pg won't know that this txn is empty as
far as this output plugin with this particular slot, set of output plugin
params, and current user-catalog state is concerned, so it won't have any
way to call the output plugin's "update progress" callback instead of the
usual begin/change/commit callbacks.

But I think we can already skip empty txns unless sync-rep is enabled with
no core changes, and send empty txns as walsender keepalives instead, by
altering only output plugins, like this:

* Stash the BEGIN data in the plugin's
  LogicalDecodingContext.output_plugin_private when the plugin's begin
  callback is called; don't write anything to the outstream.
* Write out the BEGIN message lazily when any other callback generates a
  message that does need to be written out.
* If no BEGIN has been written by the time the COMMIT callback is called,
  discard the COMMIT too. Check whether sync rep is enabled. If it is, call
  LogicalDecodingContext.update_progress from within the output plugin's
  commit handler, probably by calling OutputPluginUpdateProgress();
  otherwise just ignore the commit entirely.
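
The steps above can be modelled in a few lines of standalone C. To be clear,
this is only a sketch of the control flow, not an output plugin: it doesn't
include any PostgreSQL headers, and LazyTxnState, FakeStream and the model_*
functions are all hypothetical. LazyTxnState stands in for whatever the plugin
would hang off output_plugin_private, and the progress_updates counter stands
in for a call to OutputPluginUpdateProgress().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-transaction state kept by the plugin. */
typedef struct LazyTxnState
{
    uint32_t xid;          /* stashed BEGIN data */
    uint64_t final_lsn;    /* stashed BEGIN data */
    bool     begin_sent;   /* has BEGIN reached the output stream yet? */
} LazyTxnState;

/* Counters standing in for the output stream and the progress callback. */
typedef struct FakeStream
{
    int messages_written;
    int progress_updates;
} FakeStream;

/* Begin callback: stash the data, write nothing. */
void model_begin(LazyTxnState *st, uint32_t xid, uint64_t final_lsn)
{
    assert(st != NULL);
    st->xid = xid;
    st->final_lsn = final_lsn;
    st->begin_sent = false;
}

/* Change callback: emit the deferred BEGIN only if the change is kept. */
void model_change(LazyTxnState *st, FakeStream *out, bool passes_filter)
{
    if (!passes_filter)
        return;                    /* filtered out, still nothing to say */
    if (!st->begin_sent)
    {
        out->messages_written++;   /* the lazily written BEGIN */
        st->begin_sent = true;
    }
    out->messages_written++;       /* the change itself */
}

/* Commit callback: send COMMIT only if a BEGIN went out. */
void model_commit(LazyTxnState *st, FakeStream *out, bool sync_rep_enabled)
{
    if (st->begin_sent)
    {
        out->messages_written++;   /* normal COMMIT */
        return;
    }
    if (sync_rep_enabled)
        out->progress_updates++;   /* empty txn: report progress instead */
    /* otherwise the empty txn is dropped silently */
}
```

An empty txn thus costs nothing on the wire unless sync rep is enabled, in
which case it costs one progress update rather than a BEGIN/COMMIT pair.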

> We could e.g. have a new LogicalDecodingContext callback that is
> called whenever WalSndWaitForWal() would wait. That'd check if there's
> a pending "need" to send out a 'empty transaction'/feedback request
> message. The "need" flag would get cleared whenever we send out data
> bearing an LSN for other reasons.
>

I can see that being handy, yes. But it won't necessarily help with the
sync rep issue, since other sync rep txns may continue to generate WAL
while others wait for commit-confirmations that won't come from the logical
replica.

While we're speaking of adding output plugin hooks, I keep on trying to
think of a sensible way to do a plugin-defined reply handler, so the
downstream end can send COPY BOTH messages of some new msgkind back to the
walsender, which will pass them to the output plugin if it implements the
appropriate handle_reply_message (or whatever) callback. That much is
trivial to implement. Where I keep getting a bit stuck is whether
there's a sensible snapshot that can be set to call the output plugin reply
handler with. We wouldn't want to switch to a current non-historic snapshot
because of all the cache flushes that'd cause, but there isn't necessarily
a valid and safe historic snapshot to set when we're not within
ReorderBufferCommit, is there?

I'd love to get rid of the need to "connect back" to a provider over plain
libpq connections to communicate with it. The ability to run SQL on the
walsender conn helps. But really, so much more would be possible if we
could just have the downstream end *reply* on the same connection using
COPY BOTH, much like it sends replay progress updates right now. It'd let
us manage relation/attribute/type metadata caches better for example.

Thoughts?

--
Craig Ringer http://www.2ndQuadrant.com/
2ndQuadrant - PostgreSQL Solutions for the Enterprise
