From: | Simon Riggs <simon(at)2ndQuadrant(dot)com> |
---|---|
To: | Craig Ringer <craig(at)2ndquadrant(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de> |
Subject: | Re: Logical decoding slots can go backwards when used from SQL, docs are wrong |
Date: | 2016-03-11 09:00:57 |
Message-ID: | CANP8+jL4SVA7OVKL-FkH6T18O+5eAtY1MoAuLjZnj8OAXJp6XQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 11 March 2016 at 08:19, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> Hi all
>
> I think I found a couple of logical decoding issues while writing tests
> for failover slots.
>
> Despite the docs' claim that a logical slot will replay data "exactly
> once", a slot's confirmed_lsn can go backwards and the SQL functions can
> replay the same data more than once.We don't mark a slot as dirty if only
> its confirmed_lsn is advanced, so it isn't flushed to disk. For failover
> slots this means it also doesn't get replicated via WAL. After a master
> crash, or for failover slots after a promote event, the confirmed_lsn will
> go backwards. Users of the SQL interface must keep track of the safely
> locally flushed slot position themselves and throw the repeated data away.
> Unlike with the walsender protocol it has no way to ask the server to skip
> that data.
>
Yeh, I read that and thought it was wrong. "At least once, in order" is
what we want, how it works and how it should be described.
> Worse, because we don't dirty the slot even a *clean shutdown* causes slot
> confirmed_lsn to go backwards. That's a bug IMO. We should force a flush of
> all slots at the shutdown checkpoint, whether dirty or not, to address it.
>
Given the earlier understanding, above, I'm not sure I see why that must
happen. But I agree it should happen.
> Barring objections I intend to submit a fix to:
>
> - Document that slots can replay data more than once
> - Force flush of all slots on a shutdown checkpoint
>
+1
> Also, pg_logical_slot_get_changes and its _peek_ variant should have a
> param specifying the starting LSN to read and return. If this is lower than
> the restart_lsn but non-null it should ERROR; if it's greater than or equal
> it should use this position instead of starting at the confirmed_lsn.
>
Maybe. I don't really like changing APIs. Can we add new funcs? Make sure a
NULL LSN can be passed as well.
Is the return type of pg_logical_slot_peek_changes() incorrect in the docs?
> Time permitting I'd like to add a pg_logical_slot_confirm function, so you
> can aternate _peek_changes and _confirm, making it possible to get
> walsender-interface-like behaviour via the SQL interface.
>
I thought thats what replorigins do.
> Right now you can't reliably use the SQL interface because _get_changes
> can succeed locally and advance the slot but the client can fail to receive
> all the changes due to network issues etc. Sure, the SQL interface is meant
> mainly for testing, but especially for !postgresql downstreams I strongly
> suspect people will want to use it for "real" work rather than have to
> modify each client driver to support replication protocol extensions.
>
--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From | Date | Subject | |
---|---|---|---|
Next Message | Oleg Bartunov | 2016-03-11 09:19:16 | Re: The plan for FDW-based sharding |
Previous Message | Daniel Gustafsson | 2016-03-11 08:40:23 | Re: Refectoring of receivelog.c |