Re: Timeline following for logical slots

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Timeline following for logical slots
Date: 2016-03-15 13:04:12
Message-ID: CAMsr+YHtG_4RpJyWYbg-aSTcwHrBnOQLCt03a1Q9ipZ_HqwwNw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 15 March 2016 at 17:12, Andres Freund <andres(at)anarazel(dot)de> wrote:

> Hi
>

Thanks very much for the review.

This patch was split out from failover slots, which its self underwent
quite a few revisions, so I'm really happy to have fresh eyes on it.
Especially more experienced ones.

> On 2016-03-14 20:10:58 -0300, Alvaro Herrera wrote:
> > diff --git a/src/backend/access/transam/xlogreader.c
> b/src/backend/access/transam/xlogreader.c
> > index fcb0872..7b60b8f 100644
> > --- a/src/backend/access/transam/xlogreader.c
> > +++ b/src/backend/access/transam/xlogreader.c
> > @@ -10,9 +10,11 @@
> > *
> > * NOTES
> > * See xlogreader.h for more notes on this facility.
> > + *
> > + * This file is compiled as both front-end and backend code,
> so it
> > + * may not use ereport, server-defined static variables, etc.
> >
> *-------------------------------------------------------------------------
> > */
> > -
>
> Huh?
>

I'm not sure what the concern here is. xlogreader *is* compiled as frontend
code - the file gets linked into the tree for pg_xlogdump and pg_rewind, at
least.

I found that really confusing when working on it and thought it merited a
comment.

> > @@ -135,6 +142,10 @@ XLogReaderFree(XLogReaderState *state)
> > pfree(state->errormsg_buf);
> > if (state->readRecordBuf)
> > pfree(state->readRecordBuf);
> > +#ifndef FRONTEND
> > + if (state->timelineHistory)
> > + list_free_deep(state->timelineHistory);
> > +#endif
>
> Hm. So we don't support timelines following for frontend code, although
> it'd be rather helpful for pg_xlogdump. And possibly pg_rewind.
>

Yes, it would. I don't want to address that in the same patch though. It'd
require making timeline.c frontend-clean, dealing with the absence of List
on the frontend, etc, and I don't want to complicate this patch with that.

I've intentionally written the timeline logic so it can pretty easily be
moved into xlogreader.c as a self-contained unit and used for those
utilities once timeline.c can be compiled for frontend too.

> pfree(state->readBuf);
> > pfree(state);
> > }
> > @@ -208,10 +219,11 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr
> RecPtr, char **errormsg)
> >
> > if (RecPtr == InvalidXLogRecPtr)
> > {
> > + /* No explicit start point; read the record after the one
> we just read */
> > RecPtr = state->EndRecPtr;
> >
> > if (state->ReadRecPtr == InvalidXLogRecPtr)
> > - randAccess = true;
> > + randAccess = true; /* allow readPageTLI to go
> backwards */
>
> randAccess is doing more than that, so I'm doubtful that comment is an
> improvment.
>

Yeah, I have no idea what I was on about there, per response to Álvaro's
post.

> > @@ -466,9 +482,7 @@ err:
> > * Invalidate the xlog page we've cached. We might read from a
> different
> > * source after failure.
> > */
> > - state->readSegNo = 0;
> > - state->readOff = 0;
> > - state->readLen = 0;
> > + XLogReaderInvalCache(state);
>
> I don't think that "cache" is the right way to describe this.
>

Isn't that what it is? It reads a page, caches it, and reuses it for
subsequent requests on the same page. The pre-existing comment even calls
it a cache above.

I don't mind changing it, but don't have any better ideas.

> > #include <unistd.h>
> >
> > -#include "miscadmin.h"
> > -
>
> spurious change imo.
>

Added in Álvaro's rev; it puts the header in the correct sort order, but
I'm not sure it should be bundled with this patch.

> > - if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo))
> > + /* Do we need to open a new xlog segment? */
> > + if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo) ||
> > + sendTLI != tli)
> > {
>
> s/open a new/open a different/? New imo has connotations that we don't
> really want here.
>

In my original patch this was:

/* Do we need to switch to a new xlog segment? */

but yeah, "open a different" is better than either.

> > /* Need to seek in the file? */
> > if (sendOff != startoff)
> > {
> > if (lseek(sendFile, (off_t) startoff, SEEK_SET) <
> 0)
> > - {
> > - char path[MAXPGPATH];
> > -
> > - XLogFilePath(path, tli, sendSegNo);
> > -
> > ereport(ERROR,
> > (errcode_for_file_access(),
> > errmsg("could not seek in log segment %s
> to offset %u: %m",
> > - path, startoff)));
> > - }
> > + XLogFileNameP(tli,
> sendSegNo), startoff)));
> > sendOff = startoff;
> > }
>
> Not a serious issue, more a general remark: I'm doubtful that going for
> palloc in error situations is good practice. This will be allocated in
> the current memory context; without access to the emergency error
> reserves.
>

I was getting pretty confused, since I was sure I didn't write that. My
memory's bad enough that sometimes I go "huh, ok, guess I did"... but in
this case it wasn't in my patch so I think Álvaro's added it.

Agree that it's unrelated and probably better how it was.

> > +static void
> > +XLogReadDetermineTimeline(XLogReaderState *state)
> > +{
> > + /* Read the history on first time through */
> > + if (state->timelineHistory == NIL)
> > + state->timelineHistory =
> readTimeLineHistory(ThisTimeLineID);
> > +
> > + /*
> > + * Are we reading the record immediately following the one we read
> last
> > + * time? If not, then don't use the cached timeline info.
> > + */
> > + if (state->currRecPtr != state->EndRecPtr)
> > + {
> > + state->currTLI = 0;
> > + state->currTLIValidUntil = InvalidXLogRecPtr;
> > + }
>
>
> Hm. So we grow essentially a second version of the last end position and
> the randAccess stuff in XLogReadRecord().
>

Yeah, and in an earlier version of this patch that's where it lived.

I landed up moving it into its own self-contained function mainly because
the xlog read callback needs to be able to re-read the timeline info and
re-check the timeline end if the current timeline becomes historical while
it's waiting for new WAL, which could happen if it's a cascading standby
and its parent got promoted. Hence the call to XLogReadDetermineTimeline
from within read_local_xlog_page(...). That can't actually happen right now
since logical decoding can't be done on a standby yet, but I didn't want to
introduce new problems to fix for that when adding timeline following
support.

XLogReadRecord(...) could clear this info instead of doing it in
XLogReadDetermineTimeline(...), but I thought it made more sense to keep
use of that state local to XLogReadDetermineTimeline(...) rather than
scatter it.

> > + if (state->currTLI == 0)
> > + {
> > + /*
> > + * Something changed; work out what timeline this record
> is on. We
> > + * might read it from the segment on this TLI or, if the
> segment is
> > + * also contained by newer timelines, the copy from a
> newer TLI.
> > + */
> > + state->currTLI = tliOfPointInHistory(state->currRecPtr,
> > +
> state->timelineHistory);
> > +
> > + /*
> > + * Look for the most recent timeline that's on the same
> xlog segment
> > + * as this record, since that's the only one we can assume
> is still
> > + * readable.
> > + */
> > + while (state->currTLI != ThisTimeLineID &&
> > + state->currTLIValidUntil == InvalidXLogRecPtr)
> > + {
> > + XLogRecPtr tliSwitch;
> > + TimeLineID nextTLI;
> > +
> > + tliSwitch = tliSwitchPoint(state->currTLI,
> state->timelineHistory,
> > +
> &nextTLI);
> > +
> > + /* round ValidUntil down to start of seg
> containing the switch */
> > + state->currTLIValidUntil =
> > + ((tliSwitch / XLogSegSize) * XLogSegSize);
> > +
> > + if (state->currRecPtr >= state->currTLIValidUntil)
> > + {
> > + /*
> > + * The new currTLI ends on this WAL
> segment so check the next
> > + * TLI to see if it's the last one on the
> segment.
> > + *
> > + * If that's the current TLI we'll stop
> searching.
>
> I don't really understand how we're stopping searching here?
>

What I'm doing here is looking for the newest timeline that exists in this
WAL segment. Each time through we advance currTLI if we find that there's a
newer timeline on this segment then look again. The loop ends if we find
that the newest timeline on the segment is the current timeline being
written/replayed by the server (in which case we know that segment must
still exist and not have been renamed to .partial) or until we find that
currTLI's validity extends past this segment.

There is some explanation for this in the comments at the start of the
function since it affects the overall logic.

On reading it again I think that testing against state->currRecPtr is
confusing here. It relies on the fact that currTLIValidUntil has been
rounded down to the LSN of the start of the segment, so if the current
record pointer is greater than it we know the timeline ends somewhere on
this segment.

I guess it could be clearer (but verbose) to define an XLogSegNoFromLSN
macro then

if ( XLogSegNoFromLSN(state->currRecPtr) >=
XLogSegNoFromLSN(state->currTLIValidUntil))

but ... eh.

The alternative seems to be to search the timeline history info directly to
find the most recent timeline in the segment, starting from the current
timeline being read.

> > + */
> > + state->currTLI = nextTLI;
> > + state->currTLIValidUntil =
> InvalidXLogRecPtr;
> > + }
> > + }
> > +}
>
>
> XLogReadDetermineTimeline() doesn't sit quite right with me, I do wonder
> whether there's not a simpler way to write this.
>

If there is I'd be quite happy. It took me some time to figure out the
wrinkles here.

You can't do this where you'd expect, in XLogReadPage. Partly because it
gets used by frontend code too, as discussed above. Partly because the
xlogreader callback is responsible for waiting for new WAL rather than
XLogReadPage, and as noted above that would cause issues when a cascading
standby gets promoted while we're waiting for WAL. There's no way to pass
the wanted timeline to the callback anyway, but if that were the only issue
I'd have just added it (and did, in earlier versions of this patch).

Additionally, xlogreader and XLogReadPage is used by the physical
replication walsender which neither needs nor wants timeline following - it
expects to return failure when it runs out of data on the timeline instead,
so the client can manage the timeline switch after the CopyBoth data stream
ends.

It's not desirable to do the timeline switch for logical decoding at a
higher level, before calling into the walsender, because then it has to be
done separately in the SQL interface and walsender interface for logical
decoding, similarly to how the client does it for physical replication. I
actually did it this way in the proof of concept version and it works fine,
it's just more intrusive and ugly and duplicates logic in multiple places.

There's more to it than that but I'm tired after a long day. I'll try to
write that timelines readme after I review my notes so I can explain better.

As for the actual mechanism by which XLogReadDetermineTimeline operates,
the main thing I wonder is whether it can be usefully simplified by having
it directly scan the loaded timeline history and determine the last
timeline for a segment. I'm not convinced there's much of a way around
needing the rest of the logic.

> +/*
> > + * XLogPageReadCB callback for reading local xlog files
> > *
> > * Public because it would likely be very helpful for someone writing
> another
> > * output method outside walsender, e.g. in a bgworker.
> > *
> > - * TODO: The walsender has it's own version of this, but it relies on
> the
> > + * TODO: The walsender has its own version of this, but it relies on the
> > * walsender's latch being set whenever WAL is flushed. No such
> infrastructure
> > * exists for normal backends, so we have to do a check/sleep/repeat
> style of
> > * loop for now.
> > @@ -754,46 +897,88 @@ int
> > read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
> > int reqLen, XLogRecPtr targetRecPtr, char *cur_page, TimeLineID
> *pageTLI)
> > {
> > - XLogRecPtr flushptr,
> > + XLogRecPtr read_upto,
> > loc;
> > int count;
> >
> > loc = targetPagePtr + reqLen;
> > +
> > + /* Make sure enough xlog is available... */
> > while (1)
> > {
> > /*
> > - * TODO: we're going to have to do something more
> intelligent about
> > - * timelines on standbys. Use readTimeLineHistory() and
> > - * tliOfPointInHistory() to get the proper LSN? For now
> we'll catch
> > - * that case earlier, but the code and TODO is left in
> here for when
> > - * that changes.
> > + * Check which timeline to get the record from.
> > + *
> > + * We have to do it each time through the loop because if
> we're in
> > + * recovery as a cascading standby, the current timeline
> might've
> > + * become historical.
> > */
> > - if (!RecoveryInProgress())
> > + XLogReadDetermineTimeline(state);
> > +
> > + if (state->currTLI == ThisTimeLineID)
> > {
> > - *pageTLI = ThisTimeLineID;
> > - flushptr = GetFlushRecPtr();
> > + /*
> > + * We're reading from the current timeline so we
> might have to
> > + * wait for the desired record to be generated
> (or, for a standby,
> > + * received & replayed)
> > + */
> > + if (!RecoveryInProgress())
> > + {
> > + *pageTLI = ThisTimeLineID;
> > + read_upto = GetFlushRecPtr();
> > + }
> > + else
> > + read_upto = GetXLogReplayRecPtr(pageTLI);
> > +
> > + if (loc <= read_upto)
> > + break;
> > +
> > + CHECK_FOR_INTERRUPTS();
> > + pg_usleep(1000L);
> > }
> > else
> > - flushptr = GetXLogReplayRecPtr(pageTLI);
> > + {
> > + /*
> > + * We're on a historical timeline, so limit
> reading to the switch
> > + * point where we moved to the next timeline.
> > + */
> > + read_upto = state->currTLIValidUntil;
>
> Hm. Is it ok to not check GetFlushRecPtr/GetXLogReplayRecPtr() here? If
> so, how come?
>

We're reading from a segment where the newest timeline on that segment is
historical, i.e. the server has since replayed WAL from a newer timeline on
a more recent segment. We therefore know that there must be a complete
segment and we can't ever need to wait for new WAL.

An assertion that loc <= GetFlushRecPtr() wouldn't hurt.

> > - /* more than one block available */
> > - if (targetPagePtr + XLOG_BLCKSZ <= flushptr)
> > + if (targetPagePtr + XLOG_BLCKSZ <= read_upto)
> > + {
> > + /*
> > + * more than one block available; read only that block,
> have caller
> > + * come back if they need more.
> > + */
> > count = XLOG_BLCKSZ;
> > - /* not enough data there */
> > - else if (targetPagePtr + reqLen > flushptr)
> > + }
> > + else if (targetPagePtr + reqLen > read_upto)
> > + {
> > + /* not enough data there */
> > return -1;
> > - /* part of the page available */
> > + }
> > else
> > - count = flushptr - targetPagePtr;
> > + {
> > + /* enough bytes available to satisfy the request */
> > + count = read_upto - targetPagePtr;
> > + }
> >
> > - XLogRead(cur_page, *pageTLI, targetPagePtr, XLOG_BLCKSZ);
> > + XLogRead(cur_page, *pageTLI, targetPagePtr, count);
>
> When are we reading less than a page? That should afaik never be required.
>

Because pages are pre-allocated and zeroed it's safe to read from the last
page past the point we've actually written WAL to. But in that case why do
we bother determining 'count' in the first place, only to ignore it?

This is largely cosmetic TBH.

> + /*
> > + * We start reading xlog from the restart lsn, even though
> in
> > + * CreateDecodingContext we set the snapshot builder up
> using the
> > + * slot's candidate_restart_lsn. This means we might read
> xlog we
> > + * don't actually decode rows from, but the snapshot
> builder might
> > + * need it to get to a consistent point. The point we
> start returning
> > + * data to *users* at is the candidate restart lsn from
> the decoding
> > + * context.
> > + */
>
> Uh? Where are we using candidate_restart_lsn that way?

That's a major brain-fart - it should've been confirmed_flush. I'm glad you
caught that, since it took me some time to figure out why logical decoding
was trying to read WAL I hadn't asked for, but adding an explanatory
comment that's *wrong* sure doesn't help the next person.

I'm quite impressed and disturbed that I managed to write out as "candidate
restart lsn" instead of "confirmed lsn" in full as well. (Now, where were
my glasses, I swear I had them a minute ago... oh, they're on my face!)

The important bit is that where we start reading xlog is the restart_lsn of
the slot, even though we won't return anything we decoded to the user until
we reach confirmed_flush, which is looked up by CreateDecodingContext
completely independently of what pg_logical_slot_get_changes_guts does,
because we passed InvalidXLogRecPtr to CreateDecodingContext to tell it to
automagically look up the confirmed_lsn. Maybe it'd be clearer to just pass
the confirmed_lsn to CreateDecodingContext
from pg_logical_slot_get_changes_guts. When I was debugging this stuff I
found it pretty confusing that the LSN we started reading xlog at was
different to the LSN the user wants decoded changes from, hence the comment.

> @@ -299,6 +312,18 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo
> fcinfo, bool confirm, bool bin
> > CHECK_FOR_INTERRUPTS();
> > }
> >
> > + /* Make sure timeline lookups use the start of the next
> record */
> > + startptr = ctx->reader->EndRecPtr;
>
> Huh? startptr isn't used after this, so I'm not sure what this even
> means?
>

That snuck in from a prior revision where timeline following was done in
pg_logical_slot_get_changes_guts, before I moved it down into the
xlogreader callback so it could be shared between the walsender and SQL
interfaces for logical decoding. It's bogus.

> > + /*
> > + * The XLogReader will read a page past the valid end of
> WAL because
> > + * it doesn't know about timelines. When we switch
> timelines and ask
> > + * it for the first page on the new timeline it will think
> it has it
> > + * cached, but it'll have the old partial page and say it
> can't find
> > + * the next record. So flush the cache.
> > + */
> > + XLogReaderInvalCache(ctx->reader);
> > +
>
> dito.
>

Ditto ;)

I really should've caught that. The problem is I've spent way, way too long
staring at this code over too many revisions as I figured out what T.F. was
going on with timeline following and slots. That's exactly why I really
appreciate the fresh eyes.

> > + * Create a new logical slot, with invalid LSN and xid, directly. This
> does not
> > + * use the snapshot builder or logical decoding machinery. It's only
> intended
> > + * for creating a slot on a replica that mirrors the state of a slot on
> an
> > + * upstream master.
> > + *
> > + * You should immediately decoding_failover_advance_logical_slot(...) it
> > + * after creation.
> > + */
>
> Uh. I doubt we want this, even if it's formally located in
> src/test/modules. These comments make it appear not to be only intended
> for that, and I have serious doubts about the validity of the concept as
> is.
>

As I expressed to Álvaro, I don't really mind if this test module and the
associated round of t/ tests that rely on it get cut, so long as the first
round of t/ tests that use a filesystem level copy to clone the slot are
kept so there's some test coverage for timeline following in logical slots.

I wrote it as a PoC to show that it worked, and the best way to do that was
to write it as a new test suite component. Since it seemed useful to test
logical decoding timeline following for a slot created after a base backup
is made or cloned after pg_basebackup discarded the slots, I included it.

The approach isn't ideal. It's what I plan to do for pglogical if failover
slots doesn't make the cut for 9.6, though. The main problem is that
because the slot updates don't come through WAL they can lag behind the
catalog_xmin the master's using to make decisions about vacuuming the
catalogs. If the master advances catalog_xmin on a slot then starts writing
the resulting vacuum activity to WAL, and if we replay that WAL *before* we
see the slot advance and sync it to the replica, the replica's slot will
have an incorrect catalog_xmin that does not reflect the actual on-disk
state. Not great. However, if the client is keeping track of its
confirmed_lsn (as it must, for correctness) then we know it'll never ask to
replay anything older than what it already sent confirmation to the old
master for, before failover. Since that's how the slot got advanced and got
a new catalog_xmin. That means we won't be attempting to decode anything in
the range where the recorded catalog_xmin on the promoted standby after
failover would be a problem. The slot will advance to a sensible position
when the client specifies the new start lsn.

At least, that's my reading of things, and that's what my tests have shown
so far. We do start decoding from restart_lsn, so we'll be decoding WAL in
the range where catalog_xmin is lying about what's really in the heap, but
I don't see anywhere where we're looking at it. It's just collecting up
transaction information at that point, right?

This same issue will occur if we attempt to do failover slots v2 for 9.7
using non-WAL transport to allow decoding from a replica with failover to
cascading standbys, as you mentioned wanting earlier. We'd have to have a
way to make sure the slot state on the replica was updated before we
replayed past the point in WAL where that slot was updated to the same
state on the master.

To that end I've thought about proposing a hook to let plugins intercept
slot write-out. That way they can take note of the current server LSN and
slot state, then make sure they sync that over to the replica before it
replays WAL past that LSN. I was really hoping to get failover slots in
place so this wouldn't be necessary but it's not looking too promising, and
this would provide a somewhat safer way to capture slot advances than just
peeking at pg_replication_slots but without having to get the full failover
slots stuff in. Having this would at least eliminate the possibility of
catalog_xmin being wrong on the replica.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Reiss 2016-03-15 13:17:10 Re: RFC: replace pg_stat_activity.waiting with something more descriptive
Previous Message Julien Rouhaud 2016-03-15 12:39:38 Re: Minor bug affecting ON CONFLICT lock wait log messages