Re: prevent immature WAL streaming

From: Andres Freund <andres(at)anarazel(dot)de>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Pg Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Bossart, Nathan" <bossartn(at)amazon(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, 蔡梦娟(玊于) <mengjuan(dot)cmj(at)alibaba-inc(dot)com>, Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>
Subject: Re: prevent immature WAL streaming
Date: 2021-08-31 04:29:49
Message-ID: 20210831042949.52eqp5xwbxgrfank@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2021-08-23 18:52:17 -0400, Alvaro Herrera wrote:
> Included 蔡梦娟 and Jakub Wartak because they've expressed interest on
> this topic -- notably [2] ("Bug on update timing of walrcv->flushedUpto
> variable").
>
> As mentioned in the course of thread [1], we're missing a fix for
> streaming replication to avoid sending records that the primary hasn't
> fully flushed yet. This patch is a first attempt at fixing that problem
> by retreating the LSN reported as FlushPtr whenever a segment is
> registered, based on the understanding that if no registration exists
> then the LogwrtResult.Flush pointer can be taken at face value; but if a
> registration exists, then we have to stream only till the start LSN of
> that registered entry.

I'm doubtful that the approach of adding awareness of record boundaries
is a good path to go down:

- It adds nontrivial work to hot code paths to handle an edge case,
rather than making rare code paths more expensive.

- There are very similar issues with promotions of replicas (consider
what happens if we need to promote with the end of local WAL spanning
a segment boundary, and what happens to cascading replicas). We have
some logic to try to deal with that, but it's pretty grotty and I
think incomplete.

- It seems to make some future optimizations harder - we should work
towards replicating data sooner, rather than the opposite. Right now
that's a major bottleneck around syncrep.

- Once XLogFlush() for some LSN has returned, we can write that LSN to
disk. The LSN doesn't necessarily have to correspond to a specific
on-disk location (it could e.g. be the return value from
GetFlushRecPtr()). But "rewinding" to before the last record makes that
problematic.

- I suspect that schemes with heuristic knowledge of segment-spanning
records have deadlock or at least latency spike issues. What
if synchronous commit needs to flush up to a certain record boundary,
but streaming rep doesn't replicate it out because there are
segment-spanning records both before and after?

I think a better approach might be to handle this on the WAL layout
level. What if we never overwrote partial records but instead just
skipped over them during decoding?

Of course there are some difficulties with that - the checksum and the
length from the record header aren't going to be meaningful.

But we could deal with that using a special flag in the
XLogPageHeaderData.xlp_info of the following page. If that flag is set,
xlp_rem_len could contain the checksum of the partial record.
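
A minimal sketch of how such a flag could behave, under stated
assumptions. Nothing here is existing PostgreSQL code: the struct is a
cut-down stand-in for XLogPageHeaderData, the flag name and bit value
(SKETCH_XLP_PREV_PARTIAL) are hypothetical, and the reader-side rule
(skip the partial record only if the stored checksum matches the bytes
actually found) is one possible interpretation. CRC-32C is used because
that is the checksum WAL records already carry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit for xlp_info; name and value are assumptions. */
#define SKETCH_XLP_PREV_PARTIAL 0x0008

/* Cut-down stand-in for XLogPageHeaderData (only the fields discussed). */
typedef struct SketchPageHeader
{
    uint16_t xlp_info;     /* flag bits */
    uint32_t xlp_rem_len;  /* normally: remaining bytes of a continued record */
} SketchPageHeader;

typedef enum
{
    SKETCH_ASSEMBLE,      /* flag not set: treat continuation as usual */
    SKETCH_SKIP_PARTIAL,  /* flag set and checksum matches: skip the partial record */
    SKETCH_MISMATCH       /* flag set but checksum differs: treat as invalid WAL */
} SketchAction;

/* Plain bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
uint32_t
sketch_crc32c(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++)
    {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/*
 * Writer side: instead of overwriting the partial record, mark the
 * following page and store the checksum of the partial bytes in
 * xlp_rem_len, which has no other meaning in this case.
 */
void
sketch_mark_partial(SketchPageHeader *next_page,
                    const unsigned char *partial, size_t partial_len)
{
    assert(next_page != NULL);
    next_page->xlp_info |= SKETCH_XLP_PREV_PARTIAL;
    next_page->xlp_rem_len = sketch_crc32c(partial, partial_len);
}

/*
 * Reader side: on reaching the following page, decide what to do with
 * the incomplete record bytes collected so far.
 */
SketchAction
sketch_on_next_page(const SketchPageHeader *next_page,
                    const unsigned char *partial, size_t partial_len)
{
    if ((next_page->xlp_info & SKETCH_XLP_PREV_PARTIAL) == 0)
        return SKETCH_ASSEMBLE;
    if (next_page->xlp_rem_len == sketch_crc32c(partial, partial_len))
        return SKETCH_SKIP_PARTIAL;
    return SKETCH_MISMATCH;
}
```

The point of the sketch is that the extra work lands only on the rare
path where a partial record was abandoned; a page without the flag is
handled exactly as before.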

Greetings,

Andres Freund
