Re: Logical archiving

From: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
To: Craig Ringer <craig(dot)ringer(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, boris(dot)novikov(at)acm(dot)org
Subject: Re: Logical archiving
Date: 2020-12-08 09:54:55
Message-ID: 2B44FA4B-7500-4B37-82BD-BFACA20001AD@yandex-team.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks Craig!
Probably, I should better ask in your nearby thread about logical replication, it just seemed to me that logical archiving is somewhat small independent piece of functionality...

> 7 дек. 2020 г., в 08:05, Craig Ringer <craig(dot)ringer(at)enterprisedb(dot)com> написал(а):
>
> Reply follows inline. I addressed your last point first, so it's out of order.
>
> On Fri, 4 Dec 2020 at 15:33, Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote
>
> > If OLAP cannot consume data fast enough - we are out of space due to repl slot.
>
> There is a much simpler solution to this than logical PITR.
>
> What we should be doing is teaching xlogreader how to invoke the restore_command to fetch archived WALs for decoding.

> Replication slots already have a WAL retention limit, but right now when that limit is reached the slot is invalidated and becomes useless, it's effectively dropped. Instead, if WAL archiving is enabled, we should leave the slot as valid. If a consumer of the slot needs WAL that no longer exists in pg_wal, we should have the walsender invoke the restore_command to read the missing WAL segment, decode it, and remove it again.
>
> This would not be a technically difficult patch, and it's IMO one of the more important ones for improving logical replication.
Currently we have restore_command in regular config, not in recovery.conf, so, probably, it should not be a very big deal to implement this.
>
> > I was discussing problems of CDC with scientific community and they asked this simple question: "So you have efficient WAL archive on a very cheap storage, why don't you have a logical archive too?"
>
> I've done work in this area, as has Petr (CC'd).
>
> In short, logical archiving and PITR is very much desirable, but we're not nearly ready for it yet and we're missing a lot of the foundations needed to make it really useful.
>
> IMO the strongest pre-requisite is that we need integrated DDL capture and replication in Pg. While this could be implemented in the publisher/subscriber logic for logical replication, it would make much more sense (IMO) to make it easier to feed DDL events into any logical replication output plugin.
>
> pglogical3 (the closed one) has quite comprehensive DDL replication support. Doing it is not simple though - there are plenty of complexities:
>
> * Reliably identifying the target objects and mapping them to replication set memberships for DML-replication
> * Capturing, replicating and managing the search_path and other DDL execution context (DateStyle and much more) reliably
> • Each statement type needs specific logic to indicate whether it needs DDL replication (and often filter functions since we have lots of sub-types where some need replication and some don't)
> • Handling DDL affecting global objects in pg_global correctly, like those affecting roles, grants, database security labels etc. There's no one right answer for this, it depends on the deployment and requires the user to cooperate.
> • Correct handling of transactions that mix DDL and DML (mostly only an issue for multimaster).
> • Identifying statements that target a mix of replicated and non-replicated objects and handling them appropriately, including for CASCADEs
> • Gracefully handling DDL statements that mix TEMPORARY and persistent targets. We can do this ok for DROPs but it still requires care. Anything else gets messier.
> • Lack of hooks into table rewrite operations and the extremely clumsy and inefficient way logical decoding currently exposes decoding of the temp-table data during decoding of rewrites means handling table-rewriting DDL is difficult and impractical to do correctly. In pglogical we punt on it entirely and refuse to permit DDL that would rewrite a table except where we can prove it's reliant only on immutable inputs so we can discard the upstream rewrite and rely on statement replication.
> • As a consequence of the above, reliably determining whether a given statement will cause a table rewrite.
> • Handling re-entrant ProcessUtility_hook calls for ALTER TABLE etc.
> • Handling TRUNCATE's pseudo-DDL pseudo-DML halfway state, doing something sensible for truncate cascade.
> • Probably more I've forgotten
>
> If we don't handle these, then any logical change-log archives will become largely useless as soon as there's any schema change.
>
> So we kind of have to solve DDL replication first IMO.
>
> Some consideration is also required for metadata management. Right now relation and type metadata has session-lifetime, but you'd want to be able to discard old logical change-stream archives and have the later ones still be usable. So we'd need to define some kind of restartpoint where we repeat the metadata, or we'd have to support externalizing the metadata so it can be retained when the main change archives get aged out.
>
> We'd also need to separate the existing apply worker into a "receiver" and "apply/writer" part, so the wire-protocol handling isn't tightly coupled with the actual change apply code, in order to make it possible to actually consume those archives and apply them to the database. In pglogical3 we did that by splitting them into two processes, connected by a shm_mq. Originally the process split was optional and you could run a combined receiver/writer process without the shm_mq if you wanted, but we quickly found it difficult to reliably handle locking issues etc that way so the writers all moved out-of-process.
>
> That was done mainly to make it possible to support parallelism in logical decoding apply. But we also have the intention of supporting an alternative reader process that can ingest "logical archives" and send them to the writer to apply them, as if they'd been received from the on-wire stream. That's not implemented at this time though. It'd be useful for a number of things:
>
> * PITR-style logical replay and recovery
> * Ability to pre-decode a txn once on the upstream then send the buffered protocol-stream to multiple subscribers, saving on logical decoding and reorder buffering overheads and write-multiplication costs
> * ability to ingest change-streams generated by non-postgres sources so we could support streaming foreign-data ingestion, streaming OLAP and data warehousing, etc
>
> To make logical PITR more useful we'd also want to be a bit more tolerant of schema divergence, though that's not overly hard to do:
> • fill defaults for downstream columns if no value is present for the column in the upstream row and the downstream column is nullable or has a default (I think built-in logical rep does this one already)
> • ignore values for columns in upstream data if the downstream table lacks the column and the upstream value is null
> • optionally allow apply to be configured to ignore non-null data in upstream columns that're missing on downstream
> • optionally allow apply to be configured to drop rows on the floor if the downstream table is missing
> • policies for handling data conflicts like duplicate PKs
> and we'd probably want ways to filter the apply data-stream to apply changes for only a subset of tables, rows, etc at least in a later version.
>
> None of this is insurmountable. Most or all of the DDL replication support and divergence-tolerance stuff is already done in production deployments using pglogical3 and bdr3.
I really like this wording for "divergence-tolerance" stuff, it captures problems I want to solve. I believe it's somewhat orthogonal to other issues.

>
> While I can't share the code, I am happy to share the experience I have gained from my part in working on these things. As you've probably recently seen with the wiki article I wrote on physical/logical failover interop.
>
> You're free to take information like this and use it in wiki articles too.
>
> Right now I won't be able to launch into writing big patches for these things, but I'll do my best to share what I can and review things.
>
> > This seems like a wild idea. But really, we have a super expensive NVMe drives for OLTP workload. And use this devices to store buffer for data to be dumped into MapReduce\YT analytical system.
>
> It's not a wild idea at all, as noted above.
>
> In pglogical3 we already support streaming decoded WAL data to alternative writer downstreams including RabbitMQ and Kafka via writer plugins.
Yes, Yandex.Cloud Transfer Manger supports it too. But it has to be resynced after physical failover. And internal installation of YC have mandatory drills: few times in a month one datacenter is disconnected and failover happens for thousands a DBS.

Thank you for your input. Probably, I'll put some efforts into loading missing WAL as a first step towards bright future :)

Best regards, Andrey Borodin.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Konstantin Knizhnik 2020-12-08 09:59:25 Re: Wrong check in pg_visibility?
Previous Message Michael Paquier 2020-12-08 09:44:55 Re: small cleanup in unicode_norm.c