Re: Allow replication roles to use file access functions

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Allow replication roles to use file access functions
Date: 2015-09-03 02:40:31
Message-ID: CAB7nPqSw9SOY1vZZcNGE3EPJnSZzPTeGfj49C+54Jgwf2V-b2A@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 3, 2015 at 11:20 AM, Stephen Frost wrote:
> * Michael Paquier wrote:
>> 1) Use a differential backup to me, or the possibility to fetch a set
>> of data block diffs from a source node using an LSN and then re-apply
>> them on the target node. The major disadvantage of this approach is
>> actually performance: we would need to scan the source node
>> completely, so that's slow. And pg_rewind is fast because it only
>> scans the WAL records of the node to be rewound from the last
>> checkpoint before WAL forked.
>
> I don't follow this performance concern at all. Why can't pg_rewind
> look through the WAL and find what it needs, and then request exactly
> that information? Clearly, we would need to add to the existing
> replication protocol, but I don't see any reason to not consider that a
> perfectly reasonable approach for this.

See below; evidently I misunderstood what you meant.

>> 2) Request data blocks from the source node using the replication
>> protocol, that's what pg_read_binary_file actually does, we just don't
>> have the logic on replication protocol side, though we could.
>
> Right, we would need to modify the replication protocol to allow such
> requests, but that's not particularly difficult.

Check.

>> 3) Create a new set of functions similar to the existing file access
>> functions that are usable by the replication user, except that they
>> refuse to return file entries that match the existing filters in
>> basebackup.c. That is doable, with a routine in basebackup.c that
>> decides if a given file string can be read or not. Base backups could
>> use this routine as well.
>
> I don't particularly like this approach as it implies SQL access for the
> replication role.

Check.

>> > This is definitely a big part of
>> > the question, but I'd like to ask- what, exactly, does pg_rewind
>> > actually need access to? Is it only the WAL, or are heap and WAL files
>> > needed?
>>
>> Not only, +clog, configuration files, etc.
>
> Configuration files? Perhaps you could elaborate?

Sure. Sorry for being unclear. It copies everything that is not a
relation file, so in effect a base backup without the relation files.
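As a rough illustration of that split, the sketch below separates relation files from everything else by path. The patterns are a simplifying assumption made for this example (main-fork files with an optional segment suffix only) and are not taken from the pg_rewind source; the function names are hypothetical.

```python
import re

# Assumed, simplified patterns for relation data files. Only main-fork
# files with an optional ".segno" suffix are matched; this is an
# illustration, not the check pg_rewind itself performs.
_REL_PATTERNS = [
    re.compile(r"^global/\d+(\.\d+)?$"),
    re.compile(r"^base/\d+/\d+(\.\d+)?$"),
    re.compile(r"^pg_tblspc/\d+/[^/]+/\d+/\d+(\.\d+)?$"),
]


def is_relation_file(path: str) -> bool:
    """Return True if the path looks like a relation data file."""
    return any(p.match(path) for p in _REL_PATTERNS)


def files_to_copy_whole(paths):
    """Keep only the paths that would be copied in full (non-relation files)."""
    return [p for p in paths if not is_relation_file(p)]
```

With this split, clog segments and configuration files fall into the "copy in full" set, while relation files would be handled block by block.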

>> > Consider the discussion about delta backups, et al, using LSNs. Perhaps
>> > the replication protocol should be extended to allow access to arbitrary
>> > WAL, querying what WAL is available, and access to the heap files,
>> > perhaps even specific pages in the heap files and relation forks,
>> > instead of giving pg_rewind access to these extremely general
>> > nearly-OS-user-level functions.
>>
>> The problem when using differential backups in this case is
>> performance as mentioned above. We would need to scan the whole target
>> cluster, which may take time, the current approach of pg_rewind only
>> needs to scan WAL records to find the list of blocks modified, and
>> directly requests them from the source. I would expect pg_rewind to be
>> as quick as possible.
>
> I don't follow why the current approach of pg_rewind would have to
> change. All I'm suggesting is that we have a different way, one which
> is much more restricted, for pg_rewind to request exactly the
> information it needs for efficient operation.

Ah, OK. I thought that you were referring to a protocol where the caller
sends a single LSN and gets back a differential backup: the source
would have to scan all of its relation files to find the data
blocks with an LSN newer than the one sent, and then send them back
to the caller.
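To make the cost of that LSN-based variant concrete, here is a minimal sketch of the scan it implies. It assumes the default 8kB block size and relies on the page LSN being stored in the first 8 bytes of each page header as two 32-bit halves in native byte order; the function names are hypothetical.

```python
import struct

BLCKSZ = 8192  # assumed: default PostgreSQL block size


def page_lsn(page: bytes) -> int:
    """Extract the page LSN from the first 8 bytes of the page header."""
    # Stored as two 32-bit halves (high, low) in native byte order.
    hi, lo = struct.unpack_from("=II", page, 0)
    return (hi << 32) | lo


def blocks_newer_than(data: bytes, start_lsn: int):
    """Return the block numbers whose page LSN is greater than start_lsn."""
    newer = []
    for blkno in range(len(data) // BLCKSZ):
        page = data[blkno * BLCKSZ:(blkno + 1) * BLCKSZ]
        if page_lsn(page) > start_lsn:
            newer.append(blkno)
    return newer
```

Every page of every relation file has to be read to answer the question, which is where the performance concern comes from.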

I guess that what you are suggesting instead is an approach where
the caller sends something like this through the replication protocol,
with a relation OID and a block list:
BLOCK_DIFF relation_oid BLOCK_LIST m,n,[o, ...]
That is close to what pg_read_binary_file does now for a superuser.
We would also need to extend BASE_BACKUP so that it does not include
relation files for this use case.
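A minimal sketch of how such a request could be handled, assuming the command syntax written above and an 8kB block size. BLOCK_DIFF does not exist in the replication protocol; the parser, the helper names, and the in-memory file standing in for a relation segment are all hypothetical.

```python
import io

BLCKSZ = 8192  # assumed: default PostgreSQL block size


def parse_block_diff(command: str):
    """Parse 'BLOCK_DIFF <relation_oid> BLOCK_LIST <m,n,...>' (hypothetical)."""
    tokens = command.split(None, 3)
    if len(tokens) != 4 or tokens[0] != "BLOCK_DIFF" or tokens[2] != "BLOCK_LIST":
        raise ValueError("malformed BLOCK_DIFF command: %r" % command)
    relation_oid = int(tokens[1])
    blocks = [int(b) for b in tokens[3].split(",")]
    return relation_oid, blocks


def read_blocks(relfile, blocks):
    """Read only the requested blocks from an open relation file."""
    pages = {}
    for blkno in blocks:
        relfile.seek(blkno * BLCKSZ)
        page = relfile.read(BLCKSZ)
        if len(page) != BLCKSZ:
            raise ValueError("block %d is beyond the end of the file" % blkno)
        pages[blkno] = page
    return pages


# Usage against an in-memory stand-in for a relation segment.
segment = io.BytesIO(b"a" * BLCKSZ + b"b" * BLCKSZ + b"c" * BLCKSZ)
oid, blocks = parse_block_diff("BLOCK_DIFF 16385 BLOCK_LIST 0,2")
pages = read_blocks(segment, blocks)
```

The server only touches the blocks named in the request, so the work is proportional to the block list built from the WAL scan, not to the size of the cluster.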

Regards,
--
Michael
