Re: Refactor pg_rewind code and make it work against a standby

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Soumyadeep Chakraborty <soumyadeep2007(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Refactor pg_rewind code and make it work against a standby
Date: 2020-09-24 17:27:22
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 20/09/2020 23:44, Soumyadeep Chakraborty wrote:
> Before getting into the code review for the patch, I wanted to know why
> we don't use a Bitmapset for target_modified_pages?

Bitmapset is not available in client code. Perhaps it could be moved to
src/common with some changes, but doesn't seem worth it until there's
more client code that would need it.

I'm not sure that a bitmap is the best data structure for tracking the
changed blocks in the first place. A hash table might be better if there
are only a few changed blocks, or something like
src/backend/lib/integerset.c if there are many. But as long as the
simple bitmap gets the job done, let's keep it simple.

> 2. Rename target_modified_pages to target_pages_to_overwrite?
> target_modified_pages can lead to confusion as to whether it includes pages
> that were modified on the target but not even present in the source and
> the other exclusionary cases. target_pages_to_overwrite is much clearer.

Agreed, I'll rename it.

Conceptually, while we're scanning source WAL, we're just making note of
the modified blocks. The decision on what to do with them happens only
later, in decide_file_action(). The difference is purely theoretical,
though. There is no real decision to be made, all the modified blocks
will be overwritten. So on the whole, I agree 'target_page_to_overwrite'
is better.

>> /*
>> * If this is a relation file, copy the modified blocks.
>> *
>> * This is in addition to any other changes.
>> */
>> iter = datapagemap_iterate(&entry->target_modified_pages);
>> while (datapagemap_next(iter, &blkno))
>> {
>> offset = blkno * BLCKSZ;
>> source->queue_fetch_range(source, entry->path, offset, BLCKSZ);
>> }
>> pg_free(iter);
> Can we put this hunk into a static function overwrite_pages()?

Meh, it's only about 10 lines of code, and one caller.

> 4.
>> * block that have changed in the target system. It makes note of all the
>> * changed blocks in the pagemap of the file.
> Can we replace the above with:
>> * block that has changed in the target system. It decides if the given
> blkno in the target relfile needs to be overwritten from the source.

Ok. Again conceptually though, process_target_wal_block_change() just
collects information, and the decisions are made later. But you're right
that we do leave out truncated-away blocks here, so we are doing more
than just noting all the changed blocks.

>> /*
>> * Doesn't exist in either server. Why does it have an entry in the
>> * first place??
>> */
> Can we delete the above hunk and add the following assert to the very
> top of decide_file_action():
> Assert(entry->target_exists || entry->source_exists);

I would like to keep the check even when assertions are not enabled.
I'll add an Assert(false) there.

> 7. Please address the FIXME for the symlink case:
> /* FIXME: Check if it points to the same target? */

It's not a new issue. Would be nice to fix, of course. I'm not sure what
the right thing to do would be. If you have e.g. replaced
postgresql.conf with a symlink that points outside the data directory,
would it be appropriate to overwrite it? Or perhaps we should throw an
error? We also throw an error if a file is a symlink in the source but a
regular file in the target, or vice versa.

> 8.
> * it anyway. But there's no harm in copying it now.)
> and
> * copy them here. But we don't know which scenario we're
> * dealing with, and there's no harm in copying the missing
> * blocks now, so do it now.
> Could you add a line or two explaining why there is "no harm" in these
> two hunks above?

The previous sentences explain that there's a WAL record covering them.
So they will be overwritten by WAL replay anyway. Does it need more

> 14. queue_overwrite_range(), finish_overwrite() instead of
> queue_fetch_range(), finish_fetch()? Similarly update\
> *_fetch_file_range() and *_finish_fetch()
> 15. Let's have local_source.c and libpq_source.c instead of *_fetch.c

Good idea! And fetch.h -> rewind_source.h.

I also moved the code in execute_file_actions() function to pg_rewind.c,
into a new function: perform_rewind(). It felt a bit silly to have just
execute_file_actions() in a file of its own. perform_rewind() now does
all the modifications to the data directory, writing the backup file.
Except for writing the recovery config: that also needs to be when
there's no rewind to do, so it makes sense to keep it separate. What do
you think?

> 16.
>> conn = PQconnectdb(connstr_source);
>> if (PQstatus(conn) == CONNECTION_BAD)
>> pg_fatal("could not connect to server: %s",
>> PQerrorMessage(conn));
>> if (showprogress)
>> pg_log_info("connected to server");
> The above hunk should be part of init_libpq_source(). Consequently,
> init_libpq_source() should take a connection string instead of a conn.

The libpq connection is also needed by WriteRecoveryConfig(), that's why
it's not fully encapsulated in libpq_source.

> 19.
>> typedef struct
>> {
>> const char *path; /* path relative to data directory root */
>> uint64 offset;
>> uint32 length;
>> } fetch_range_request;
> offset should be of type off_t

The 'offset' argument to the queue_fetch_range function is uint64, and
the argument to the SQL-callable pg_read_binary_file() isint8, so it's
consistent with them. Then again, the 'len' argument to
queue_fetch_range() is a size_t, and to pg_read_binary_file() int8, so
it's not fully consistent with that either. I'll try to make it more

Thanks for the review! Attached is a new version of the patch set.

- Heikki

Attachment Content-Type Size
v3-0001-pg_rewind-Move-syncTargetDirectory-to-file_ops.c.patch text/x-patch 4.5 KB
v3-0002-Refactor-pg_rewind-for-more-clear-decision-making.patch text/x-patch 30.6 KB
v3-0003-pg_rewind-Replace-the-hybrid-list-array-data-stru.patch text/x-patch 21.8 KB
v3-0004-pg_rewind-Refactor-the-abstraction-to-fetch-from-.patch text/x-patch 43.0 KB
v3-0005-Allow-pg_rewind-to-use-a-standby-server-as-the-so.patch text/x-patch 18.2 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2020-09-24 17:47:32 Re: proposal: possibility to read dumped table's name from file
Previous Message James Coleman 2020-09-24 16:51:55 Incremental sort docs and release announcement