Re: A failure of standby to follow timeline switch

From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: masao(dot)fujii(at)oss(dot)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: A failure of standby to follow timeline switch
Date: 2021-01-08 20:08:43
Message-ID: 20210108200843.GA26309@alvherre.pgsql
Lists: pgsql-hackers

Masao-san: Are you intending to act as committer for these? Since the
bug is mine I can look into it, but since you already did all the
reviewing work, I'm good with you giving it the final push.

0001 looks good to me; let's get that one committed quickly so that we
can focus on the interesting stuff. While the implementation of
find_in_log is quite dumb (not this patch's fault), it seems sufficient
to deal with small log files. We can improve the implementation later,
if needed, but we have to get the API right on the first try.

0003: The fix looks good to me. I verified that the test fails without
the fix, and it passes with the fix.

The test added in 0002 is a bit optimistic regarding timing, as well as
potentially slow; it loops 1000 times and sleeps 100 milliseconds each
time. On a very slow server (valgrind or clobber_cache animals) this
might not be enough time, while on fast servers it may end up
waiting longer than needed. Maybe we can do something like this:

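for (my $i = 0; $i < 1000; $i++)
{
    # Remember the log size before searching, so that the next pass
    # only scans what was written after this point.
    my $current_log_size = determine_current_log_size();

    if ($node_standby_3->find_in_log(
            "requested WAL segment [0-9A-F]+ has already been removed",
            $logstart))
    {
        # Failure: the standby asked for a segment that is gone.
        last;
    }
    elsif ($node_standby_3->find_in_log(
            "End of WAL reached on timeline",
            $logstart))
    {
        $success = 1;
        last;
    }
    $logstart = $current_log_size;

    # Sleep only until the log grows, rather than for a fixed interval.
    while (determine_current_log_size() == $current_log_size)
    {
        usleep(10_000);
        # with a retry count?
    }
}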
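
For illustration only, here is a minimal sketch of what the placeholder
determine_current_log_size() and the "retry count" idea could look like.
Both helpers are hypothetical, not existing test-framework API: the sketch
assumes the log file path is passed in explicitly, and wait_for_log_growth
is a name made up here.

```perl
use strict;
use warnings;
use Time::HiRes qw(usleep);

# Hypothetical helper: size of the log file in bytes, or 0 if the
# file does not exist yet.
sub determine_current_log_size
{
    my ($logfile) = @_;
    my $size = -s $logfile;
    return defined($size) ? $size : 0;
}

# Hypothetical helper: poll every 10 ms until the log grows past
# $old_size, giving up after $max_retries polls.  Returns the new
# size, or undef if the log did not grow in time.
sub wait_for_log_growth
{
    my ($logfile, $old_size, $max_retries) = @_;

    for (my $i = 0; $i < $max_retries; $i++)
    {
        my $size = determine_current_log_size($logfile);
        return $size if $size > $old_size;
        usleep(10_000);
    }
    return;
}
```

With something like this, the inner while loop would become a single call,
and an undef result could be treated as a test failure instead of spinning
forever on a server that has stopped logging.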

With the test patch, make check PROVE_FLAGS="--timer" PROVE_TESTS=t/001_stream_rep.pl gives:

ok 6386 ms ( 0.00 usr 0.00 sys + 1.14 cusr 0.93 csys = 2.07 CPU)
ok 6352 ms ( 0.00 usr 0.00 sys + 1.10 cusr 0.94 csys = 2.04 CPU)
ok 6255 ms ( 0.01 usr 0.00 sys + 0.99 cusr 0.97 csys = 1.97 CPU)

Without the test patch:

ok 4954 ms ( 0.00 usr 0.00 sys + 0.71 cusr 0.64 csys = 1.35 CPU)
ok 5033 ms ( 0.01 usr 0.00 sys + 0.71 cusr 0.73 csys = 1.45 CPU)
ok 4991 ms ( 0.01 usr 0.00 sys + 0.73 cusr 0.59 csys = 1.33 CPU)

--
Álvaro Herrera
