From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, hlinnaka <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Race condition in recovery? |
Date: | 2021-06-03 20:33:14 |
Message-ID: | CA+Tgmoax2CvOBNZt4urYCBVp9LgczJMK7befQLgj-8bP+jmbdQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, May 27, 2021 at 2:26 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> Changed as suggested.
I don't think the code as written here is going to work on Windows,
because your code doesn't duplicate enable_restoring's call to
perl2host or its backslash-escaping logic. It would really be better
if we could use enable_restoring directly. Also, I discovered that the
'return' in cp_history_files should really say 'exit', because
otherwise it generates a complaint every time it's run. It should also
have 'use strict' and 'use warnings' at the top.
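For illustration only, here is a minimal sketch of an archive_command helper in the spirit of cp_history_files. It is not the script from the patch. It assumes, hypothetically, that the script receives a source path and a target path and should copy only timeline history files. The subroutine names `is_history_file` and `archive_one` are invented here. The sketch shows the two points above. Top-level code ends with `exit`, because perl dies with "Can't return outside a subroutine" when `return` appears in mainline code, while `return` is still fine inside subroutines. The file also starts with `use strict` and `use warnings`.

```perl
#!/usr/bin/perl
# Hypothetical sketch of an archive_command helper in the spirit of
# cp_history_files: copy timeline history files, silently skip the rest.
use strict;
use warnings;
use File::Copy qw(copy);

# True for timeline history files such as "00000002.history".
sub is_history_file {
    my ($path) = @_;
    return $path =~ /\.history\z/ ? 1 : 0;
}

# Copy SOURCE to TARGET if it is a history file. Returns 1 if copied,
# 0 if skipped; dies (non-zero exit) if the copy itself fails.
# 'return' is legal here because we are inside a subroutine.
sub archive_one {
    my ($source, $target) = @_;
    return 0 unless is_history_file($source);
    copy($source, $target)
      or die "could not copy \"$source\" to \"$target\": $!\n";
    return 1;
}

# Mainline code is not inside a subroutine, so it must finish with
# 'exit'; a bare 'return' here would make perl die at run time.
if (@ARGV) {
    die "usage: $0 SOURCE TARGET\n" unless @ARGV == 2;
    archive_one(@ARGV);
    exit 0;    # report success to the archiver, whether copied or skipped
}
```

Run with no arguments, the file only defines the subroutines, so they are easy to exercise in isolation. As an archive_command it could be invoked with the `%p` and `%f` placeholders, for example `perl cp_history_files "%p" "/some/archive/dir/%f"`, where the archive directory is a placeholder. Exiting 0 for skipped files means the archiver treats non-history WAL segments as successfully archived.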
Here's a version of your test case patch with the 1-line code fix
added, the above issues addressed, and a bunch of cosmetic tweaks.
Unfortunately, it doesn't pass for me consistently. I'm not sure if
that's because I broke something with my changes, or because the test
contains an underlying race condition which we need to address.
Attached also are the log files from a failed run if you want to look
at them. The key lines seem to be:
2021-06-03 16:16:53.984 EDT [47796] LOG: restarted WAL streaming at 0/3000000 on timeline 2
2021-06-03 16:16:54.197 EDT [47813] 025_stuck_on_old_timeline.pl LOG: statement: SELECT count(*) FROM tab_int
2021-06-03 16:16:54.197 EDT [47813] 025_stuck_on_old_timeline.pl ERROR: relation "tab_int" does not exist at character 22
Or from the main log:
Waiting for replication conn cascade's replay_lsn to pass '0/3000000' on standby
done
error running SQL: 'psql:<stdin>:1: ERROR: relation "tab_int" does not exist
LINE 1: SELECT count(*) FROM tab_int
^'
I wonder whether that problem points to an issue with this incantation:
$node_standby->wait_for_catchup($node_cascade, 'replay',
$node_standby->lsn('replay'));
But I'm not sure, and I'm out of time to investigate for today.
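As we understand it, `wait_for_catchup` polls `pg_stat_replication` on the upstream node (here the standby) until the named connection's `replay_lsn` is at or past the target LSN. That is what the "Waiting for replication conn cascade's replay_lsn to pass '0/3000000' on standby" line reports. The helpers below are hypothetical and sketch only that comparison. They assume a perl built with 64-bit integers and an LSN written as two hex halves separated by "/".

```perl
#!/usr/bin/perl
# Hypothetical helpers illustrating the LSN comparison behind
# "replay_lsn to pass '0/3000000'"; assumes a perl with 64-bit integers.
use strict;
use warnings;

# Convert a textual LSN such as "0/3000000" (high and low 32-bit halves
# in hex, separated by "/") into one integer.
sub lsn_to_int {
    my ($lsn) = @_;
    my ($hi, $lo) = $lsn =~ m{\A([0-9A-Fa-f]{1,8})/([0-9A-Fa-f]{1,8})\z}
      or die "malformed LSN: $lsn\n";
    return (hex($hi) << 32) | hex($lo);
}

# True when the reported position is at or beyond the target, the kind
# of condition the catch-up wait is satisfied by.
sub lsn_reached {
    my ($target, $reported) = @_;
    return lsn_to_int($reported) >= lsn_to_int($target) ? 1 : 0;
}
```

Because the check compares positions only, it says nothing about WAL beyond the target. If the target is sampled (here via `$node_standby->lsn('replay')`) before the standby has replayed a particular record, the wait can complete before that record has reached the cascade.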
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v4-0001-Fix-corner-case-failure-of-new-standby-to-follow-.patch | application/octet-stream | 5.9 KB |
025_stuck_on_old_timeline_primary.log | application/octet-stream | 3.0 KB |
025_stuck_on_old_timeline_cascade.log | application/octet-stream | 3.2 KB |
regress_log_025_stuck_on_old_timeline | application/octet-stream | 6.7 KB |
025_stuck_on_old_timeline_standby.log | application/octet-stream | 4.8 KB |