Re: Race-like failure in recovery/t/009_twophase.pl

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Race-like failure in recovery/t/009_twophase.pl
Date: 2017-07-02 20:40:49
Message-ID: 2588.1499028049@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
> The reporting critters are all on the slow side, so I suspected
> a timing problem, especially since it only started to show up
> after changing pg_ctl's timing behavior. I can't reproduce it
> locally on unmodified sources, but I could after putting my thumb
> on the scales like this:
>
> diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
> index ea9d21a..7ef22c1 100644
> *** a/src/backend/replication/walreceiver.c
> --- b/src/backend/replication/walreceiver.c
> *************** WalReceiverMain(void)
> *** 446,451 ****
> --- 446,454 ----
> * Something was received from master, so reset
> * timeout
> */
> +
> + pg_usleep(100000); /* be slow about processing */
> +
> last_recv_timestamp = GetCurrentTimestamp();
> ping_sent = false;
> XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);

BTW, I neglected to mention that this hack also produces a second failure
in the recovery test suite:

[20:28:31] t/001_stream_rep.pl .................. 11/28
# Failed test 'xmin of cascaded slot null with no hs_feedback'
# at t/001_stream_rep.pl line 178.
# got: undef
# expected: ''

# Failed test 'catalog xmin of cascaded slot null with no hs_feedback'
# at t/001_stream_rep.pl line 179.
# got: undef
# expected: ''
[20:28:31] t/001_stream_rep.pl .................. 28/28 # Looks like you failed 2 tests of 28.
[20:28:31] t/001_stream_rep.pl .................. Dubious, test returned 2 (wstat 512, 0x200)
Failed 2/28 subtests

I had supposed last night that this was merely a matter of needing
to insert a wait_slot_xmins() call before those tests, but doing so
doesn't help. And, looking closer, the report is complaining that
$xmin and $catalog_xmin are coming back as undef, not empty strings.
My Perl-fu is too weak to figure out how that could possibly be ---
the documentation for PostgresNode::query_hash certainly states
flatly that it will never happen. And it even more certainly
shouldn't happen if we already successfully probed the
pg_replication_slots row in wait_slot_xmins().

The corresponding entries in the postmaster's log don't look abnormal:

2017-07-02 20:07:29.313 UTC [30862] LOG: database system is ready to accept read only connections
2017-07-02 20:07:29.318 UTC [30867] LOG: started streaming WAL from primary at 0/3000000 on timeline 1
2017-07-02 20:07:29.415 UTC [30870] t/001_stream_rep.pl LOG: statement: SELECT pg_create_physical_replication_slot('standby_2');
2017-07-02 20:07:29.619 UTC [30863] LOG: redo starts at 0/3029458
2017-07-02 20:07:29.665 UTC [30882] t/001_stream_rep.pl LOG: statement:
SELECT xmin IS NULL AND catalog_xmin IS NULL
FROM pg_catalog.pg_replication_slots
WHERE slot_name = 'standby_2';

2017-07-02 20:07:29.679 UTC [30884] t/001_stream_rep.pl LOG: statement: SELECT plugin, slot_type, datoid, database, active, active_pid, xmin, catalog_xmin, restart_lsn FROM pg_catalog.pg_replication_slots WHERE slot_name = 'standby_2'

(This is from a run with a wait_slot_xmins() call added, else the
statement logged by PID 30882 would not have been issued.)

Any ideas what's wrong there?

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2017-07-02 21:10:31 Re: Race-like failure in recovery/t/009_twophase.pl
Previous Message Andres Freund 2017-07-02 19:12:22 Re: [HACKERS] Segmentation fault in libpq