subscriptioncheck failure

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: David Pirotte <dpirotte(at)gmail(dot)com>, Euler Taveira <euler(at)timbira(dot)com(dot)br>
Subject: subscriptioncheck failure
Date: 2021-05-13 09:42:30
Message-ID: CAA4eK1+uW1UGDHDz-HWMHMen76mKP7NJebOTZN4uwbyMjaYVww@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I noticed $SUBJECT in curculio.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2021-05-11%2018%3A30%3A23

The failure test report:
t/020_messages.pl (Wstat: 7424 Tests: 1 Failed: 0)
Non-zero exit status: 29
Parse errors: Bad plan. You planned 5 tests but ran 1.
Files=21, Tests=149, 79 wallclock secs ( 0.03 usr 0.03 sys + 22.36
cusr 9.47 csys = 31.89 CPU)
Result: FAIL
Makefile:21: recipe for target 'check' failed
gmake: *** [check] Error 1

As far as I can understand the failure happens due to error:
2021-05-11 20:41:38.356 CEST [31177:4] 020_messages.pl ERROR:
replication slot "tap_sub" is active for PID 7354
2021-05-11 20:41:38.356 CEST [31177:5] 020_messages.pl STATEMENT:
SELECT get_byte(data, 1), encode(substr(data, 11, 8), 'escape')
FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL,
'proto_version', '1',
'publication_names', 'tap_pub',
'messages', 'true')
OFFSET 1 LIMIT 1

As per my initial analysis, this happens because in the test after
creating a subscription, we are not waiting for the subscriber to
catch up. I think there is a narrow window where it is possible that
after creating a subscription, the apply worker starts replication
even after we have done Alter Subscription .. DISABLE. This is also
visible in buildfarm logs:

2021-05-11 20:41:38.345 CEST [4046:9] 020_messages.pl STATEMENT:
SELECT get_byte(data, 0)
FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL,
'proto_version', '1',
'publication_names', 'tap_pub',
'messages', 'true')
...

2021-05-11 20:41:38.354 CEST [7354:6] tap_sub LOG: received
replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0
(proto_version '2', publication_names '"tap_pub"')

Notice that we have checked the messages before receiving the start
replication command from the subscriber. Now, if we would have waited
for the subscription to catch up then our check in pg_stat_replication
wouldn't have given a false positive. I have noticed that we do have
such wait in other subscription tests but somehow missed it here.

Thoughts?

--
With Regards,
Amit Kapila.

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dilip Kumar 2021-05-13 09:49:48 Re: RFC: Logging plan of the running query
Previous Message Bharath Rupireddy 2021-05-13 09:38:50 Re: RFC: Logging plan of the running query