Re: Race conditions in 019_replslot_limit.pl

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, hlinnaka(at)iki(dot)fi, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Race conditions in 019_replslot_limit.pl
Date: 2022-02-17 02:14:44
Message-ID: 20220217021444.wnaxfigeyvogyzi3@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2022-02-16 20:22:23 -0500, Tom Lane wrote:
> There's no disconnection log entry for either, which I suppose means
> that somebody didn't bother logging disconnection for walsenders ...

The thing is, we actually *do* log disconnection for walsenders:

2022-02-16 17:53:55.780 PST [2459806][walsender][:0] LOG: disconnection: session time: 0:00:01.321 user=andres database= host=[local]

The only way I see for the disconnection not to be logged is an immediate
shutdown.

It'd be one thing if the missing walsender disconnect wasn't logged because we
shut down just after. But unless I misunderstand something, between the
basebackup and the failing test, we actually start standby_3:

# Running: pg_basebackup -D /home/bf/build/buildfarm-serinus/HEAD/pgsql.build/src/test/recovery/tmp_check/t_019_replslot_limit_primary3_data/backup/my_backup -h /tmp/hyrtnyd2RU -p 53774 --checkpoint fast --no-sync
# Backup finished

...

# Running: pg_ctl -w -D /home/bf/build/buildfarm-serinus/HEAD/pgsql.build/src/test/recovery/tmp_check/t_019_replslot_limit_standby_3_data/pgdata -l /home/bf/build/buildfarm-serinus/HEAD/pgsql.build/src/test/recovery/tmp_check/log/019_replslot_limit_standby_3.log -o --cluster-name=standby_3 start

and also wait till replication has caught up:

Waiting for replication conn standby_3's replay_lsn to pass 0/700000 on primary3
done

and only then we have

not ok 17 - have walsender pid 3682154
# 3682136

# Failed test 'have walsender pid 3682154
# 3682136'
# at t/019_replslot_limit.pl line 335.
# '3682154
# 3682136'
# doesn't match '(?^:^[0-9]+$)'

Starting a node in recovery and having it connect to the primary seems like a
mighty long time for a process to exit, unless it's stuck behind something.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2022-02-17 02:17:52 Re: killing perl2host
Previous Message Tom Lane 2022-02-17 02:12:12 Re: libpq async duplicate error results