Re: Race conditions in 019_replslot_limit.pl

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: andres(at)anarazel(dot)de, hlinnaka(at)iki(dot)fi, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Race conditions in 019_replslot_limit.pl
Date: 2022-02-16 05:58:23
Message-ID: CAD21AoCmphWmVNWHa3uPDF=12z3z8yqQnBMpfSGVb+kGkvux2A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Feb 16, 2022 at 2:26 PM Kyotaro Horiguchi
<horikyota(dot)ntt(at)gmail(dot)com> wrote:
>
> At Tue, 15 Feb 2022 15:51:57 -0800, Andres Freund <andres(at)anarazel(dot)de> wrote in
> > I think what happened is that there was no WAL to receive between the start of
> > the primary and the $node_primary3->wait_for_catchup($node_standby3);
> >
> > Because the slot is created without reserving WAL that allows the primary to
> > remove the WAL segments without ever creating a slot based conflict. I think
> > that should be fixable by reserving the slot at creation time?
>
> Agreed. Doing this att all slot creation seems fine.
>
> > > and:
> > >
> > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2022-02-15%2011%3A00%3A08
> > >
> > > # Failed test 'have walsender pid 3682154
> > > # 3682136'
> > > # at t/019_replslot_limit.pl line 335.
> > > # '3682154
> > > # 3682136'
> > > # doesn't match '(?^:^[0-9]+$)'
> > >
> > > The latter looks like there are two walsenders active, which confuses the
> > > test.
> >
> > Too bad there's no plain pid in the log_line_prefix. Looks like that used to be the
> > buildfarm default, and I haven't fixed that animals configuration...
> >
> > %c apparently is hex(process startup time).hex(pid) in hex, so we're looking
> > for 382f58... Confirmed by the slot name:
> > 2022-02-15 12:10:20.874 CET [620b8a1c.382f58:5] LOG: received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( RESERVE_WAL)
> > which pg_basebackup builds using the backend pid:
> > replication_slot = psprintf("pg_basebackup_%d", (int) PQbackendPID(param->bgconn));
> >
> > The logs for that pid are:
> > 2022-02-15 12:10:20.873 CET [620b8a1c.382f58:1] LOG: connection received: host=[local]
> > 2022-02-15 12:10:20.874 CET [620b8a1c.382f58:2] LOG: replication connection authorized: user=bf application_name=019_replslot_limit.pl
> > 2022-02-15 12:10:20.874 CET [620b8a1c.382f58:3] LOG: received replication command: SHOW data_directory_mode
> > 2022-02-15 12:10:20.874 CET [620b8a1c.382f58:4] STATEMENT: SHOW data_directory_mode
> > 2022-02-15 12:10:20.874 CET [620b8a1c.382f58:5] LOG: received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( RESERVE_WAL)
> > 2022-02-15 12:10:20.874 CET [620b8a1c.382f58:6] STATEMENT: CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( RESERVE_WAL)
> > 2022-02-15 12:10:20.875 CET [620b8a1c.382f58:7] LOG: received replication command: IDENTIFY_SYSTEM
> > 2022-02-15 12:10:20.875 CET [620b8a1c.382f58:8] STATEMENT: IDENTIFY_SYSTEM
> > 2022-02-15 12:10:20.875 CET [620b8a1c.382f58:9] LOG: received replication command: START_REPLICATION SLOT "pg_basebackup_3682136" 0/600000 TIMELINE 1
> > 2022-02-15 12:10:20.875 CET [620b8a1c.382f58:10] STATEMENT: START_REPLICATION SLOT "pg_basebackup_3682136" 0/600000 TIMELINE 1
> >
> > Even though the node has log_disconnect = true, and other processes indeed log
> > their disconnection, there's no disconnect for the above session until the
> > server is shut down. Even though pg_basebackup clearly finished? Uh, huh?
>
> It seems to me so, too.
>
> > I guess it's conceivable that the backend was still working through process
> > shutdown? But it doesn't seem too likely, given that several other connections
> > manage to get through entire connect / disconnect cycles?
>
> Yes, but since postmaster seems thinking that process is gone.

Or it's possible that the process took a time to clean up the
temporary replication slot?

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro Horiguchi 2022-02-16 06:01:19 Re: Race conditions in 019_replslot_limit.pl
Previous Message Justin Pryzby 2022-02-16 05:57:56 Re: make tuplestore helper function