RE: [PoC] pg_upgrade: allow to upgrade publisher node

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>, John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Bruce Momjian <bruce(at)momjian(dot)us>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: RE: [PoC] pg_upgrade: allow to upgrade publisher node
Date: 2023-11-29 09:26:26
Message-ID: OS3PR01MB9882FED1F0060468FB01B9DAF583A@OS3PR01MB9882.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

Dear hackers,

> > >
> > > Pushed!
> >
> > Hi all, the CF entry for this is marked RfC, and CI is trying to apply
> > the last patch committed. Is there further work that needs to be
> > re-attached and/or rebased?
> >
>
> No. I have marked it as committed.
>

I found another failure related to this commit [1]. I think it is caused by
autovacuum. I would like to propose a patch that disables autovacuum on the old publisher.

Please see below for details.

# Analysis of the failure

Summary: the failure occurs when autovacuum starts after the subscription
is disabled but before pg_upgrade runs.

According to the regress log, pg_upgrade failed unexpectedly [2]. The slots could
not have been invalidated, so it seems that some WAL records were generated
after the subscription was disabled.

Also, the server log of oldpub shows that an autovacuum worker was terminated when
the server stopped [3]. This happened after the walsender had released the logical slots.
The WAL records generated by the autovacuum worker could not be consumed by the slots,
so the upgrade function returned false.
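
To make the failure condition concrete, here is a toy model of the check, not the actual pg_upgrade code. It assumes a slot counts as "caught up" only when no decodable WAL record lies at or after its confirmed flush position. The names `WalRecord`, `decodable`, and `slot_has_caught_up` are hypothetical, and the LSN values are invented for illustration.

```python
from dataclasses import dataclass


def parse_lsn(text: str) -> int:
    """Convert a PostgreSQL LSN such as '0/16B3748' to a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


@dataclass
class WalRecord:
    lsn: str         # start position of the record
    source: str      # who wrote it (illustrative only)
    decodable: bool  # hypothetical flag: would logical decoding need it?


def slot_has_caught_up(confirmed_flush_lsn: str, wal: list[WalRecord]) -> bool:
    """Return True if no decodable record lies at or after the slot's position."""
    flushed = parse_lsn(confirmed_flush_lsn)
    return not any(r.decodable and parse_lsn(r.lsn) >= flushed for r in wal)


# WAL already consumed by the subscriber before it was disabled.
wal = [WalRecord("0/3000100", "regular transaction", True)]
before_autovacuum = slot_has_caught_up("0/3000180", wal)

# An autovacuum worker writes WAL after the slot was released.
wal.append(WalRecord("0/3000200", "autovacuum worker", True))
after_autovacuum = slot_has_caught_up("0/3000180", wal)

print(before_autovacuum, after_autovacuum)
```

In this model the slot is caught up until the autovacuum record appears. Nothing can consume that record, because the subscription is already disabled.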

# How to reproduce

I made a small patch to reproduce the failure; please see reproduce.txt. It contains
changes that launch autovacuum workers very frequently and ensure that they do actual
work. After applying it, I could reproduce the same failure every time.

# How to fix

I think it is sufficient to fix only the test code.
The easiest way is to disable autovacuum on the old publisher. PSA the patch file.
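
As an illustration only (the attached patch changes the test script and is not reproduced here), the same effect can be sketched by appending `autovacuum = off` to the old cluster's postgresql.conf before it is started. The helper name `disable_autovacuum` and the temporary directory layout are assumptions for this example.

```python
import tempfile
from pathlib import Path


def disable_autovacuum(pgdata: Path) -> None:
    """Append 'autovacuum = off' to postgresql.conf in the given data directory.

    When a parameter appears more than once in the file, the last setting
    takes effect, so appending overrides any earlier 'autovacuum = on'.
    """
    conf = pgdata / "postgresql.conf"
    with conf.open("a", encoding="utf-8") as f:
        f.write("\n# Avoid WAL generated by autovacuum before pg_upgrade\n")
        f.write("autovacuum = off\n")


# Demo against a throwaway directory standing in for the old publisher's PGDATA.
with tempfile.TemporaryDirectory() as tmp:
    pgdata = Path(tmp)
    (pgdata / "postgresql.conf").write_text("autovacuum = on\n", encoding="utf-8")
    disable_autovacuum(pgdata)
    settings = [
        line.strip()
        for line in (pgdata / "postgresql.conf").read_text(encoding="utf-8").splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]

print(settings[-1])
```

The setting has to be in place before the old publisher starts (or be followed by a configuration reload), so that no worker is launched between disabling the subscription and running pg_upgrade.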

What do you think?

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2023-11-27%2020%3A52%3A10
[2]:
```
...
Checking for contrib/isn with bigint-passing mismatch ok
Checking for valid logical replication slots fatal

Your installation contains logical replication slots that can't be upgraded.
You can remove invalid slots and/or consume the pending WAL for other slots,
and then restart the upgrade.
A list of the problematic slots is in the file:
/home/bf/bf-build/skink-master/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/t_003_logical_slots_newpub_data/pgdata/pg_upgrade_output.d/20231127T220024.480/invalid_logical_slots.txt
Failure, exiting
[22:01:20.362](86.645s) not ok 10 - run of pg_upgrade of old cluster
...
```
[3]:
```
...
2023-11-27 22:00:23.546 UTC [3567962][walsender][4/0:0] LOG: released logical replication slot "regress_sub"
2023-11-27 22:00:23.549 UTC [3559042][postmaster][:0] LOG: received fast shutdown request
2023-11-27 22:00:23.552 UTC [3559042][postmaster][:0] LOG: aborting any active transactions
*2023-11-27 22:00:23.663 UTC [3568793][autovacuum worker][5/3:738] FATAL: terminating autovacuum process due to administrator command*
2023-11-27 22:00:23.775 UTC [3559042][postmaster][:0] LOG: background worker "logical replication launcher" (PID 3560674) exited with exit code 1
...
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachment                  Content-Type                Size
disable_autovacuum.patch    application/octet-stream    578 bytes
reproduce.txt               text/plain                  2.9 KB
