Re: Introduce XID age and inactive timeout based replication slot invalidation

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Date: 2024-03-15 05:14:55
Message-ID: CALj2ACVtLt0rPnr4NS6c72DdtQmpyQA=qcMaVfn6vu7j5yQvzw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Mar 13, 2024 at 9:38 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> BTW, is XID the based parameter 'max_slot_xid_age' not have similarity
> with 'max_slot_wal_keep_size'? I think it will impact the rows we
> removed based on xid horizons. Don't we need to consider it while
> vacuum computing the xid horizons in ComputeXidHorizons() similar to
> what we do for WAL w.r.t 'max_slot_wal_keep_size'?

I'm having a hard time understanding why we'd need something up there
in ComputeXidHorizons(). Can you elaborate it a bit please?

What's proposed with max_slot_xid_age is that during checkpoint we
look at slot's xmin and catalog_xmin, and the current system txn id.
Then, if the XID age of (xmin, catalog_xmin) and current_xid crosses
max_slot_xid_age, we invalidate the slot. Let me illustrate how all
this works:

1. Setup a primary and standby with hot_standby_feedback set to on on
standby. For instance, check my scripts at [1].

2. Stop the standby to make the slot inactive on the primary. Check
the slot is holding xmin of 738.
./pg_ctl -D sbdata -l logfilesbdata stop

postgres=# SELECT * FROM pg_replication_slots;
-[ RECORD 1 ]-------+-------------
slot_name | sb_repl_slot
plugin |
slot_type | physical
datoid |
database |
temporary | f
active | f
active_pid |
xmin | 738
catalog_xmin |
restart_lsn | 0/3000000
confirmed_flush_lsn |
wal_status | reserved
safe_wal_size |
two_phase | f
conflict_reason |
failover | f
synced | f

3. Start consuming the XIDs on the primary with the following script
for instance
./psql -d postgres -p 5432
DROP TABLE tab_int;
CREATE TABLE tab_int (a int);

do $$
begin
for i in 1..268435 loop
-- use an exception block so that each iteration eats an XID
begin
insert into tab_int values (i);
exception
when division_by_zero then null;
end;
end loop;
end$$;

4. Make some dead rows in the table.
update tab_int set a = a+1;
delete from tab_int where a%4=0;

postgres=# SELECT n_dead_tup, n_tup_ins, n_tup_upd, n_tup_del FROM
pg_stat_user_tables WHERE relname = 'tab_int';
-[ RECORD 1 ]------
n_dead_tup | 335544
n_tup_ins | 268435
n_tup_upd | 268435
n_tup_del | 67109

5. Try vacuuming to delete the dead rows, observe 'tuples: 0 removed,
536870 remain, 335544 are dead but not yet removable'. The dead rows
can't be removed because the inactive slot is holding an xmin, see
'removable cutoff: 738, which was 268441 XIDs old when operation
ended'.

postgres=# vacuum verbose tab_int;
INFO: vacuuming "postgres.public.tab_int"
INFO: finished vacuuming "postgres.public.tab_int": index scans: 0
pages: 0 removed, 2376 remain, 2376 scanned (100.00% of total)
tuples: 0 removed, 536870 remain, 335544 are dead but not yet removable
removable cutoff: 738, which was 268441 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead
item identifiers removed
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
buffer usage: 4759 hits, 0 misses, 0 dirtied
WAL usage: 0 records, 0 full page images, 0 bytes
system usage: CPU: user: 0.07 s, system: 0.00 s, elapsed: 0.07 s
VACUUM

6. Now, repeat the above steps but with setting max_slot_xid_age =
200000 on the primary.

7. Do a checkpoint to invalidate the slot.
postgres=# checkpoint;
CHECKPOINT
postgres=# SELECT * FROM pg_replication_slots;
-[ RECORD 1 ]-------+-------------
slot_name | sb_repl_slot
plugin |
slot_type | physical
datoid |
database |
temporary | f
active | f
active_pid |
xmin | 738
catalog_xmin |
restart_lsn | 0/3000000
confirmed_flush_lsn |
wal_status | lost
safe_wal_size |
two_phase | f
conflicting |
failover | f
synced | f
invalidation_reason | xid_aged

8. And, then vacuum the table, observe 'tuples: 335544 removed, 201326
remain, 0 are dead but not yet removable'.

postgres=# vacuum verbose tab_int;
INFO: vacuuming "postgres.public.tab_int"
INFO: finished vacuuming "postgres.public.tab_int": index scans: 0
pages: 0 removed, 2376 remain, 2376 scanned (100.00% of total)
tuples: 335544 removed, 201326 remain, 0 are dead but not yet removable
removable cutoff: 269179, which was 0 XIDs old when operation ended
new relfrozenxid: 269179, which is 268441 XIDs ahead of previous value
frozen: 1189 pages from table (50.04% of total) had 201326 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead
item identifiers removed
avg read rate: 0.000 MB/s, avg write rate: 193.100 MB/s
buffer usage: 4760 hits, 0 misses, 2381 dirtied
WAL usage: 5942 records, 2378 full page images, 8343275 bytes
system usage: CPU: user: 0.09 s, system: 0.00 s, elapsed: 0.09 s
VACUUM

[1]
cd /home/ubuntu/postgres/pg17/bin
./pg_ctl -D db17 -l logfile17 stop
rm -rf db17 logfile17
rm -rf /home/ubuntu/postgres/pg17/bin/archived_wal
mkdir /home/ubuntu/postgres/pg17/bin/archived_wal

./initdb -D db17
echo "archive_mode = on
archive_command='cp %p
/home/ubuntu/postgres/pg17/bin/archived_wal/%f'" | tee -a
db17/postgresql.conf

./pg_ctl -D db17 -l logfile17 start
./psql -d postgres -p 5432 -c "SELECT
pg_create_physical_replication_slot('sb_repl_slot', true, false);"

rm -rf sbdata logfilesbdata
./pg_basebackup -D sbdata
echo "port=5433
primary_conninfo='host=localhost port=5432 dbname=postgres user=ubuntu'
primary_slot_name='sb_repl_slot'
restore_command='cp /home/ubuntu/postgres/pg17/bin/archived_wal/%f %p'
hot_standby_feedback = on" | tee -a sbdata/postgresql.conf

touch sbdata/standby.signal

./pg_ctl -D sbdata -l logfilesbdata start
./psql -d postgres -p 5433 -c "SELECT pg_is_in_recovery();"

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Steele 2024-03-15 05:23:15 Re: Add basic tests for the low-level backup method.
Previous Message Andrei Lepikhov 2024-03-15 05:05:02 Re: POC, WIP: OR-clause support for indexes