logical replication - possible remaining problem

From: Erik Rijkers <er(at)xs4all(dot)nl>
To: pgsql-hackers(at)postgresql(dot)org
Subject: logical replication - possible remaining problem
Date: 2017-06-07 20:49:25
Message-ID: 3d9f3a58b9dd3cd9783f66dd19d1c309@xs4all.nl
Lists: pgsql-hackers

I am not sure whether what I found here amounts to a bug; I might be
doing something dumb.

During the last few months I did tests by running pgbench over logical
replication. Earlier emails have details.

The basic form of that now works well (and the fix has been committed),
but as I looked over my testing program I noticed one change I had made
to it many weeks ago:

In the cleanup during startup (pre-flight check you might say) and also
before the end, instead of

  echo "delete from pg_subscription;" | psql -qXp $port2   # (1)

I changed that (as I say, many weeks ago) to:

  echo "delete from pg_subscription;
  delete from pg_subscription_rel;
  delete from pg_replication_origin; " | psql -qXp $port2   # (2)

This occurs twice inside the bash function clean_pubsub(), in the main
test script pgbench_detail2.sh.
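
For readers without the script at hand, the two variants can be sketched
roughly as below. This is only an illustration: the real clean_pubsub()
is not shown in this mail, and the helper cleanup_sql() and the variant
argument are hypothetical names introduced here.

```shell
# Hypothetical sketch; not the actual code from pgbench_detail2.sh.

# Print the cleanup SQL: variant 1 = form (1), variant 2 = form (2).
cleanup_sql() {
  local variant=$1
  echo "delete from pg_subscription;"
  if [ "$variant" = "2" ]; then
    echo "delete from pg_subscription_rel;"
    echo "delete from pg_replication_origin;"
  fi
}

# Run the chosen cleanup on the replica listening on the given port.
# Defaults to form (2), the one used in the recent test runs.
clean_pubsub() {
  local port=$1 variant=${2:-2}
  cleanup_sql "$variant" | psql -qXp "$port"
}
```

With this layout, `clean_pubsub $port2 1` would correspond to form (1)
and `clean_pubsub $port2` to form (2).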

This change was an attempt to always arrive at the same 'clean' start
(and end) state.

All my more recent testing (and that of Mark, I have to assume) was thus
done with (2).

Now, looking at the script again, I am thinking that it would be
reasonable to expect that after issuing

  delete from pg_subscription;

the other 2 tables are /also/ cleaned, automatically, as a consequence.
(Is this reasonable? This is really the main question of this email.)

So I removed the latter two delete statements again, and ran the tests
again with the form in (1).

I have established that, after a number of successful cycles, the test
stops succeeding, and the replica log shows repetitions of:

2017-06-07 22:10:29.057 CEST [2421] LOG: logical replication apply
worker for subscription "sub1" has started
2017-06-07 22:10:29.057 CEST [2421] ERROR: could not find free
replication state slot for replication origin with OID 11
2017-06-07 22:10:29.057 CEST [2421] HINT: Increase
max_replication_slots and try again.
2017-06-07 22:10:29.058 CEST [2061] LOG: worker process: logical
replication worker for subscription 29235 (PID 2421) exited with exit
code 1

When I manually 'clean up' by doing:

  delete from pg_replication_origin;

then, and only then, does the session finish and succeed ('replica ok').

So to me it looks as if there is an omission of
pg_replication_origin cleanup when rows are deleted from pg_subscription.
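
One way to check this on the replica would be to compare the catalogs
right after a form (1) cleanup, before deleting anything by hand. A
minimal sketch, assuming $port2 is the replica port as in the commands
above; the helper name origin_check_sql() is hypothetical:

```shell
# Hypothetical helper, not part of pgbench_detail2.sh.
# Prints queries that list the subscriptions, the replication origins
# still present, their tracked state, and the configured slot limit.
origin_check_sql() {
  cat <<'EOF'
select subname from pg_subscription;
select roident, roname from pg_replication_origin;
select local_id, external_id from pg_replication_origin_status;
show max_replication_slots;
EOF
}

# Usage (assumes $port2 is the replica port):
#   origin_check_sql | psql -qXp "$port2"
```

If pg_subscription comes back empty while pg_replication_origin still
lists rows, that would be consistent with origins being left behind.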

Does that make sense? All this is probably vague, and I am only posting
in the hope that Petr (or someone else) perhaps immediately understands
what goes wrong, even with this limited amount of info.

In the meantime I will try to dig up more detailed info...

thanks,

Erik Rijkers
