Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept

From: Ivan Kartyshov <i(dot)kartyshov(at)postgrespro(dot)ru>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
Date: 2018-08-09 20:25:59
Message-ID: 4ae82b354af9a20624e1e1d05932087e@postgrespro.ru
Lists: pgsql-hackers

Hello, thank you for your review.

Alexander Korotkov wrote on 2018-06-20 20:54:
> Thinking about that more, I found that adding a vacuum mark as an extra
> argument to LockAcquireExtended is also wrong. It would still be hard
> to determine whether we should log the lock in LogStandbySnapshot().
> We'd have to add that flag to the shared memory table of locks. And that
> looks like a kludge.
>
> Therefore, I'd like to propose another approach: introduce a new lock
> level. So, AccessExclusiveLock will be split into two:
> AccessExclusiveLocalLock and AccessExclusiveLock. Unlike
> AccessExclusiveLock, AccessExclusiveLocalLock will not be logged to the
> standby, and will be used for heap truncation.
>
> I expect some resistance to my proposal, because fixing this "small
> bug" doesn't deserve a new lock level. But the current behavior of
> hot_standby_feedback is buggy. From the user's perspective, the
> hot_standby_feedback option simply doesn't work: it causes the master to
> bloat without protecting standby queries from cancellation. We need to
> fix that, but I don't see a proper way to do it other than adding a
> new lock level...

Your proposal is very interesting; it made the patch smaller and cleaner.
So I have updated the patch and made changes in accordance with the
proposed concept of a special AccessExclusiveLocalLock.

> I don't yet understand what the problem is here and why this patch
> should solve it. As I understand the idea of this patch, GetOldestXmin()
> will initialize the xmin of the walsender with the lowest xmin of the
> replication slots. But the xmin of replication slots will be taken into
> account anyway by GetSnapshotData() called by vacuum.
>
>> How to test:
>> Make an async replica, turn on feedback, reduce
>> max_standby_streaming_delay, and make autovacuum aggressive.
>
> You forgot to mention that one should set up the replication using a
> replication slot. Naturally, if a replication slot doesn't exist, then
> the master shouldn't keep dead tuples for a disconnected standby,
> because the master doesn't know whether the standby will reconnect or
> is gone forever.

I tried to make hot_standby_feedback work without replication slots,
and I found a small window in which vacuum on the master can truncate
heap pages that the standby needs for a just-started transaction.

How to test:
Make an async replica, turn on feedback, and reduce
max_standby_streaming_delay.
Make autovacuum more aggressive:
autovacuum = on
autovacuum_max_workers = 1
autovacuum_naptime = 1s
autovacuum_vacuum_threshold = 1
autovacuum_vacuum_cost_delay = 0
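
Both tests below drive updates on the master with a one-statement pgbench
script. Judging by the pgbench command lines and the parenthetical notes
in the test recipes, master.sql contains just:

update test set value = value;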

Test1:
Here we put a load on the master and simulate a long transaction with
repeated 1-second seqscans on the replica (by calling pg_sleep with a
1-second duration every 6 seconds).
MASTER                                            REPLICA
                                                  hot_standby = on
                                                  max_standby_streaming_delay = 1s
                                                  hot_standby_feedback = on
start
CREATE TABLE test AS (SELECT id, 1 AS value
FROM generate_series(1,1) id);
pgbench -T600 -P2 -n --file=master.sql postgres
(update test set value = value;)
                                                  start
                                                  BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
                                                  SELECT pg_sleep(value) FROM test;
                                                  \watch 6

---Autovacuum truncates pages at the end
Result on replica:
FATAL: terminating connection due to conflict with recovery
DETAIL: User was holding a relation lock for too long.

With the patched version, lazy_vacuum_truncation passes without fatal errors.

Only occasionally does an error occur, because this test is too synthetic:
ERROR: canceling statement due to conflict with recovery
DETAIL: User was holding shared buffer pin for too long.
It is caused by ResolveRecoveryConflictWithSnapshot being raised while
redoing some visibility flags; to avoid this conflict we can run Test2
instead or increase max_standby_streaming_delay.
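
As a side note (not part of the original recipe), one way to confirm that
autovacuum really truncates the heap during the test is to watch the table
size on the master from a separate psql session; the 2-second interval is
arbitrary:

-- on the master, while pgbench and the replica query are running
SELECT pg_relation_size('test');
\watch 2

The size grows while pgbench produces dead tuples and drops back whenever
autovacuum truncates the empty tail of the relation.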

Test2:
Here we put a load on the master and simulate a long transaction on the
replica (by taking a LOCK on the table).
MASTER                                            REPLICA
                                                  hot_standby = on
                                                  max_standby_streaming_delay = 1s
                                                  hot_standby_feedback = on
start
CREATE TABLE test AS (SELECT id, 1 AS value
FROM generate_series(1,1) id);
pgbench -T600 -P2 -n --file=master.sql postgres
(update test set value = value;)
                                                  start
                                                  BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
                                                  LOCK TABLE test IN ACCESS SHARE MODE;
                                                  select * from test;
                                                  \watch 6

---Autovacuum truncates pages at the end
Result on replica:
FATAL: terminating connection due to conflict with recovery
DETAIL: User was holding a relation lock for too long.

With the patched version, lazy_vacuum_truncation passes without fatal errors.
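
As an illustration (again, not part of the recipe): on an unpatched standby
the startup process replays the AccessExclusiveLock taken for the truncation,
and it collides with the session's ACCESS SHARE lock; with the patch,
truncation takes AccessExclusiveLocalLock, which is not WAL-logged, so no
such lock arrives on the standby. If you catch the moment before conflict
resolution kills the session, the locks can be seen from a second session
on the replica:

-- while the Test2 transaction still holds ACCESS SHARE on "test"
SELECT pid, locktype, mode, granted
FROM pg_locks
WHERE relation = 'test'::regclass;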

I would like to hear your opinion on this implementation.

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment Content-Type Size
hsfeedback_locallock.patch text/x-diff 20.3 KB
