Re: Logical replication existing data copy

From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-02-28 19:36:44
Message-ID: 9f9b45f3de725a1a2f3e19090f1c3e1a@xs4all.nl
Lists: pgsql-hackers

On 2017-02-28 07:38, Erik Rijkers wrote:
> On 2017-02-27 15:08, Petr Jelinek wrote:

> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
> +
> 0002-Fix-after-trigger-execution-in-logical-replication.patch
> +
> 0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
> +
> snapbuild-v4-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
> +
> snapbuild-v4-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
> +
> snapbuild-v4-0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch
> +
> snapbuild-v4-0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
> +
> snapbuild-v4-0005-Skip-unnecessary-snapshot-builds.patch
> +
> 0001-Logical-replication-support-for-initial-data-copy-v6.patch

This is the most frequent error that happens during pgbench runs over
logical replication. I run the test continuously all day, and every few
hours an error of the kind shown below occurs: a table (mostly
pgbench_history) ends up one row short (673466 instead of 673467). The
script waits a long time before calling it an error, because in theory
the replica could still catch up and finish successfully; in practice
that has not happened yet once the system got into this state.

-- pgbench -c 16 -j 8 -T 120 -P 24 -n -M simple -- scale 25
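
This is roughly what the check loop in my script does (a simplified
sketch with hypothetical names; the real script also compares per-table
md5 checksums, hence the 'getting md5' lines below):

  # compare master (6972) and replica (6973) row counts of pgbench_history,
  # re-checking every 5 s so a slow replica can still catch up
  for i in $(seq 1 20); do
    cnt_m=$(psql -qtAX -p 6972 -d testdb -c 'select count(*) from pgbench_history')
    cnt_r=$(psql -qtAX -p 6973 -d testdb -c 'select count(*) from pgbench_history')
    test "$cnt_m" = "$cnt_r" && { echo "replica OK ($cnt_r rows)"; break; }
    echo "-- wait another 5 s (master $cnt_m, replica $cnt_r)"
    sleep 5
  done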

[...]
6972 a,b,t,h: 2500000 25 250 673467 e53236c09 643235708 f952814c3 559d618cd master
6973 a,b,t,h: 2500000 25 250 673466 e53236c09 643235708 f952814c3 4b09337e3 replica NOK a22fb00a6
-- wait another 5 s (total 20 s) (unchanged 1)
-- getting md5 (cb)
6972 a,b,t,h: 2500000 25 250 673467 e53236c09 643235708 f952814c3 559d618cd master
6973 a,b,t,h: 2500000 25 250 673466 e53236c09 643235708 f952814c3 4b09337e3 replica NOK a22fb00a6
-- wait another 5 s (total 25 s) (unchanged 2)
-- getting md5 (cb)
6972 a,b,t,h: 2500000 25 250 673467 e53236c09 643235708 f952814c3 559d618cd master
6973 a,b,t,h: 2500000 25 250 673466 e53236c09 643235708 f952814c3 4b09337e3 replica NOK a22fb00a6
-- wait another 5 s (total 30 s) (unchanged 3)
-- getting md5 (cb)
6972 a,b,t,h: 2500000 25 250 673467 e53236c09 643235708 f952814c3 559d618cd master
6973 a,b,t,h: 2500000 25 250 673466 e53236c09 643235708 f952814c3 4b09337e3 replica NOK a22fb00a6
-- wait another 5 s (total 35 s) (unchanged 4)

I gathered some info in this (probably deadlocked) state, in the hope
that there is something suspicious in there:
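
(The process listing below was gathered with something like the
following ps invocation; the exact command is from memory:

  ps f -u rijkers -o user,pid,ppid,c,stime,tty,stat,time,cmd
)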

UID       PID   PPID  C STIME TTY    STAT TIME CMD
rijkers  71203      1 0 20:06 pts/57 S    0:00 postgres -D /var/data1/pg_stuff/pg_installations/pgsql.logical_replication2/data -p 6973 -c wal_level=replica [...]
rijkers  71214  71203 0 20:06 ?      Ss   0:00  \_ postgres: logger process
rijkers  71216  71203 0 20:06 ?      Ss   0:00  \_ postgres: checkpointer process
rijkers  71217  71203 0 20:06 ?      Ss   0:00  \_ postgres: writer process
rijkers  71218  71203 0 20:06 ?      Ss   0:00  \_ postgres: wal writer process
rijkers  71219  71203 0 20:06 ?      Ss   0:00  \_ postgres: autovacuum launcher process
rijkers  71220  71203 0 20:06 ?      Ss   0:00  \_ postgres: stats collector process
rijkers  71221  71203 0 20:06 ?      Ss   0:00  \_ postgres: bgworker: logical replication launcher
rijkers  71222  71203 0 20:06 ?      Ss   0:00  \_ postgres: bgworker: logical replication worker 30042

rijkers  71201      1 0 20:06 pts/57 S    0:00 postgres -D /var/data1/pg_stuff/pg_installations/pgsql.logical_replication/data -p 6972 -c wal_level=logical [...]
rijkers  71206  71201 0 20:06 ?      Ss   0:00  \_ postgres: logger process
rijkers  71208  71201 0 20:06 ?      Ss   0:00  \_ postgres: checkpointer process
rijkers  71209  71201 0 20:06 ?      Ss   0:00  \_ postgres: writer process
rijkers  71210  71201 0 20:06 ?      Ss   0:00  \_ postgres: wal writer process
rijkers  71211  71201 0 20:06 ?      Ss   0:00  \_ postgres: autovacuum launcher process
rijkers  71212  71201 0 20:06 ?      Ss   0:00  \_ postgres: stats collector process
rijkers  71213  71201 0 20:06 ?      Ss   0:00  \_ postgres: bgworker: logical replication launcher
rijkers  71223  71201 0 20:06 ?      Ss   0:00  \_ postgres: wal sender process rijkers [local] idle

-- replica:
 port | shared_buffers | work_mem | maintenance_work_mem | effective_cache_size
------+----------------+----------+----------------------+----------------------
 6973 | 100MB          | 50MB     | 2GB                  | 64GB
(1 row)

select current_setting('port') as port
     , datname as db
     , to_char(pg_database_size(datname), '9G999G999G999G999')
       || ' (' || pg_size_pretty(pg_database_size(datname)) || ')' as dbsize
     , pid
     , application_name as app
     , xact_start
     , query_start
     , regexp_replace(cast(now() - query_start as text), E'\\.[[:digit:]]+$', '') as duration
     , state
     , wait_event_type as wt_evt_type
     , wait_event as wt_evt
from pg_stat_activity
where query !~ 'pg_stat_activity'
-- and query != '<IDLE>'
;
 port |   db   |        dbsize        |  pid  |               app                | xact_start | query_start | duration | state | wt_evt_type |      wt_evt
------+--------+----------------------+-------+----------------------------------+------------+-------------+----------+-------+-------------+------------------
 6973 | testdb | 817,243,955 (779 MB) | 71222 | logical replication worker 30042 |            |             |          | idle  | Activity    | LogicalApplyMain
(1 row)

select * from pg_locks;
  locktype  | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |      mode       | granted | fastpath
------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+-----------------+---------+----------
 relation   |    16384 |    11547 |      |       |            |               |         |       |          | 4/568              | 77713 | AccessShareLock | t       | t
 virtualxid |          |          |      |       | 4/568      |               |         |       |          | 4/568              | 77713 | ExclusiveLock   | t       | t
(2 rows)

select * from pg_stat_subscription;
 subid | subname |  pid  | relid | received_lsn |      last_msg_send_time       |     last_msg_receipt_time     | latest_end_lsn |        latest_end_time
-------+---------+-------+-------+--------------+-------------------------------+-------------------------------+----------------+-------------------------------
 30042 | sub1    | 71222 |       | 29/6C91C980  | 2017-02-28 20:11:50.818774+01 | 2017-02-28 20:11:50.818845+01 | 29/6C91C980    | 2017-02-28 20:11:50.818774+01
(1 row)

select * from pg_subscription;
 subdbid | subname | subowner | subenabled | subconninfo | subslotname | subpublications
---------+---------+----------+------------+-------------+-------------+-----------------
   16384 | sub1    |       10 | t          | port=6972   | sub1        | {pub1}
(1 row)

select * from pg_subscription_rel;
srsubid | srrelid | srsubstate | srsublsn
---------+---------+------------+----------
30042 | 30022 | r |
30042 | 30019 | r |
30042 | 30030 | r |
30042 | 30025 | r |
(4 rows)

-- master
\c - - - 6972
select pid, usesysid /* , usename */ , application_name
     /* , client_addr */ /* , client_hostname */ /* , client_port */
     , backend_start, backend_xmin, state
     , sent_location, write_location, flush_location, replay_location
     , sync_priority, sync_state
from pg_stat_replication;
  pid  | usesysid | application_name |        backend_start        | backend_xmin |   state   | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
-------+----------+------------------+-----------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
 71223 |       10 | sub1             | 2017-02-28 20:06:50.3357+01 |              | streaming | 29/88BC9BB0   | 29/88BC9BB0    | 29/88BC9BB0    | 29/88BC9BB0     |             0 | async
(1 row)

Erik Rijkers
