Re: snapbuild woes

From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>
Subject: Re: snapbuild woes
Date: 2017-05-09 17:54:37
Message-ID: 84a48a73304ed5363433ec4b9fbb4438@xs4all.nl
Lists: pgsql-hackers

On 2017-05-09 11:50, Petr Jelinek wrote:

> I rebased the above mentioned patch to apply to the patches Andres sent,
> if you could try to add it on top of what you have and check if it still
> fails, that would be helpful.

It still fails.

With these patches

- 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch +
- 00002-WIP-Possibly-more-robust-snapbuild-approach.patch +
- fix-statistics-reporting-in-logical-replication-work.patch +
- Skip-unnecessary-snapshot-builds.patch

built again on top of 44c528810a1 (I had to add the
'fix-statistics-rep*' patch because without it I immediately got that
assertion failure again).

As always, most runs succeed (especially on this large 192GB, 16-core
server).

But attached is an output file of a number of runs of my
pgbench_derail2.sh test.

Overall result:

-- out_20170509_1635.txt
3 -- pgbench -c 64 -j 8 -T 900 -P 180 -n -- scale 25
2 -- All is well.
1 -- Not good, but breaking out of wait (21 times no change)

I broke it off after iteration 4, so iteration 5 never ran, and
iteration 1 failed due to a mistake in the harness (something stupid I
did) - not interesting.

iteration 2 succeeds. (eventually has 'replica ok')

iteration 3 succeeds. (eventually has 'replica ok')

iteration 4 fails.
Just after 'alter subscription sub1 enable' I caught (as is usual)
pg_stat_replication.state as 'catchup'. So far so good.
After the 15-minute pgbench run, pg_stat_replication has only 2
'startup' lines (and no 'catchup' or 'streaming' ones):

 port | pg_stat_replication |  pid   |     wal     | replay_loc | diff | ?column? |  state  |   app   | sync_state
------+---------------------+--------+-------------+------------+------+----------+---------+---------+------------
 6972 | pg_stat_replication | 108349 | 19/8FBCC248 |            |      |          | startup | derail2 | async
 6972 | pg_stat_replication | 108351 | 19/8FBCC248 |            |      |          | startup | derail2 | async

(that's from:
  select $port1 as port, 'pg_stat_replication' as pg_stat_replication, pid
       , pg_current_wal_location() wal, replay_location replay_loc
       , pg_current_wal_location() - replay_location as diff
       , pg_current_wal_location() <= replay_location
       , state, application_name as app, sync_state
    from pg_stat_replication
)

This remains in this state for as long as my test program lets it
(i.e., 20 x 30 s, or something like that, after which the loop is exited);
in the output file it says: 'Not good, but breaking out of wait'
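
(For clarity, that bounded wait amounts to something like the sketch
below; the real pgbench_derail2.sh checks more than just the state
column, and the exact psql invocation here is only illustrative:

  # sketch: poll pg_stat_replication on the primary, give up after 20 x 30 s
  tries=0
  state=
  while [ $tries -lt 20 ]; do
    state=$(psql -qtAX -p $port1 -c "select state from pg_stat_replication limit 1")
    [ "$state" = "streaming" ] && break
    sleep 30
    tries=$((tries + 1))
  done
  if [ "$state" != "streaming" ]; then
    echo "Not good, but breaking out of wait"
  fi
)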

Below is the accompanying ps output (with the 2 'deranged senders', as
Jeff Janes would surely call them):

UID       PID     PPID    C  STIME  TTY     STAT  TIME  CMD
rijkers   107147  1       0  17:11  pts/35  S+    0:00  /var/data1/pg_stuff/pg_installations/pgsql.logical_replication2/bin/postgres -D /var/data1/pg_stuff/pg_installations
rijkers   107149  107147  0  17:11  ?       Ss    0:00   \_ postgres: logger process
rijkers   107299  107147  0  17:11  ?       Ss    0:01   \_ postgres: checkpointer process
rijkers   107300  107147  0  17:11  ?       Ss    0:00   \_ postgres: writer process
rijkers   107301  107147  0  17:11  ?       Ss    0:00   \_ postgres: wal writer process
rijkers   107302  107147  0  17:11  ?       Ss    0:00   \_ postgres: autovacuum launcher process
rijkers   107303  107147  0  17:11  ?       Ss    0:00   \_ postgres: stats collector process
rijkers   107304  107147  0  17:11  ?       Ss    0:00   \_ postgres: bgworker: logical replication launcher
rijkers   108348  107147  0  17:12  ?       Ss    0:01   \_ postgres: bgworker: logical replication worker for subscription 70310 sync 70293
rijkers   108350  107147  0  17:12  ?       Ss    0:00   \_ postgres: bgworker: logical replication worker for subscription 70310 sync 70298
rijkers   107145  1       0  17:11  pts/35  S+    0:02  /var/data1/pg_stuff/pg_installations/pgsql.logical_replication/bin/postgres -D /var/data1/pg_stuff/pg_installations
rijkers   107151  107145  0  17:11  ?       Ss    0:00   \_ postgres: logger process
rijkers   107160  107145  0  17:11  ?       Ss    0:08   \_ postgres: checkpointer process
rijkers   107161  107145  0  17:11  ?       Ss    0:07   \_ postgres: writer process
rijkers   107162  107145  0  17:11  ?       Ss    0:02   \_ postgres: wal writer process
rijkers   107163  107145  0  17:11  ?       Ss    0:00   \_ postgres: autovacuum launcher process
rijkers   107164  107145  0  17:11  ?       Ss    0:02   \_ postgres: stats collector process
rijkers   107165  107145  0  17:11  ?       Ss    0:00   \_ postgres: bgworker: logical replication launcher
rijkers   108349  107145  0  17:12  ?       Ss    0:27   \_ postgres: wal sender process rijkers [local] idle
rijkers   108351  107145  0  17:12  ?       Ss    0:26   \_ postgres: wal sender process rijkers [local] idle

I have had no time to add (or view) any CPU info.

Erik Rijkers

Attachment Content-Type Size
out_20170509_1635.txt application/x-elc 15.7 KB
