Re: snapbuild woes

From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>
Subject: Re: snapbuild woes
Date: 2017-05-09 19:00:05
Message-ID: 7d3b9bb1-41a9-1f9a-43ca-c4bb271333a1@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 09/05/17 19:54, Erik Rijkers wrote:
> On 2017-05-09 11:50, Petr Jelinek wrote:
>
>> I rebased the above mentioned patch to apply to the patches Andres sent,
>> if you could try to add it on top of what you have and check if it still
>> fails, that would be helpful.
>
> It still fails.
>
> With these patches
>
> - 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+
> - 00002-WIP-Possibly-more-robust-snapbuild-approach.patch +
> - fix-statistics-reporting-in-logical-replication-work.patch +
> - Skip-unnecessary-snapshot-builds.patch
>
> built again on top of 44c528810a1 ( so I had to add the
> 'fix-statistics-rep*' patch because without it I immediately got that
> Assertion failure again ).
>
> As always most runs succeed (especially on this large 192GB 16-core
> server).
>
> But attached is an output file of a number of runs of my
> pgbench_derail2.sh test.
>
> Overal result:
>
> -- out_20170509_1635.txt
> 3 -- pgbench -c 64 -j 8 -T 900 -P 180 -n -- scale 25
> 2 -- All is well.
> 1 -- Not good, but breaking out of wait (21 times no change)
>
> I broke it off after iteration 4, so 5 never ran, and
> iteration 1 failed due to a mistake in the harness (somethind stupid I
> did) - not interesting.
>
> iteration 2 succeeds. (eventually has 'replica ok')
>
> iteration 3 succeeds. (eventually has 'replica ok')
>
> iteration 4 fails.
> Just after 'alter subscription sub1 enable' I caught (as is usual)
> pg_stat_replication.state as 'catchup'. So far so good.
> After the 15-minute pgbench run pg_stat_replication has only 2
> 'startup' lines (and none 'catchup' or 'streaming'):
>
> port | pg_stat_replication | pid | wal | replay_loc | diff |
> ?column? | state | app | sync_state
> 6972 | pg_stat_replication | 108349 | 19/8FBCC248 | | |
> | startup | derail2 | async
> 6972 | pg_stat_replication | 108351 | 19/8FBCC248 | | |
> | startup | derail2 | async
>
> (that's from:
> select $port1 as port,'pg_stat_replication' as pg_stat_replication, pid
> , pg_current_wal_location() wal, replay_location replay_loc,
> pg_current_wal_location() - replay_location as diff
> , pg_current_wal_location() <= replay_location
> , state, application_name as app, sync_state
> from pg_stat_replication
> )
>
> This remains in this state for as long as my test-programs lets it
> (i.e., 20 x 30s, or something like that, and then the loop is exited);
> in the ouput file it says: 'Not good, but breaking out of wait'
>
> Below is the accompanying ps (with the 2 'deranged senders' as Jeff
> Janes would surely call them):
>
>
> UID PID PPID C STIME TTY STAT TIME CMD
> rijkers 107147 1 0 17:11 pts/35 S+ 0:00
> /var/data1/pg_stuff/pg_installations/pgsql.logical_replication2/bin/postgres
> -D /var/data1/pg_stuff/pg_installations
> rijkers 107149 107147 0 17:11 ? Ss 0:00 \_ postgres:
> logger process
> rijkers 107299 107147 0 17:11 ? Ss 0:01 \_ postgres:
> checkpointer process
> rijkers 107300 107147 0 17:11 ? Ss 0:00 \_ postgres:
> writer process
> rijkers 107301 107147 0 17:11 ? Ss 0:00 \_ postgres: wal
> writer process
> rijkers 107302 107147 0 17:11 ? Ss 0:00 \_ postgres:
> autovacuum launcher process
> rijkers 107303 107147 0 17:11 ? Ss 0:00 \_ postgres: stats
> collector process
> rijkers 107304 107147 0 17:11 ? Ss 0:00 \_ postgres:
> bgworker: logical replication launcher
> rijkers 108348 107147 0 17:12 ? Ss 0:01 \_ postgres:
> bgworker: logical replication worker for subscription 70310 sync 70293
> rijkers 108350 107147 0 17:12 ? Ss 0:00 \_ postgres:
> bgworker: logical replication worker for subscription 70310 sync 70298
> rijkers 107145 1 0 17:11 pts/35 S+ 0:02
> /var/data1/pg_stuff/pg_installations/pgsql.logical_replication/bin/postgres
> -D /var/data1/pg_stuff/pg_installations
> rijkers 107151 107145 0 17:11 ? Ss 0:00 \_ postgres:
> logger process
> rijkers 107160 107145 0 17:11 ? Ss 0:08 \_ postgres:
> checkpointer process
> rijkers 107161 107145 0 17:11 ? Ss 0:07 \_ postgres:
> writer process
> rijkers 107162 107145 0 17:11 ? Ss 0:02 \_ postgres: wal
> writer process
> rijkers 107163 107145 0 17:11 ? Ss 0:00 \_ postgres:
> autovacuum launcher process
> rijkers 107164 107145 0 17:11 ? Ss 0:02 \_ postgres: stats
> collector process
> rijkers 107165 107145 0 17:11 ? Ss 0:00 \_ postgres:
> bgworker: logical replication launcher
> rijkers 108349 107145 0 17:12 ? Ss 0:27 \_ postgres: wal
> sender process rijkers [local] idle
> rijkers 108351 107145 0 17:12 ? Ss 0:26 \_ postgres: wal
> sender process rijkers [local] idle
>
> I have had no time to add (or view) any CPUinfo.
>
>

Ah okay, so this is same issue that's reported by both Masahiko Sawada
[1] and Jeff Janes [2].

[1]
https://www.postgresql.org/message-id/CAD21AoBYpyqTSw%2B%3DES%2BxXtRGMPKh%3DpKiqjNxZKnNUae0pSt9bg%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/flat/CAMkU%3D1xUJKs%3D2etq2K7bmbY51Q7g853HLxJ7qEB2Snog9oRvDw%40mail.gmail.com

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2017-05-09 19:10:03 Re: proposal psql \gdesc
Previous Message Peter Eisentraut 2017-05-09 18:53:05 Re: Logical replication ApplyContext bloat