Re: logical replication - still unstable after all these months

From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
Cc: pgsql-hackers(at)postgresql(dot)org, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: logical replication - still unstable after all these months
Date: 2017-05-27 02:43:05
Message-ID: 9d592410c042fedfa6dc10e19adf7180@xs4all.nl
Lists: pgsql-hackers

On 2017-05-27 01:35, Mark Kirkwood wrote:
> On 26/05/17 20:09, Erik Rijkers wrote:
>>
>> this whole thing 100x
>
> Some questions that might help me get it right:
> - do you think we need to stop and start the instances every time?
> - do we need to init pgbench each time?
> - could we just drop the subscription and publication and truncate the
> replica tables instead?

I have done all that in earlier versions.

I deliberately added these 'complications' in view of the intractability
of the problem: my fear is that an earlier failure leaves some
half-failed state behind in an instance, which then might cause more
failure. This would undermine the intent of the whole exercise (which
is to count success/failure rate). So it is important to be as sure as
possible that each cycle starts out as cleanly as possible.
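
The counting itself is simple. A minimal sketch, assuming one full cycle
is wrapped in a function that returns 0 on success; run_cycle and tally
are hypothetical names, and run_cycle here is only a stand-in that fails
on every third call so that there is something to count:

```shell
#!/bin/bash
# Hypothetical stand-in for one complete test cycle.
# It fails on every third call, purely for illustration.
function run_cycle()
{
  local n=$1
  (( n % 3 != 0 ))
}

# Run the cycle a number of times and report the success/failure tally.
function tally()
{
  local runs=$1 ok=0 nok=0 i
  for (( i = 1; i <= runs; i++ ))
  do
    if run_cycle "$i"
    then
      ok=$(( ok + 1 ))
    else
      nok=$(( nok + 1 ))
    fi
  done
  echo "ok: $ok  NOK: $nok  (of $runs)"
}

tally 6
```

The stand-in would be replaced by whatever performs one real cycle; the
tally is only meaningful if no cycle can inherit state from the previous
one.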

> - what scale pgbench are you running?

I use a small script to call the main script; at the moment it does
something like:
-------------------
duration=60
from=1
to=100
for scale in 25 5
do
  for clients in 90 64 8
  do
    date_str=$(date +"%Y%m%d_%H%M")
    outfile=out_${date_str}.txt
    time for x in `seq $from $to`
    do
      ./pgbench_derail2.sh $scale $clients $duration $date_str
[...]
-------------------

> - how many clients for the 1 min pgbench run?

see above

> - are you starting the pgbench run while the copy_data jobs for the
> subscription are still running?

I assume that by copy_data you mean the initial data sync of the
original tables before pgbench starts.
And yes, I think this might be the origin of the problem.
(I think the problem I get is actually easily avoided by putting wait
states here and there between the separate steps. But the testing idea
here is to force the system into error, not to avoid errors.)
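
For illustration only, such a wait state could look something like the
sketch below. It is not part of the test script. It assumes that the
subscriber's pg_subscription_rel catalog reports srsubstate = 'r' for
every table once the initial sync is done; wait_for_sync and its
arguments are hypothetical names:

```shell
#!/bin/bash
# Sketch: poll the subscriber until no table is still in initial sync.
# Assumption: srsubstate = 'r' in pg_subscription_rel means "ready".
# Arguments (hypothetical): $1 = subscriber port, $2 = max number of polls.
function wait_for_sync()
{
  local port=$1 max_tries=${2:-60} pending i
  for (( i = 0; i < max_tries; i++ ))
  do
    pending=$( echo "select count(*) from pg_subscription_rel
                     where srsubstate <> 'r'" | psql -qtAXp "$port" )
    if [[ "$pending" == "0" ]]
    then
      return 0      # every table is ready
    fi
    sleep 1
  done
  return 1          # gave up: still syncing, or psql failed
}
```

Calling this between creating the subscription and starting pgbench
would presumably hide the failure, which is exactly why the test does
not do it.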

> - how exactly are you calculating those md5's?

Here is the bash function: cb (I forget what that stands for, I guess
'content bench'). $outf is a log file to which the program writes
output:

---------------------------
function cb()
{
  # display the 4 pgbench tables' accumulated content as md5s
  # a,b,t,h stand for: pgbench_accounts, -branches, -tellers, -history
  num_tables=$( echo "select count(*) from pg_class where relkind = 'r'
                      and relname ~ '^pgbench_'" | psql -qtAX )
  if [[ $num_tables -ne 4 ]]
  then
    echo "pgbench tables not 4 - exit" >> $outf
    exit
  fi
  for port in $port1 $port2
  do
    md5_a=$(echo "select * from pgbench_accounts order by aid"|psql -qtAXp $port|md5sum|cut -b 1-9)
    md5_b=$(echo "select * from pgbench_branches order by bid"|psql -qtAXp $port|md5sum|cut -b 1-9)
    md5_t=$(echo "select * from pgbench_tellers order by tid"|psql -qtAXp $port|md5sum|cut -b 1-9)
    md5_h=$(echo "select * from pgbench_history order by hid"|psql -qtAXp $port|md5sum|cut -b 1-9)
    cnt_a=$(echo "select count(*) from pgbench_accounts" |psql -qtAXp $port)
    cnt_b=$(echo "select count(*) from pgbench_branches" |psql -qtAXp $port)
    cnt_t=$(echo "select count(*) from pgbench_tellers" |psql -qtAXp $port)
    cnt_h=$(echo "select count(*) from pgbench_history" |psql -qtAXp $port)
    md5_total[$port]=$( echo "${md5_a} ${md5_b} ${md5_t} ${md5_h}" | md5sum )
    printf "$port a,b,t,h: %8d %6d %6d %6d" $cnt_a $cnt_b $cnt_t $cnt_h
    echo -n " $md5_a $md5_b $md5_t $md5_h"
    if   [[ $port -eq $port1 ]]; then echo " master"
    elif [[ $port -eq $port2 ]]; then echo -n " replica"
    else echo " ERROR "
    fi
  done
  if [[ "${md5_total[$port1]}" == "${md5_total[$port2]}" ]]
  then
    echo " ok"
  else
    echo " NOK"
  fi
}
---------------------------

this enables:

echo "-- getting md5 (cb)"
cb_text1=$(cb)

and testing that string like:

if echo "$cb_text1" | grep -qw 'replica ok';
then
echo "-- All is well."

[...]
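
The comparison idea can be tried without a database. A small sketch of
the same md5-and-truncate step that cb uses, applied to plain text
instead of psql output; fingerprint and the sample rows are made up for
illustration:

```shell
#!/bin/bash
# Sketch: md5 of stdin, truncated to the first 9 hex characters as in cb.
# fingerprint is a hypothetical helper, not part of the test script.
function fingerprint()
{
  md5sum | cut -b 1-9
}

# Made-up rows standing in for ordered psql output from each instance.
master_rows=$'1|1|0|\n2|1|0|'
replica_rows=$'1|1|0|\n2|1|5|'    # second row differs

fp_master=$( echo "$master_rows" | fingerprint )
fp_replica=$( echo "$replica_rows" | fingerprint )

if [[ "$fp_master" == "$fp_replica" ]]
then
  echo "ok"
else
  echo "NOK"
fi
```

The "order by" in each query matters for the same reason: the md5 is
taken over the text output, so the same rows in a different order would
give a different sum.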

Later today I'll try to clean up the whole thing and post it.
