Re: Replication slot stats misgivings

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Subject: Re: Replication slot stats misgivings
Date: 2021-04-27 14:58:07
Message-ID: CAD21AoD8DvPDFKtZJOV1YkeTeemAREQ+ppuQL7U3LJcqFi1V-w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Apr 27, 2021 at 11:29 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Tue, Apr 27, 2021 at 5:40 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Tue, Apr 27, 2021 at 8:58 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > I have pushed this patch and seeing one buildfarm failure:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2021-04-27%2009%3A23%3A14
> >
> > starting permutation: s1_init s1_begin s1_insert_tbl1 s1_insert_tbl2
> > s2_alter_tbl1_char s1_commit s2_get_changes
> > + isolationtester: canceling step s1_init after 314 seconds
> > step s1_init: SELECT 'init' FROM
> > pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
> > ?column?
> >
> > I am analyzing this.
> >
>
> After checking below logs corresponding to this test, it seems test
> has been executed and create_slot was successful:

The pg_create_logical_replication_slot() was executed at 11:04:25:

2021-04-27 11:04:25.494 UTC [17694956:49] isolation/concurrent_ddl_dml
LOG: statement: SELECT 'init' FROM
pg_create_logical_replication_slot('isolation_slot', 'test_decoding');

Therefore this command took 314 sec that matches the number the
isolation test reported. And the folling logs follow:

2021-04-27 11:06:43.770 UTC [17694956:50] isolation/concurrent_ddl_dml
LOG: logical decoding found consistent point at 0/17F9078
2021-04-27 11:06:43.770 UTC [17694956:51] isolation/concurrent_ddl_dml
DETAIL: There are no running transactions.

> 2021-04-27 11:06:43.770 UTC [17694956:52] isolation/concurrent_ddl_dml
> STATEMENT: SELECT 'init' FROM
> pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
> 2021-04-27 11:07:11.748 UTC [5243096:9] LOG: checkpoint starting: time
> 2021-04-27 11:09:24.332 UTC [5243096:10] LOG: checkpoint complete:
> wrote 14 buffers (0.1%); 0 WAL file(s) added, 0 removed, 0 recycled;
> write=0.716 s, sync=0.001 s, total=132.584 s; sync files=0,
> longest=0.000 s, average=0.000 s; distance=198 kB, estimate=406 kB
> 2021-04-27 11:09:40.116 UTC [6226046:1] [unknown] LOG: connection
> received: host=[local]
> 2021-04-27 11:09:40.117 UTC [17694956:53] isolation/concurrent_ddl_dml
> LOG: statement: BEGIN;
> 2021-04-27 11:09:40.117 UTC [17694956:54] isolation/concurrent_ddl_dml
> LOG: statement: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
> 2021-04-27 11:09:40.118 UTC [17694956:55] isolation/concurrent_ddl_dml
> LOG: statement: INSERT INTO tbl2 (val1, val2) VALUES (1, 1);
> 2021-04-27 11:09:40.119 UTC [10944636:49] isolation/concurrent_ddl_dml
> LOG: statement: ALTER TABLE tbl1 ALTER COLUMN val2 TYPE character
> varying;
>
> I am not sure but there is some possibility that even though create
> slot is successful, the isolation tester got successful in canceling
> it, maybe because create_slot is just finished at the same time.

Yeah, we see the test log "canceling step s1_init after 314 seconds"
but don't see any log indicating canceling query.

> As we
> can see from logs, during this test checkpoint also happened which
> could also lead to the slowness of this particular command.

Yes. I also think the checkpoint could somewhat lead to the slowness.
And since create_slot() took 2min to find a consistent snapshot the
system might have already been busy.

>
> Also, I see a lot of messages like below which indicate stats
> collector is also quite slow:
> 2021-04-27 10:57:59.385 UTC [18743536:1] LOG: using stale statistics
> instead of current ones because stats collector is not responding
>
> I am not sure if the timeout happened because the machine is slow or
> is it in any way related to code. I am seeing some previous failures
> due to timeout on this machine [1][2]. In those failures, I see the
> "using stale stats...." message.

It seems like a time-dependent issue but I'm wondering why the logical
decoding test failed at this time.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2021-04-27 15:26:02 Re: Performance degradation of REFRESH MATERIALIZED VIEW
Previous Message Tom Lane 2021-04-27 14:38:23 Re: Better sanity checking for message length words