Re: Replication slot stats misgivings

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Subject: Re: Replication slot stats misgivings
Date: 2021-04-01 16:55:40
Message-ID: CALDaNm0NUXHszpFfUhMa565N0H-oKHSRL=BQFwwaMLGm0iUKwA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Apr 1, 2021 at 5:58 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Thu, Apr 1, 2021 at 3:43 PM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> >
> > On Wed, Mar 31, 2021 at 11:32 AM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> > >
> > > On Tue, Mar 30, 2021 at 11:00 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On 2021-03-30 10:13:29 +0530, vignesh C wrote:
> > > > > On Tue, Mar 30, 2021 at 6:28 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > > > > Any chance you could write a tap test exercising a few of these cases?
> > > > >
> > > > > I can try to write a patch for this if nobody objects.
> > > >
> > > > Cool!
> > > >
> > >
> > > Attached a patch which has the test for the first scenario.
> > >
> > > > > > E.g. things like:
> > > > > >
> > > > > > - create a few slots, drop one of them, shut down, start up, verify
> > > > > > stats are still sane
> > > > > > - create a few slots, shut down, manually remove a slot, lower
> > > > > > max_replication_slots, start up
> > > > >
> > > > > Here by "manually remove a slot", do you mean to remove the slot
> > > > > manually from the pg_replslot folder?
> > > >
> > > > Yep - thereby allowing max_replication_slots after the shutdown/start to
> > > > be lower than the number of slots-stats objects.
> > >
> > > I have not included the 2nd test in the patch as the test fails with
> > > following warnings and also displays the statistics of the removed
> > > slot:
> > > WARNING: problem in alloc set Statistics snapshot: detected write
> > > past chunk end in block 0x55d038b8e410, chunk 0x55d038b8e438
> > > WARNING: problem in alloc set Statistics snapshot: detected write
> > > past chunk end in block 0x55d038b8e410, chunk 0x55d038b8e438
> > >
> > > This happens because the statistics file has an additional slot
> > > present even though the replication slot was removed. I felt this
> > > issue should be fixed. I will try to fix this issue and send the
> > > second test along with the fix.
> >
> > I felt from the statistics collector process, there is no way in which
> > we can identify if the replication slot is present or not because the
> > statistic collector process does not have access to shared memory.
> > Anything that the statistic collector process does independently by
> > traversing and removing the statistics of the replication slot
> > exceeding the max_replication_slot has its drawback of removing some
> > valid replication slot's statistics data.
> > Any thoughts on how we can identify the replication slot which has been dropped?
> > Can someone point me to the shared stats patch link with which message
> > loss can be avoided. I wanted to see a scenario where something like
> > the slot is dropped but the statistics are not updated because of an
> > immediate shutdown or server going down abruptly can occur or not with
> > the shared stats patch.
> >
>
> I don't think it is easy to simulate a scenario where the 'drop'
> message is dropped and I think that is why the test contains the step
> to manually remove the slot. At this stage, you can probably provide a
> test patch and a code-fix patch where it just drops the extra slots
> from the stats file. That will allow us to test it with a shared
> memory stats patch on which Andres and Horiguchi-San are working. If
> we still continue to pursue with current approach then as Andres
> suggested we might send additional information from
> RestoreSlotFromDisk to keep it in sync.

Thanks for your comments, Attached patch has the fix for the same.
Also attached a couple of more patches which addresses the comments
which Andres had listed i.e changing char to NameData type and also to
display the unspilled/unstreamed transaction information in the
replication statistics.
Thoughts?

Regards,
Vignesh

Attachment Content-Type Size
v1-0001-Changed-char-datatype-to-NameData-datatype-for-sl.patch text/x-patch 4.7 KB
v1-0002-Added-tests-for-verification-of-logical-replicati.patch text/x-patch 6.3 KB
v1-0003-Add-total-txns-and-total-txn-bytes-for-replicatio.patch text/x-patch 12.5 KB
v1-0004-Handle-overwriting-of-replication-slot-statistic-.patch text/x-patch 8.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2021-04-01 16:56:09 Re: pg_amcheck contrib application
Previous Message vignesh C 2021-04-01 16:49:54 Data type correction in pgstat_report_replslot function parameters