Re: Directory pg_replslot is not properly cleaned

From: Fabrízio de Royes Mello <fabriziomello(at)gmail(dot)com>
To: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>, d(dot)sarafannikov(at)bk(dot)ru
Cc: Andres Freund <andres(at)anarazel(dot)de>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Subject: Re: Directory pg_replslot is not properly cleaned
Date: 2017-06-07 18:29:28
Message-ID: CAFcNs+qwUwjPL3+QqRBAxaa6BuQ0Y5+pk03NNuOdHLkXeWd6DQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jun 2, 2017 at 6:37 PM, Fabrízio de Royes Mello <
fabriziomello(at)gmail(dot)com> wrote:
>
>
> On Fri, Jun 2, 2017 at 6:32 PM, Fabrízio de Royes Mello <
fabriziomello(at)gmail(dot)com> wrote:
> >
> > Hi all,
> >
> > This week I ran into an out-of-disk-space problem on an 8TB production
cluster. During the investigation we noticed that pg_replslot was the culprit,
growing by more than 1TB in less than one hour.
> >
> > We're using PostgreSQL 9.5.6 with pglogical 1.2.2, replicating to a new
9.6 instance, and we are planning the upgrade soon.
> >
> > What did I do? I freed some disk space, just enough to start PostgreSQL
and begin the investigation. During startup recovery the files inside
pg_replslot were simply removed entirely, so our out-of-disk-space problem
disappeared. Then the server came up and the physical slaves attached to the
master normally, but the logical slaves didn't; they stayed stalled in the
'catchup' state.
> >
> > At that point the "pg_replslot" directory started growing fast again,
which forced us to drop the logical replication slot, and we lost the logical
slave.
> >
> > After googling for a while I found this thread [1] about a similar issue,
reported by Dmitriy Sarafannikov and answered by Andres and Álvaro.
> >
> > I ran the test case provided by Dmitriy [1] against branches:
> > - REL9_4_STABLE
> > - REL9_5_STABLE
> > - REL9_6_STABLE
> > - master
> >
> > In all of these tests the issue remains, including when using the new
logical replication features (CREATE PUBLICATION/CREATE SUBSCRIPTION). Only
after a restart was "pg_replslot" properly cleaned. The typo in
ReorderBufferIterTXNInit that Dmitriy complained about was fixed, but the
issue remains.
> >
> > It seems no one complained about this issue again and the thread was lost.
> >
> > Attached is a reworked version of Dmitriy's patch that seems to solve
the issue. I confess I don't know enough about the replication slot code to
really know whether it's the best solution.
> >
> > Regards,
> >
> > [1]
https://www.postgresql.org/message-id/1457621358.355011041%40f382.i.mail.ru
> >
>
> Just adding Dmitriy to the conversation... the previous email address I
provided was wrong.
>

Does anyone have any thoughts about this critical issue?
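
For anyone trying to reproduce or watch for this, here is a minimal sketch (not
part of the attached patch) that reports how much disk space each slot's
subdirectory under pg_replslot is using. The function name slot_dir_sizes is
made up for illustration, and it assumes the script runs as a user that can
read the data directory.

```python
import os


def slot_dir_sizes(replslot_dir):
    """Return {slot_name: total_bytes} for each subdirectory of replslot_dir.

    Hypothetical helper for illustration: replslot_dir is expected to be
    the pg_replslot directory inside the PostgreSQL data directory.
    """
    sizes = {}
    for entry in os.scandir(replslot_dir):
        # Each replication slot has its own subdirectory; skip anything else.
        if not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        for root, _dirs, files in os.walk(entry.path):
            for name in files:
                try:
                    total += os.path.getsize(os.path.join(root, name))
                except OSError:
                    # A file can vanish between listing and stat while
                    # the server is running; ignore it.
                    pass
        sizes[entry.name] = total
    return sizes
```

Calling slot_dir_sizes() periodically on the pg_replslot path and comparing
the results should show whether a slot's directory keeps growing while the
server is running, which is the symptom described above.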

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog: http://fabriziomello.github.io
>> Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello
>> Github: http://github.com/fabriziomello
