Re: Directory pg_replslot is not properly cleaned

From: Fabrízio de Royes Mello <fabriziomello(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>, d(dot)sarafannikov(at)bk(dot)ru, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Subject: Re: Directory pg_replslot is not properly cleaned
Date: 2017-06-07 18:46:45
Message-ID: CAFcNs+rEq630jAMTf3f29Dtw-=h9K4RoqkYzVP_cWfhN_H21JQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 7, 2017 at 3:30 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
>
>
> On June 7, 2017 11:29:28 AM PDT, "Fabrízio de Royes Mello" <fabriziomello(at)gmail(dot)com> wrote:
> >On Fri, Jun 2, 2017 at 6:37 PM, Fabrízio de Royes Mello <
> >fabriziomello(at)gmail(dot)com> wrote:
> >>
> >>
> >> On Fri, Jun 2, 2017 at 6:32 PM, Fabrízio de Royes Mello <
> >> fabriziomello(at)gmail(dot)com> wrote:
> >> >
> >> > Hi all,
> >> >
> >> > This week I ran into an out-of-disk-space problem on an 8TB production
> >> > cluster. During the investigation we noticed that pg_replslot was the
> >> > culprit, growing more than 1TB in less than one hour.
> >> >
> >> > We're using PostgreSQL 9.5.6 with pglogical 1.2.2, replicating to a
> >> > new 9.6 instance, and are planning the upgrade soon.
> >> >
> >> > What did I do? I freed some disk space, just enough to start PostgreSQL
> >> > and begin the investigation. During startup recovery the files inside
> >> > pg_replslot were all removed, so our out-of-disk-space problem
> >> > disappeared. The server then came up and the physical slaves attached
> >> > to the master normally, but the logical slaves did not; they stayed
> >> > stalled in the 'catchup' state.
> >> >
> >> > At that point the "pg_replslot" directory started growing fast again,
> >> > which forced us to drop the logical replication slot, and we lost the
> >> > logical slave.
> >> >
> >> > After googling for a while I found this thread [1] about a similar
> >> > issue, reported by Dmitriy Sarafannikov and replied to by Andres and
> >> > Álvaro.
> >> >
> >> > I ran the test case provided by Dmitriy [1] against branches:
> >> > - REL9_4_STABLE
> >> > - REL9_5_STABLE
> >> > - REL9_6_STABLE
> >> > - master
> >> >
> >> > In all of these tests the issue remains, including with the new Logical
> >> > Replication stuff (CREATE PUB/CREATE SUB). Only after a restart was
> >> > "pg_replslot" properly cleaned. The typo in ReorderBufferIterTXNInit
> >> > that Dmitriy complained about was fixed, but the issue remains.
> >> >
> >> > It seems no one complained about this issue again and the thread was
> >> > lost.
> >> >
> >> > Attached is a reworked version of Dmitriy's patch that seems to solve
> >> > the issue. I confess I don't know enough about the replication slot
> >> > code to really know whether it's the best solution.
> >> >
> >> > Regards,
> >> >
> >> > [1] https://www.postgresql.org/message-id/1457621358.355011041%40f382.i.mail.ru
> >> >
> >>
> >> Just adding Dmitriy to the conversation... the previous email I provided
> >> was wrong.
> >>
> >
> >Does anyone have any thoughts on this critical issue?
> >
>
> I plan to look into it over the next few days.
>

Thanks...

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog: http://fabriziomello.github.io
>> Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello
>> Github: http://github.com/fabriziomello
