Re: Directory pg_replslot is not properly cleaned

From: Andres Freund <andres(at)anarazel(dot)de>
To: fabriziomello(at)gmail(dot)com, Fabrízio de Royes Mello <fabriziomello(at)gmail(dot)com>,Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>,d(dot)sarafannikov(at)bk(dot)ru
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Subject: Re: Directory pg_replslot is not properly cleaned
Date: 2017-06-07 18:30:23
Message-ID: A6F30A24-1558-439E-B5AC-C7F72E527313@anarazel.de
Lists: pgsql-hackers

On June 7, 2017 11:29:28 AM PDT, "Fabrízio de Royes Mello" <fabriziomello(at)gmail(dot)com> wrote:
>On Fri, Jun 2, 2017 at 6:37 PM, Fabrízio de Royes Mello <
>fabriziomello(at)gmail(dot)com> wrote:
>>
>>
>> On Fri, Jun 2, 2017 at 6:32 PM, Fabrízio de Royes Mello <fabriziomello(at)gmail(dot)com> wrote:
>> >
>> > Hi all,
>> >
>> > This week I ran out of disk space on an 8TB production cluster.
>> > During the investigation we noticed that pg_replslot was the
>> > culprit, growing more than 1TB in less than one hour.
>> >
>> > We're using PostgreSQL 9.5.6 with pglogical 1.2.2 replicating to a
>> > new 9.6 instance and planning the upgrade soon.
>> >
>> > What did I do? I freed some disk space, just enough to start
>> > PostgreSQL and begin the investigation. During startup recovery the
>> > files inside pg_replslot were completely removed, so our
>> > out-of-disk-space problem disappeared. The server then came up and
>> > the physical slaves attached normally to the master, but the logical
>> > slaves did not; they stayed stalled in the 'catchup' state.
>> >
>> > At that point the "pg_replslot" directory started growing fast
>> > again, which forced us to drop the logical replication slot, and we
>> > lost the logical slave.
>> >
>> > After googling for a while I found this thread [1] about a similar
>> > issue, reported by Dmitriy Sarafannikov and answered by Andres and
>> > Álvaro.
>> >
>> > I ran the test case provided by Dmitriy [1] against branches:
>> > - REL9_4_STABLE
>> > - REL9_5_STABLE
>> > - REL9_6_STABLE
>> > - master
>> >
>> > After all the tests the issue remains, including when using the new
>> > logical replication feature (CREATE PUBLICATION/CREATE
>> > SUBSCRIPTION). Only after a restart was "pg_replslot" properly
>> > cleaned. The typo in ReorderBufferIterTXNInit that Dmitriy
>> > complained about was fixed, but the issue remains.
>> >
>> > It seems no one complained about this issue again and the thread
>> > was lost.
>> >
>> > Attached is a reworked version of Dmitriy's patch that seems to
>> > solve the issue. I confess I don't know enough about the
>> > replication slot code to really know whether it's the best
>> > solution.
>> >
>> > Regards,
>> >
>> > [1] https://www.postgresql.org/message-id/1457621358.355011041%40f382.i.mail.ru
>> >
>>
>> Just adding Dmitriy to the conversation... the previous email I
>> provided was wrong.
>>
>
>Does anyone have any thoughts about this critical issue?
>

I plan to look into it over the next few days.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
