Re: logical replication: could not create file "state.tmp": File exists

From: Grigory Smolkin <g(dot)smolkin(at)postgrespro(dot)ru>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Pg Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: logical replication: could not create file "state.tmp": File exists
Date: 2019-12-02 17:49:04
Message-ID: 14b3454f-2d68-c637-68e4-2b42ff976168@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs


On 12/2/19 7:12 PM, Andres Freund wrote:
> Hi,
>
> On 2019-11-30 15:09:39 +0300, Grigory Smolkin wrote:
>> One of my colleagues encountered an out of space condition, which broke his
>> logical replication setup.
>> It`s manifested with the following errors:
>>
>> ERROR:  could not receive data from WAL stream: ERROR:  could not create
>> file "pg_replslot/some_sub/state.tmp": File exists
> Hm. What was the log output leading to this state? Some cases of this
> would end up in a PANIC, which'd remove the .tmp file during
> recovery. But there's some where we won't - it seems the right fix for
> this would be to unlink the tmp file in that case?
>
>
>> I`ve digged a bit into this problem, and it`s turned out that in
>> SaveSlotToPath() temp file for replication slot is opened with 'O_CREAT |
>> O_EXCL' flags, which makes this routine as not very reentrant.
>>
>> Since an exclusive lock is taken before temp file creation, I think it
>> should be safe to replace O_EXCL with O_TRUNC.
> I'm very doubtful about this. I think it's a good safety measure to
> ensure that there's no previous state file that we're somehow
> overwriting.
>
>
>> Script to reproduce and patch are attached.
> Well:
>
>> # Imitate out_of_space/write_operation_error
>> touch ${PGDATA_PUB}/pg_replslot/mysub/state.tmp
> Doesn't really replicate how we got into this state...

I`ve managed to reproduce the issue using the attached script:

2019-12-02 20:32:12.547 MSK [4180] ERROR:  could not write to file
"pg_logical/snapshots/0-160E9A28.snap.4180.tmp": No space left on device
2019-12-02 20:32:12.548 MSK [4178] ERROR:  could not receive data from
WAL stream: ERROR:  could not write to file
"pg_logical/snapshots/0-160E9A28.snap.4180.tmp": No space left on device
2019-12-02 20:32:12.557 MSK [4645] LOG:  logical replication apply
worker for subscription "mysub" has started
2019-12-02 20:32:12.562 MSK [4091] LOG:  background worker "logical
replication worker" (PID 4178) exited with exit code 1
2019-12-02 20:32:12.569 MSK [4648] LOG:  starting logical decoding for
slot "mysub"
2019-12-02 20:32:12.569 MSK [4648] DETAIL:  Streaming transactions
committing after 0/160E9A28, reading WAL from 0/1566A2D8.
2019-12-02 20:32:12.569 MSK [4648] LOG:  logical decoding found
consistent point at 0/1566A2D8
2019-12-02 20:32:12.569 MSK [4648] DETAIL:  Logical decoding will begin
using saved snapshot.
2019-12-02 20:32:12.672 MSK [4648] ERROR:  could not write to file
"pg_replslot/mysub/state.tmp": No space left on device
2019-12-02 20:32:12.672 MSK [4645] ERROR:  could not receive data from
WAL stream: ERROR:  could not write to file
"pg_replslot/mysub/state.tmp": No space left on device
2019-12-02 20:32:12.672 MSK [4091] LOG:  background worker "logical
replication worker" (PID 4645) exited with exit code 1
2019-12-02 20:32:17.684 MSK [4653] LOG:  logical replication apply
worker for subscription "mysub" has started
2019-12-02 20:32:17.693 MSK [4654] LOG:  starting logical decoding for
slot "mysub"
2019-12-02 20:32:17.693 MSK [4654] DETAIL:  Streaming transactions
committing after 0/160E9A28, reading WAL from 0/1566A2D8.
2019-12-02 20:32:17.693 MSK [4654] LOG:  logical decoding found
consistent point at 0/1566A2D8
2019-12-02 20:32:17.693 MSK [4654] DETAIL:  Logical decoding will begin
using saved snapshot.
2019-12-02 20:32:17.847 MSK [4654] ERROR:  could not write to file
"pg_logical/snapshots/0-160E9A28.snap.4654.tmp": No space left on device
2019-12-02 20:32:17.847 MSK [4653] ERROR:  could not receive data from
WAL stream: ERROR:  could not write to file
"pg_logical/snapshots/0-160E9A28.snap.4654.tmp": No space left on device
2019-12-02 20:32:17.848 MSK [4091] LOG:  background worker "logical
replication worker" (PID 4653) exited with exit code 1
2019-12-02 20:32:22.859 MSK [4658] LOG:  logical replication apply
worker for subscription "mysub" has started
2019-12-02 20:32:22.868 MSK [4659] LOG:  starting logical decoding for
slot "mysub"
2019-12-02 20:32:22.868 MSK [4659] DETAIL:  Streaming transactions
committing after 0/160E9A28, reading WAL from 0/1566A2D8.
2019-12-02 20:32:22.868 MSK [4659] LOG:  logical decoding found
consistent point at 0/1566A2D8
2019-12-02 20:32:22.868 MSK [4659] DETAIL:  Logical decoding will begin
using saved snapshot.
2019-12-02 20:32:22.868 MSK [4659] ERROR:  could not create file
"pg_replslot/mysub/state.tmp": File exists
2019-12-02 20:32:22.869 MSK [4658] ERROR:  could not receive data from
WAL stream: ERROR:  could not create file "pg_replslot/mysub/state.tmp":
File exists

>
> Greetings,
>
> Andres Freund

--
Grigory Smolkin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment Content-Type Size
replication_bug.sh application/x-shellscript 1.7 KB

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Thomas Munro 2019-12-02 22:52:31 Re: Since '2001-09-09 01:46:40'::timestamp microseconds are lost when extracting epoch
Previous Message Dmitry Vasiliev 2019-12-02 17:35:53 Re: logical replication: could not create file "state.tmp": File exists