Re: WAL recycled despite logical replication slot

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL recycled despite logical replication slot
Date: 2019-09-22 15:32:17
Message-ID: CAMkU=1yrRmUVDq5c+hHUEhLsyT1A-Nx7PrKs9GSXc6nG=Lo_7Q@mail.gmail.com
Lists: pgsql-hackers

On Fri, Sep 20, 2019 at 11:27 AM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:

> >
> >Is there an innocent explanation for this? I thought logical replication
> >slots provided an iron-clad guarantee that WAL would be retained until it
> >was no longer needed. I am just using pub/sub, none of the lower level
> >stuff.
> >
>
> I think you're right - this should not happen with replication slots.
> Can you provide more detailed setup instructions, so that I can try to
> reproduce and investigate the issue?
>

It is a bit messy, because this isn't what I was trying to test.

The basic setup is pretty simple:

On master:

pgbench -i -s 100
create publication pgbench for table pgbench_accounts, pgbench_branches,
pgbench_history, pgbench_tellers;
pgbench -R200 -c4 -j4 -P60 -T360000 -n

on replica:

pgbench -i -s 1
truncate pgbench_history, pgbench_accounts, pgbench_branches,
pgbench_tellers;
create subscription sub CONNECTION 'host=192.168.0.15' publication pgbench;

The messy part: It looked like the synch was never going to finish, so
first I cut the rate down to -R20. Then what I thought I did was drop the
primary key on pgbench_accounts (manually doing a kill -15 on the synch
worker to release the lock), wait for the copy to start again and then
finish, and then start getting this log message:

ERROR: logical replication target relation "public.pgbench_accounts" has
neither REPLICA IDENTITY index nor PRIMARY KEY and published relation does
not have REPLICA IDENTITY FULL

Then I re-added the primary key and increased the -R back to 200, and
about 50 minutes later got the WAL already removed error.
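
For anyone trying to follow along, the primary key drop and re-add
described above would look roughly like the following, presumably run on
the replica since that is where the synch worker lives. This is a sketch
and not a transcript of what was actually run: it assumes the default
constraint name (pgbench_accounts_pkey) and key column (aid) that
pgbench -i creates.

```sql
-- Assumption: default pgbench constraint name and key column.
ALTER TABLE pgbench_accounts DROP CONSTRAINT pgbench_accounts_pkey;

-- ... wait for the initial COPY of pgbench_accounts to finish ...

ALTER TABLE pgbench_accounts ADD PRIMARY KEY (aid);
```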

But now I can't seem to reproduce this, as the next time I tried to do the
synch with no primary key there doesn't seem to be a commit after the COPY
finishes so once it tries to replay the first update, it hits the above "no
primary key" error and then rolls back **the entire COPY** as well as
the single-row update, and starts the entire COPY over again before you
have a chance to intervene and build the index. So I'm guessing now that
either the lack of a commit (which itself seems like a spectacularly bad
idea) is situation dependent, or the very slow COPY had finished between
the time I had decided to drop the primary key, and the time I actually
implemented the drop.

Perhaps important here is that the replica is rather underpowered. Write
IO and fsyncs periodically become painfully slow, which is probably why
there are replication timeouts, and since the problem happened when trying
to reestablish after a timeout I would guess that that is critical to the
issue.

I was running the master with fsync=off, but since the OS never crashed
that should not be the source of corruption.

I'll try some more to reproduce this, but I wanted to make sure there was
actually something here to reproduce, and not just my misunderstanding of
how things are supposed to work.
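
While reproducing, a query along these lines on the master should show
whether the slot is still holding back WAL. It is only a sketch: it
assumes PostgreSQL 10 or later, and that the slot carries the default
name taken from the subscription ("sub" in the setup above).

```sql
-- Run on the master: how much WAL is each replication slot retaining?
SELECT slot_name,
       active,
       restart_lsn,
       confirmed_flush_lsn,
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;
```

If restart_lsn for the slot points into a segment that is no longer
present in pg_wal, that would be consistent with WAL having been removed
while the slot still needed it.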

Cheers,

Jeff
