Re: Problem with FSM & streaming replication

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Philip Scott <safetyfirstphil(at)gmail(dot)com>
Cc: PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Problem with FSM & streaming replication
Date: 2016-07-09 04:21:35
Message-ID: CAA4eK1+p-coo1obVX5JgbTTEqzeHwmSe5i-G1Z1jR3u5oTyyzQ@mail.gmail.com
Lists: pgsql-bugs

On Fri, Jul 8, 2016 at 5:17 PM, Philip Scott <safetyfirstphil(at)gmail(dot)com> wrote:
> Hi Folks,
>
> Earlier this week we simulated failing over to our secondary data center. We
> have been using streaming replication, with the slaves available for
> read-only queries.
>
> Shortly after failing over, we began getting the following message when
> inserting into one particular table:
>
> could not read block 11 in file "base/16401/16748": read only 0 of 8192
> bytes
>
> This particular relation is only 10 blocks long.
>
> I got us running again by COPY-ing the table, truncating it, and COPY-ing
> the data back in. However, I brought up a backup to investigate the problem.
>
> You can reproduce it by downloading the following database and running
> 'stress.py' (though you will need python & psycopg2 to do so)
>
> https://www.dropbox.com/s/vid55gahlu07tm0/2016-07-04-basic3.tar.bz2?dl=0
>
> After a deep dive into the postgres source code, finding things I never knew
> I could do with GDB (record and rewind, wow!), and with help from the guys
> at 2ndquadrant, we have determined that the cause is a free space map that
> contains an entry for the non-existent block 11.
> RecordAndGetPageWithFreeSpace is looking at the FSM and returning block 11.
> We then try to load that block, and it's game over. I have stepped
> through the code to verify this, and even built a graphviz graph of a dump
> of the FSM structure :)
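
As a side note, the stale entry can also be probed from plain SQL. A minimal
sketch, assuming the pg_freespacemap extension is installed and using the
hypothetical name "problem_table" for the affected relation; the set-returning
form of pg_freespace() only walks blocks that actually exist in the main fork,
so the suspect block has to be probed by number:

CREATE EXTENSION IF NOT EXISTS pg_freespacemap;

-- relpages is the page count recorded by VACUUM/ANALYZE; actual_blocks is
-- the live size of the main fork.  A non-zero fsm_avail_past_end means the
-- FSM still advertises free space on a block past the end of the relation.
SELECT c.relpages,
       pg_relation_size(c.oid) / current_setting('block_size')::bigint AS actual_blocks,
       pg_freespace(c.oid, c.relpages) AS fsm_avail_past_end
FROM   pg_class AS c
WHERE  c.oid = 'problem_table'::regclass;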
>
> After discovering the wonders of pageinspect, I was able to look at the
> entire database and found 23 tables where there are non-zero FSM entries for
> page numbers which do not exist. Running vacuum on them fixes it. This is
> the query I used to check:
>
> WITH stuff AS (
>     SELECT nspname,
>            relname,
>            relpages,
>            pg_freespace(nspname || '."' || relname || '"', relpages) AS freespace
>     FROM   pg_class
>     JOIN   pg_catalog.pg_namespace AS n ON n.oid = pg_class.relnamespace)
> SELECT *
> FROM   stuff
> WHERE  freespace != 0
>
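
The same probe can be turned into a clean-up list. A sketch along the same
lines (again needing pg_freespacemap): emit a VACUUM statement for every
ordinary table whose FSM still advertises free space on the first block past
relpages, since, as noted above, running VACUUM on the affected tables fixes
them:

-- Generate "VACUUM schema.table;" for each table with a stale FSM entry.
SELECT format('VACUUM %I.%I;', n.nspname, c.relname) AS fixup
FROM   pg_class AS c
JOIN   pg_catalog.pg_namespace AS n ON n.oid = c.relnamespace
WHERE  c.relkind = 'r'
AND    pg_freespace(c.oid, c.relpages) <> 0;

Feeding that output back through psql vacuums only the affected tables.
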
> So I guess I have a couple of questions:
>
> Is it valid for the FSM to contain a non-zero leaf entry for a block that
> does not exist? If so, should I submit a patch which checks, before loading,
> that the block we looked up in the FSM is < relpages?
> If it is not valid, is there possibly a problem with streaming replication
> with regards to the FSM? I am at a bit of a loss as to where to continue
> investigating that.
> Is it perhaps that at some point in history, possibly with an older version
> of postgres, these FSMs got messed up but are unlikely to again in the
> future? Should I just vacuum them (or pg_dump/restore?) and carry on with
> business as usual?

The mechanism for truncating the FSM during WAL replay seems to be
present since commit 3396000684b41e7e9467d1abc67152b39e697035.

> As an aside, is VACUUM the only process that truncates a filenode and
> reduces the number of pages in a relation?
>
> I note with interest that the FSM is not WAL logged, and wonder if something
> like the following could occur:
>
> On the master, we have 11 blocks.
> We rsync / pg_basebackup the whole pg_data directory to the slave
> On the master, we delete some stuff, vacuum the table, now we have only 10
> blocks
> The table itself is WAL-logged, so the main relation data file is truncated
> on both the master and the slave
> However, since the FSM is not WAL logged, the FSM on the slave is not
> updated to know this and still talks about block 11
>

I don't think this is true; see the replay code for XLOG_SMGR_TRUNCATE
(smgr_redo). We do truncate the FSM along with the relation. One
possibility that comes to mind is that during replay, after it truncates
the relation and before it truncates the FSM, a user backend is somehow
able to see the FSM. Ideally that shouldn't happen, because we take an
AccessExclusiveLock on the relation and that lock is WAL-logged as well,
so during replay the startup process should have acquired that lock
before replaying the XLOG_SMGR_TRUNCATE.
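
On a hot standby, such a lock taken from WAL is held by the startup process
and shows up in pg_locks while the originating transaction is still open on
the master, so something along these lines (a sketch, run on the standby
while the truncation is being replayed) can confirm whether the lock is
really there:

-- Run on the standby; AccessExclusiveLocks replayed from WAL should be
-- listed here until the originating transaction commits or aborts.
SELECT locktype, relation::regclass AS relation, mode, granted, pid
FROM   pg_locks
WHERE  mode = 'AccessExclusiveLock';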

> We go looking for a space to put a lovely new tuple, and by chance start
> somewhere in the tree that leads us eventually to block 11
> Try and read block 11
> Sad times
>
> Some possibly useful facts:
>
> We are on postgres 9.5.3, running RHEL 6. The database has never, to my
> knowledge, crashed; they are 'proper' server boxes with battery-backed
> cache and a UPS, in a data center, etc.
>

What are the non-default postgresql.conf settings (like
synchronous_commit, wal_level, etc.)?
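
Something along these lines should list everything that has been changed
from the built-in defaults:

-- Settings whose value does not come from the built-in default.
SELECT name, setting, source
FROM   pg_settings
WHERE  source NOT IN ('default', 'override')
ORDER  BY name;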

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
