Problem with FSM & streaming replication

From: Philip Scott <safetyfirstphil(at)gmail(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Problem with FSM & streaming replication
Date: 2016-07-08 11:47:43
Message-ID: CAJakt-8=aXa-F7uFeLAeSYhQ4wFuaX3+ytDuDj9c8Gx6S_ou=w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hi Folks,

Earlier this week we simulated failing over to our secondary data center.
We have been using streaming replication, with the slaves available for
read-only queries.

Shortly after failing over, we began getting the following message when
inserting into one particular table:

could not read block 11 in file "base/16401/16748": read only 0 of 8192
bytes

This particular relation is only 10 blocks long.

I got us running again by COPY-ing the table, truncating it, and COPY-ing
the data back in. However I bought up a backup to investigate the problem.

You can replicate it by downloading the following database, and running
'stress.py' (though you will need python & psycopg2 to do so)

https://www.dropbox.com/s/vid55gahlu07tm0/2016-07-04-basic3.tar.bz2?dl=0

After a deep dive into the postgres source code, finding things I never
knew I could do with GDB (record and rewind, wow!), and with help from the
guys at 2ndquadrant, we have determined that the cause is a free space map
that contains an entry for the non-existent block 11.
RecordAndGetPageWithFreeSpace is looking at the FSM and returning block 11.
We then try and load that block and then it's game over. I have stepped
through the code to verify this, and even built a graphviz graph of a dump
of the FSM structure :)

After discovering the wonders of pageinspect, I was able to look at the
entire database and found 23 tables where there are non-zero FSM entries
for pages numbers which do not exist. Running vacuum on them fixes it. This
is the query I used to check:

WITH stuff
AS (SELECT nspname,
relname,
relpages,
*pg_freespace*(nspname
|| '."'
|| relname
|| '"', relpages) AS freespace
FROM pg_class
JOIN pg_catalog.pg_namespace AS n
ON n.oid = pg_class.relnamespace)
SELECT *
FROM stuff
WHERE freespace != 0

So I guess I have a couple of questions:

- Is it valid for the FSM to contain an non-zero leaf entry for a block
that does not exist? If so, should I submit a patch which checks before
loading that the block we looked up in the FSM is < relpages
- If it is not valid, is there possibly a problem with streaming
replication with regards to the FSM? I am at a bit of a loss as to where to
continue investigating that.
- Is it perhaps that in some point in history, possibly with an older
version of postgres, these FSMs got messed up but are unlikely to in the
future. I should just vacuum them (or pg_dump/restore?) and carry on with
business as usual.
- As an aside, is VACUUM the only process that truncates a filenode and
reduces the number of pages in a relation?

I note with interest that the FSM is not WAL logged, and wonder if
something like the following could occur:

- On the master, we have 11 blocks.
- We rsync / pg_basebackup the whole pg_data directory to the slave
- On the master, we delete some stuff, vacuum the table, now we have
only 10 blocks
- Table is WAL logged, so the main relation data file is truncated on
the master and the slave
- However, since the FSM is not WAL logged, the FSM on the slave is not
updated to know this and still talks about block 11
- We go looking for a space to put a lovely new tuple, and by chance
start somewhere in the tree that leads us eventually to block 11
- Try and read block 11
- Sad times

Some possibly useful facts:

- We are on postgres 9.5.3, running RHEL 6. The database has never to my
knowledge crashed; they are 'proper' server boxes, battery backed cache,
UPS, in a data center etc..
- The tables that are effected are mostly ones where there are many
INSERTs and DELETEs, but the table does not change much in size over time.

Thank you very much for your time.

Kind Regards,

Philip

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Prashanth Adiyodi 2016-07-08 12:03:38 Re: [BUGS] Where clause in pg_dump: need help
Previous Message Michael Paquier 2016-07-08 11:10:38 Re: [BUGS] BUG #14230: Wrong timeline returned by pg_stop_backup on a standby