Page freezing, FSM, and WAL replay

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Page freezing, FSM, and WAL replay
Date: 2018-08-02 17:28:57
Message-ID: 20180802172857.5skoexsilnjvgruk@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

We recently had a customer report a very strange problem, involving a
very large insert-only table: without explanation, insertions would
stall for several seconds, causing application timeout and process
accumulation and other nastiness.

After some investigation, we narrowed this down to happening immediately
after the first VACUUM on the table right after a standby got promoted.
It wasn't at first obvious what the connection between these factors
was, but eventually we realized that VACUUM must have been skipping a
bunch of pages because they had been marked all-frozen previously, so
the FSM was not updated with the correct freespace figures for those
pages. The FSM pages had been transmitted as full-page images on WAL
before the promotion (because wal_log_hints), so they contained
optimistic numbers on amount of free space coming from the previous
master. (Because this only happens on the first change to that FSM page
after a checkpoint, it's quite likely that one page every few thousand
or so contains optimistic figures while the others remain all zeroes, or
something like that.)

Before VACUUM, nothing too bad would happen, because the upper layers of
the FSM would not know about those optimistic numbers. But when VACUUM
does FreeSpaceMapVacuum, it propagates those numbers upwards; as soon as
that happens, inserters looking for pages would be told about those
pages (wrongly catalogued to contain sufficient free space), go to
insert there, and fail because there isn't actually any freespace; ask
FSM for another page, lather, rinse, repeat until all those pages are
all catalogued correctly by FSM, at which point things continue
normally. (There are many processes doing this chase-up concurrently
and it seems a pretty contentious process, about which see last
paragraph; it can be seen in pg_xlogdump that it takes several seconds
for things to settle).

After considering several possible solutions, I propose to have
heap_xlog_visible compute free space for any page being marked frozen;
Pavan adds to that to have heap_xlog_clean compute free space for all
pages also. This means that if we later promote this standby and VACUUM
skips all-frozen pages, their FSM numbers are going to be up-to-date
anyway. Patch attached.

Now, it's possible that the problem occurs for all-visible pages not
just all-frozen. I haven't seen that one, maybe there's some reason why
it cannot. But fixing both things together is an easy change in the
proposed patch: just do it on xlrec->flags != 0 rather than checking for
the specific all-frozen flag.

(This problem seems to be made worse by the fact that
RecordAndGetPageWithFreeSpace (or rather fsm_set_and_search) holds
exclusive lock on the FSM page for the whole duration of update plus
search. So when there are many inserters, they all race to the update
process. Maybe it'd be less terrible if we would release exclusive
after the update and grab shared lock for the search in
fsm_set_and_search, but we still have to have the exclusive for the
update, so the contention point remains. Maybe there's not sufficient
improvement to make a practical difference, so I'm not proposing
changing this.)

--
Álvaro Herrera

Attachment Content-Type Size
fix-fsm-replay.patch text/plain 2.9 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2018-08-02 17:44:58 Re: Explain buffers wrong counter with parallel plans
Previous Message Alexandra Ryzhevich 2018-08-02 17:25:14 [PATCH] Add regress test for pg_read_all_stats role