Re: Reduce/eliminate the impact of FPW

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Daniel Wood <hexexpert(at)comcast(dot)net>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reduce/eliminate the impact of FPW
Date: 2020-08-03 15:26:59
Message-ID: CA+TgmoaLdyBCSfQb=8+zthN0Oyfs0zE4HQuRE6wR+EibxHnNDQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Aug 3, 2020 at 5:26 AM Daniel Wood <hexexpert(at)comcast(dot)net> wrote:
> If we can't eliminate FPW's can we at least solve the impact of it?
> Instead of writing the before images of pages inline into the WAL,
> which increases the COMMIT latency, write these same images to a
> separate physical log file. The key idea is that I don't believe that
> COMMIT's require these buffers to be immediately flushed to the
> physical log. We only need to flush these before the dirty pages are
> written. This delay allows the physical before image IO's to be
> decoupled and done in an efficient manner without an impact to
> COMMIT's.
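
If I'm reading that right, the ordering rule being proposed is roughly
the following. This is just a sketch to restate the idea; none of these
names exist anywhere, and the real bookkeeping would live in the buffer
manager:

/*
 * Sketch of the proposed rule (not PostgreSQL code; all names invented):
 * before-images don't have to be durable at COMMIT time, only before the
 * dirty page they protect is written back.
 */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t BImgLogPos;            /* offset into the before-image log */

static BImgLogPos bimg_log_flushed = 0; /* how far that log has been fsync'd */

static void
bimg_log_flush(BImgLogPos upto)
{
    /* stand-in for: write the before-image log up to 'upto', then fsync it */
    printf("flushing before-image log up to %llu\n", (unsigned long long) upto);
    bimg_log_flushed = upto;
}

typedef struct
{
    int         page_id;    /* stand-in for the buffer itself */
    BImgLogPos  bimg_pos;   /* where this page's before-image was logged */
} DirtyPage;

static void
flush_dirty_page(const DirtyPage *d)
{
    /*
     * Same shape as the WAL-before-data rule, but against the before-image
     * log: the before-image must be on disk before we overwrite the page,
     * and nothing forces it to disk any earlier than that.
     */
    if (d->bimg_pos > bimg_log_flushed)
        bimg_log_flush(d->bimg_pos);

    printf("writing out data page %d\n", d->page_id);
}

int
main(void)
{
    DirtyPage   d = { .page_id = 42, .bimg_pos = 1000 };

    flush_dirty_page(&d);
    return 0;
}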

I think this is what's called a double-write buffer, which was tried
some years ago under that name. A significant problem is that you
have to fsync() the double-write buffer before you can write the WAL.
So instead of this:

- write WAL to OS
- fsync WAL

You have to do this:

- write double-write buffer to OS
- fsync double-write buffer
- write WAL to OS
- fsync WAL

Note that you cannot overlap these steps -- the first fsync must be
completed before the second write can begin, else you might try to
replay WAL for which the double-write buffer information is not
available.
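
Concretely, the commit path ends up looking something like this. This
is only a sketch with made-up file names and no error handling; the
point is just that steps 2 and 3 cannot overlap:

/*
 * Sketch of the commit-path ordering with a double-write buffer (DWB).
 * File names are invented and error handling is omitted.
 */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static void
flush_for_commit(int dwb_fd, int wal_fd,
                 const void *dwb_data, size_t dwb_len,
                 const void *wal_data, size_t wal_len)
{
    /* 1. write double-write buffer to OS */
    (void) write(dwb_fd, dwb_data, dwb_len);

    /*
     * 2. fsync double-write buffer.  This must finish before the WAL
     * write begins: otherwise a crash could leave durable WAL that
     * references before-images that never reached disk.
     */
    (void) fsync(dwb_fd);

    /* 3. write WAL to OS */
    (void) write(wal_fd, wal_data, wal_len);

    /* 4. fsync WAL -- only now is the commit durable */
    (void) fsync(wal_fd);
}

int
main(void)
{
    int         dwb_fd = open("/tmp/dwb.sketch", O_WRONLY | O_CREAT | O_APPEND, 0600);
    int         wal_fd = open("/tmp/wal.sketch", O_WRONLY | O_CREAT | O_APPEND, 0600);
    const char *bimg = "before-image of a page";
    const char *rec = "WAL record referencing that before-image";

    flush_for_commit(dwb_fd, wal_fd, bimg, strlen(bimg), rec, strlen(rec));

    close(dwb_fd);
    close(wal_fd);
    return 0;
}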

Because of this, I think this is actually quite expensive. COMMIT
requires the WAL to be flushed, unless you configure
synchronous_commit=off. So this would double the number of fsyncs we
have to do. It's not as bad as all that, because the individual fsyncs
would be smaller, and that makes a significant difference. For a big
transaction that writes a lot of WAL, you'd probably not notice much
difference; instead of writing 1000 pages to WAL, you might write 770
pages to the double-write buffer and 270 pages to the WAL,
or something like that. But for short transactions, such as those
performed by pgbench, you'd probably end up with a lot of cases where
you had to write 3 pages instead of 2, and not only that, but the
writes have to be consecutive rather than simultaneous, and to
different parts of the disk rather than sequential. That would likely
suck a lot.
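
As a back-of-the-envelope illustration of that, the constants below are
pure guesses rather than measurements, but they show why the extra
serialized fsync dominates for tiny commits:

/*
 * Back-of-envelope model for a small pgbench-style commit.  The constants
 * are illustrative assumptions, not measurements.
 */
#include <stdio.h>

int
main(void)
{
    double      fsync_ms = 1.0;     /* assumed fixed cost per serialized flush */
    double      page_ms = 0.05;     /* assumed cost per page written */

    /* today: the FPW rides along inside the WAL, say 2 pages, 1 fsync */
    double      today = 1 * fsync_ms + 2 * page_ms;

    /*
     * with a DWB: roughly 3 pages total, split across two files that must
     * be flushed one after the other, so 2 fsyncs on the commit path
     */
    double      with_dwb = 2 * fsync_ms + 3 * page_ms;

    printf("today:    %.2f ms per commit\n", today);
    printf("with DWB: %.2f ms per commit\n", with_dwb);
    return 0;
}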

It's entirely possible that these kinds of problems could be mitigated
through really good engineering, maybe to the point where this kind of
solution outperforms what we have now for some or even all workloads,
but it seems equally possible that it's just always a loser. I don't
really know. It seems like a very difficult project.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
