Re: Improve WALRead() to suck data directly from WAL buffers when possible

From: Andres Freund <andres(at)anarazel(dot)de>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Improve WALRead() to suck data directly from WAL buffers when possible
Date: 2023-10-11 22:43:53
Message-ID: 20231011224353.cl7c2s222dw3de4j@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2023-10-03 16:05:32 -0700, Jeff Davis wrote:
> On Sat, 2023-01-14 at 12:34 -0800, Andres Freund wrote:
> > One benefit would be that it'd make it more realistic to use direct
> > IO for WAL
> > - for which I have seen significant performance benefits. But when we
> > afterwards have to re-read it from disk to replicate, it's less
> > clearly a win.
>
> Does this patch still look like a good fit for your (or someone else's)
> plans for direct IO here? If so, would committing this soon make it
> easier to make progress on that, or should we wait until it's actually
> needed?

I think it'd be quite useful to have. Even with the code as of 16, I see
better performance in some workloads with debug_io_direct=wal,
wal_sync_method=open_datasync compared to any other configuration. Except of
course that it makes walsenders more problematic, as they suddenly require
read IO. Thus having support for walsenders to send directly from wal buffers
would be beneficial, even without further AIO infrastructure.

I also think there are other quite desirable features that are made easier by
this patch. One of the primary problems with using synchronous replication is
the latency increase, obviously. We can't send out WAL before it has locally
been written out and flushed to disk. For some workloads, we could
substantially lower synchronous commit latency if we were able to send WAL to
remote nodes *before* WAL has been made durable locally, even if the receiving
systems wouldn't be allowed to write that data to disk yet: It takes less time
to send just "write LSN: %X/%X, flush LSN: %X/%X" than also having to send
all the not-yet-durable WAL.
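
To make the ordering argument concrete, here is a toy model, not a
measurement. It assumes the standby may only flush once it has been told the
primary's flush LSN, and every name and number in it is hypothetical (times
are in microseconds):

```python
def sync_commit_latency_us(local_flush, wal_send, lsn_update,
                           remote_flush, ack, early_send):
    """Toy model of synchronous commit latency; all inputs are made-up
    per-step durations in microseconds, not PostgreSQL measurements."""
    if early_send:
        # WAL is shipped while the local flush is still in progress, so the
        # two steps overlap. Afterwards only a small "write/flush LSN"
        # message has to go out before the standby may flush.
        return max(local_flush, wal_send) + lsn_update + remote_flush + ack
    # Today's ordering: WAL is sent only after it is durable locally.
    return local_flush + wal_send + remote_flush + ack


# Hypothetical step durations, chosen only for illustration.
steps = dict(local_flush=1000, wal_send=500, lsn_update=100,
             remote_flush=1000, ack=100)
print("send after flush :", sync_commit_latency_us(early_send=False, **steps))
print("send before flush:", sync_commit_latency_us(early_send=True, **steps))
```

In this model the saving grows with the time needed to ship the WAL, which
is why larger commits would benefit more.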

In many OLTP workloads there won't be WAL flushes between generating WAL for
DML and commit, which means that the amount of WAL that needs to be sent out
at commit can be of nontrivial size.

E.g. for pgbench, normally a transaction is ~550 bytes (fitting in a
single tcp/ip packet), but a pgbench transaction that needs to emit FPIs for
everything is a lot larger: ~45kB (not fitting in a single packet). Obviously
many real-world OLTP workloads actually do more writes than
pgbench. Making the commit latency of the latter be closer to the commit
latency of the former when using syncrep would obviously be great.
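
For a rough sense of the packet counts involved, the sketch below assumes a
TCP payload of 1460 bytes per segment (a 1500-byte Ethernet MTU minus 20
bytes each of IPv4 and TCP header, no TCP options) and takes ~45kB as 45,000
bytes. Both are assumptions for illustration, not figures from a capture:

```python
import math

# Assumed TCP payload per segment: 1500-byte MTU minus 40 bytes of headers.
ASSUMED_MSS = 1460


def segments_needed(wal_bytes, mss=ASSUMED_MSS):
    """Minimum number of TCP segments needed to carry wal_bytes of payload."""
    return math.ceil(wal_bytes / mss)


for label, size in [("no FPIs", 550), ("FPIs for everything", 45_000)]:
    print(f"{label}: {size} bytes -> {segments_needed(size)} segment(s)")
```

Under those assumptions the small transaction fits in one segment, while the
FPI-heavy one needs about 31.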

Of course this patch is just a relatively small step towards that: We'd also
need in-memory buffering on the receiving side, the replication protocol would
need to be improved, we'd likely need an option to explicitly opt into
receiving unflushed data. But it's still a pretty much required step.

Greetings,

Andres Freund
