Re: Improve WALRead() to suck data directly from WAL buffers when possible

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Improve WALRead() to suck data directly from WAL buffers when possible
Date: 2022-12-23 10:15:55
Message-ID: CALj2ACXUbvON86vgwTkum8ab3bf1=HkMxQ5hZJZS3ZcJn8NEXQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Dec 12, 2022 at 8:27 AM Kyotaro Horiguchi
<horikyota(dot)ntt(at)gmail(dot)com> wrote:
>

Thanks for providing thoughts.

> At Fri, 9 Dec 2022 14:33:39 +0530, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote in
> > The patch introduces concurrent readers for the WAL buffers, so far
> > only there are concurrent writers. In the patch, WALRead() takes just
> > one lock (WALBufMappingLock) in shared mode to enable concurrent
> > readers and does minimal things - checks if the requested WAL page is
> > present in WAL buffers, if so, copies the page and releases the lock.
> > I think taking just WALBufMappingLock is enough here as the concurrent
> > writers depend on it to initialize and replace a page in WAL buffers.
> >
> > I'll add this to the next commitfest.
> >
> > Thoughts?
>
> This adds copying of the whole page (at least) at every WAL *record*
> read,

In the worst case yes, but that may not always be true. On a typical
production server with decent write traffic, it happens that the
callers of WALRead() read a full WAL page of size XLOG_BLCKSZ bytes or
MAX_SEND_SIZE bytes.

> fighting all WAL writers by taking WALBufMappingLock on a very
> busy page while the copying. I'm a bit doubtful that it results in an
> overall improvement.

Well, the tests don't reflect that [1], I've run an insert work load
[2]. The WAL is being read from WAL buffers 99% of the time, which is
pretty cool. If you have any use-cases in mind, please share them
and/or feel free to run at your end.

> It seems to me almost all pread()s here happens
> on file buffer so it is unclear to me that copying a whole WAL page
> (then copying the target record again) wins over a pread() call that
> copies only the record to read.

That's not always guaranteed. Imagine a typical production server with
decent write traffic and heavy analytical queries (which fills OS page
cache with the table pages accessed for the queries), the WAL pread()
calls turn to IOPS. Despite the WAL being present in WAL buffers,
customers will be paying unnecessarily for these IOPS too. With the
patch, we are basically avoiding the pread() system calls which may
turn into IOPS on production servers (99% of the time for the insert
use case [1][2], 95% of the time for pgbench use case specified
upthread). With the patch, WAL buffers can act as L1 cache, if one
calls OS page cache as L2 cache (of course this illustration is not
related to the typical processor L1 and L2 ... caches).

> Do you have an actual number of how
> frequent WAL reads go to disk, or the actual number of performance
> gain or real I/O reduction this patch offers?

It might be a bit tough to generate such heavy traffic. An idea is to
ensure the WAL page/file goes out of the OS page cache before
WALRead() - these might help here - 0002 patch from
https://www.postgresql.org/message-id/CA%2BhUKGLmeyrDcUYAty90V_YTcoo5kAFfQjRQ-_1joS_%3DX7HztA%40mail.gmail.com
and tool https://github.com/klando/pgfincore.

> This patch copies the bleeding edge WAL page without recording the
> (next) insertion point nor checking whether all in-progress insertion
> behind the target LSN have finished. Thus the copied page may have
> holes. That being said, the sequential-reading nature and the fact
> that WAL buffers are zero-initialized may make it work for recovery,
> but I don't think this also works for replication.

WALRead() callers are smart enough to take the flushed bytes only.
Although they read the whole WAL page, they calculate the valid bytes.

> I remember that the one of the advantage of reading the on-memory WAL
> records is that that allows walsender to presend the unwritten
> records. So perhaps we should manage how far the buffer is filled with
> valid content (or how far we can presend) in this feature.

Yes, the non-flushed WAL can be read and sent across if one wishes to
to make replication faster and parallel flushing on primary and
standbys at the cost of a bit of extra crash handling, that's
mentioned here https://www.postgresql.org/message-id/CALj2ACXCSM%2BsTR%3D5NNRtmSQr3g1Vnr-yR91azzkZCaCJ7u4d4w%40mail.gmail.com.
However, this can be a separate discussion.

I also want to reiterate that the patch implemented a TODO item:

* XXX probably this should be improved to suck data directly from the
* WAL buffers when possible.
*/
bool
WALRead(XLogReaderState *state,

[1]
PATCHED:
1 1470.329907
2 1437.096329
4 2966.096948
8 5978.441040
16 11405.538255
32 22933.546058
64 43341.870038
128 73623.837068
256 104754.248661
512 115746.359530
768 106106.691455
1024 91900.079086
2048 84134.278589
4096 62580.875507

-[ RECORD 1 ]----------+-----------
application_name | assb1
sent_lsn | 0/1B8106A8
write_lsn | 0/1B8106A8
flush_lsn | 0/1B8106A8
replay_lsn | 0/1B8106A8
write_lag |
flush_lag |
replay_lag |
wal_read | 104
wal_read_bytes | 10733008
wal_read_time | 1.845
wal_read_buffers | 76662
wal_read_bytes_buffers | 383598808
wal_read_time_buffers | 205.418
sync_state | async

HEAD:
1 1312.054496
2 1449.429321
4 2717.496207
8 5913.361540
16 10762.978907
32 19653.449728
64 41086.124269
128 68548.061171
256 104468.415361
512 114328.943598
768 91751.279309
1024 96403.736757
2048 82155.140270
4096 66160.659511

-[ RECORD 1 ]----+-----------
application_name | assb1
sent_lsn | 0/1AB5BCB8
write_lsn | 0/1AB5BCB8
flush_lsn | 0/1AB5BCB8
replay_lsn | 0/1AB5BCB8
write_lag |
flush_lag |
replay_lag |
wal_read | 71967
wal_read_bytes | 381009080
wal_read_time | 243.616
sync_state | async

[2] Test details:
./configure --prefix=$PWD/inst/ CFLAGS="-O3" > install.log && make -j
8 install > install.log 2>&1 &
1 primary, 1 async standby
cd inst/bin
./pg_ctl -D data -l logfile stop
./pg_ctl -D assbdata -l logfile1 stop
rm -rf data assbdata
rm logfile logfile1
free -m
sudo su -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
free -m
./initdb -D data
rm -rf /home/ubuntu/archived_wal
mkdir /home/ubuntu/archived_wal
cat << EOF >> data/postgresql.conf
shared_buffers = '8GB'
wal_buffers = '1GB'
max_wal_size = '16GB'
max_connections = '5000'
archive_mode = 'on'
archive_command='cp %p /home/ubuntu/archived_wal/%f'
track_wal_io_timing = 'on'
EOF
./pg_ctl -D data -l logfile start
./psql -c "select
pg_create_physical_replication_slot('assb1_repl_slot', true, false)"
postgres
./pg_ctl -D data -l logfile restart
./pg_basebackup -D assbdata
./pg_ctl -D data -l logfile stop
cat << EOF >> assbdata/postgresql.conf
port=5433
primary_conninfo='host=localhost port=5432 dbname=postgres user=ubuntu
application_name=assb1'
primary_slot_name='assb1_repl_slot'
restore_command='cp /home/ubuntu/archived_wal/%f %p'
EOF
touch assbdata/standby.signal
./pg_ctl -D data -l logfile start
./pg_ctl -D assbdata -l logfile1 start
./pgbench -i -s 1 -d postgres
./psql -d postgres -c "ALTER TABLE pgbench_accounts DROP CONSTRAINT
pgbench_accounts_pkey;"
cat << EOF >> insert.sql
\set aid random(1, 10 * :scale)
\set delta random(1, 100000 * :scale)
INSERT INTO pgbench_accounts (aid, bid, abalance) VALUES (:aid, :aid, :delta);
EOF
ulimit -S -n 5000
for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n
"$c ";./pgbench -n -M prepared -U ubuntu postgres -f insert.sql -c$c
-j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Will Mortensen 2022-12-23 10:41:20 Exposing the lock manager's WaitForLockers() to SQL
Previous Message shveta malik 2022-12-23 09:45:54 Re: Force streaming every change in logical decoding