Re: Asynchronous and "direct" IO support for PostgreSQL.

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Asynchronous and "direct" IO support for PostgreSQL.
Date: 2021-08-09 21:27:18
Message-ID: CAAKRu_bdJQjAddtsWJrrcB8rx60wA6E8y5AcLNwQ9RLsvjqqXQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jul 28, 2021 at 1:37 PM Melanie Plageman
<melanieplageman(at)gmail(dot)com> wrote:
>
> On Tue, Feb 23, 2021 at 5:04 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> > ## AIO API overview
> >
> > The main steps to use AIO (without higher level helpers) are:
> >
> > 1) acquire an "unused" AIO: pgaio_io_get()
> >
> > 2) start some IO, this is done by functions like
> > pgaio_io_start_(read|write|fsync|flush_range)_(smgr|sb|raw|wal)
> >
> > The (read|write|fsync|flush_range) indicates the operation, whereas
> > (smgr|sb|raw|wal) determines how IO completions, errors, ... are handled.
> >
> > (see below for more details about this design choice - it might or might not
> > be right)
> >
> > 3) optionally: assign a backend-local completion callback to the IO
> > (pgaio_io_on_completion_local())
> >
> > 4) Step 2) alone does *not* cause the IO to be submitted to the kernel; it is
> > put on a per-backend list of pending IOs. The pending IOs can be explicitly
> > flushed with pgaio_submit_pending(), but will also be submitted if the
> > pending list gets to be too large, or if the current backend waits for the
> > IO.
> >
> > There are two main reasons not to submit the IO immediately:
> > - If adjacent, we can merge several IOs into one "kernel level" IO during
> > submission. Larger IOs are considerably more efficient.
> > - Several AIO APIs allow submitting a batch of IOs in one system call.
> >
> > 5) wait for the IO: pgaio_io_wait() waits for an IO "owned" by the current
> > backend. When other backends may need to wait for an IO to finish,
> > pgaio_io_ref() can put a reference to that AIO in shared memory (e.g. a
> > BufferDesc), which can be waited for using pgaio_io_wait_ref().
> >
> > 6) Process the results of the request. If a callback was registered in 3),
> > this isn't always necessary. The results of AIO can be accessed using
> > pgaio_io_result() which returns an integer where negative numbers are
> > -errno, and positive numbers are the [partial] success conditions
> > (e.g. potentially indicating a short read).
> >
> > 7) release ownership of the io (pgaio_io_release()) or reuse the IO for
> > another operation (pgaio_io_recycle())
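
The pending-list / merge-on-submission idea in 4) can be sketched roughly like
this (all struct and function names below are illustrative only, not the actual
pgaio API; the real submission path issues the merged IOs to the kernel, e.g.
via io_uring, rather than returning them):

```c
/* Illustrative sketch of a per-backend pending-IO list where adjacent
 * single-block reads get merged into larger "kernel level" IOs at
 * submission time. Names are hypothetical, not the pgaio API. */
#include <assert.h>

#define MAX_PENDING 32

typedef struct
{
    int start_block;
    int nblocks;
} PendingIO;

static PendingIO pending[MAX_PENDING];
static int npending = 0;

/* Queue a single-block read on the pending list (no syscall yet). */
static void
queue_read(int block)
{
    assert(npending < MAX_PENDING);
    pending[npending].start_block = block;
    pending[npending].nblocks = 1;
    npending++;
}

/* Merge adjacent pending IOs; returns how many "kernel level" IOs
 * would be issued in one batched submission. */
static int
submit_pending(PendingIO *out)
{
    int nout = 0;

    for (int i = 0; i < npending; i++)
    {
        if (nout > 0 &&
            out[nout - 1].start_block + out[nout - 1].nblocks == pending[i].start_block)
            out[nout - 1].nblocks++;    /* adjacent: extend previous IO */
        else
            out[nout++] = pending[i];   /* gap: start a new IO */
    }
    npending = 0;
    return nout;
}
```

So queuing reads of blocks 3, 4, 5 and 10 would submit just two kernel-level
IOs: one 3-block read starting at block 3, and one single-block read at 10.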
> >
> >
> > Most places that want to use AIO shouldn't themselves need to care about
> > managing the number of writes in flight, or the readahead distance. To help
> > with that there are two helper utilities, a "streaming read" and a "streaming
> > write".
> >
> > The "streaming read" helper uses a callback to determine which blocks to
> > prefetch - that allows doing readahead in a sequential fashion but, importantly,
> > also allows asynchronously "reading ahead" non-sequential blocks.
> >
> > E.g. for vacuum, lazy_scan_heap() has a callback that uses the visibility map
> > to figure out which block needs to be read next. Similarly lazy_vacuum_heap()
> > uses the tids in LVDeadTuples to figure out which blocks are going to be
> > needed. Here's the latter as an example:
> > https://github.com/anarazel/postgres/commit/a244baa36bfb252d451a017a273a6da1c09f15a3#diff-3198152613d9a28963266427b380e3d4fbbfabe96a221039c6b1f37bc575b965R1906
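>

The callback-driven shape of the streaming read helper might look something
like this rough sketch (all names are illustrative; the real helper starts
actual AIO reads rather than just collecting block numbers, and its window is
governed by the readahead distance):

```c
/* Illustrative sketch of a callback-driven streaming read: the helper
 * keeps up to `distance` blocks "in flight", asking a caller-supplied
 * callback which block to read next. Names are hypothetical. */
#include <assert.h>

#define INVALID_BLOCK (-1)
#define MAX_INFLIGHT 16

/* Returns the next block to read, or INVALID_BLOCK when done. */
typedef int (*NextBlockCB) (void *cb_arg);

typedef struct
{
    NextBlockCB next_block;
    void       *cb_arg;
    int         inflight[MAX_INFLIGHT];
    int         ninflight;
    int         distance;       /* readahead distance */
} StreamingRead;

/* Top up the in-flight window by asking the callback for more blocks.
 * The real helper would start an asynchronous read for each block. */
static void
sr_prefetch(StreamingRead *sr)
{
    while (sr->ninflight < sr->distance)
    {
        int blk = sr->next_block(sr->cb_arg);

        if (blk == INVALID_BLOCK)
            break;
        sr->inflight[sr->ninflight++] = blk;
    }
}
```

The point of the callback is that the block sequence need not be sequential:
for vacuum it can come from the visibility map or from dead-tuple TIDs, and the
helper prefetches it just the same.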
> >
>
> Attached is a patch on top of the AIO branch which does bitmapheapscan
> prefetching using the PgStreamingRead helper already used by sequential
> scan and vacuum on the AIO branch.
>
> The prefetch iterator is removed and the main iterator in the
> BitmapHeapScanState node is now used by the PgStreamingRead helper.
>
...
>
> Oh, and I haven't done testing to see how effective the prefetching is
> -- that is a larger project that I have yet to tackle.
>

I have done some testing on how effective it is now.

I've also updated the original patch to count the first page (in the
lossy/exact page counts mentioned down-thread) as well as to remove
unused prefetch fields and comments.
I've also included a second patch which adds IO wait time information to
EXPLAIN output when used like:
EXPLAIN (buffers, analyze) SELECT ...

The same commit also introduces a temporary dev GUC
io_bitmap_prefetch_depth which I am using to experiment with the
prefetch window size.

I wanted to share some results from changing the prefetch window to
demonstrate how prefetching is working.

The short version of my results is that the prefetching works:

- with the prefetch window set to 1, the IO wait time is 1550 ms
- with the prefetch window set to 128, the IO wait time is 0.18 ms

DDL and repro details below:

On Andres' AIO branch [1], with my bitmap heapscan prefetching patch set
applied, built with the following flags:
-O2 -fno-omit-frame-pointer --with-liburing

And these non-default PostgreSQL settings:
io_data_direct=1
io_data_force_async=off
io_method=io_uring
log_min_duration_statement=0
log_duration=on
set track_io_timing to on;

set max_parallel_workers_per_gather to 0;
set enable_seqscan to off;
set enable_indexscan to off;
set enable_bitmapscan to on;

set effective_io_concurrency to 128;
set io_bitmap_prefetch_depth to 128;

Using this DDL:

drop table if exists bar;
create table bar(a int, b text, c text, d int);
create index bar_idx on bar(a);
insert into bar select i, md5(i::text), 'abcdefghijklmnopqrstuvwxyz',
i from generate_series(1,1000)i;
insert into bar select i%3, md5(i::text),
'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,1000)i;
insert into bar select i, md5(i::text), 'abcdefghijklmnopqrstuvwxyz',
i from generate_series(1,200)i;
insert into bar select i%100, md5(i::text),
'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,10000000)i;
insert into bar select i%2000, md5(i::text),
'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,10000000)i;
insert into bar select i%10, md5(i::text),
'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,10000000)i;
insert into bar select i, md5(i::text), 'abcdefghijklmnopqrstuvwxyz',
i from generate_series(1,10000000)i;
insert into bar select i%100, md5(i::text),
'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,10000000)i;
insert into bar select i, md5(i::text), 'abcdefghijklmnopqrstuvwxyz',
i from generate_series(1,2000)i;
insert into bar select i%10, md5(i::text),
'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,2000)i;
analyze;

And this query:

select * from bar where a > 100 offset 10000000000000;

with the prefetch window set to 1,
the query execution time is:
5496.129 ms

and IO wait time is:
1550.915 ms

mplageman=# explain (buffers, analyze, timing off) select * from bar
where a > 100 offset 10000000000000;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Limit (cost=1462959.87..1462959.87 rows=1 width=68) (actual rows=0 loops=1)
Buffers: shared hit=1 read=280571
I/O Timings: read=1315.845 wait=1550.915
-> Bitmap Heap Scan on bar (cost=240521.25..1462959.87
rows=19270298 width=68) (actual rows=19497800 loops=1)
Recheck Cond: (a > 100)
Rows Removed by Index Recheck: 400281
Heap Blocks: exact=47915 lossy=197741
Buffers: shared hit=1 read=280571
I/O Timings: read=1315.845 wait=1550.915
-> Bitmap Index Scan on bar_idx (cost=0.00..235703.67
rows=19270298 width=0) (actual rows=19497800 loops=1)
Index Cond: (a > 100)
Buffers: shared hit=1 read=34915
I/O Timings: read=1315.845
Planning:
Buffers: shared hit=96 read=30
I/O Timings: read=3.399
Planning Time: 4.378 ms
Execution Time: 5473.404 ms
(18 rows)

with the prefetch window set to 128,
the query execution time is:
3222 ms

and IO wait time is:
0.178 ms

mplageman=# explain (buffers, analyze, timing off) select * from bar
where a > 100 offset 10000000000000;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Limit (cost=1462959.87..1462959.87 rows=1 width=68) (actual rows=0 loops=1)
Buffers: shared hit=1 read=280571
I/O Timings: read=1339.795 wait=0.178
-> Bitmap Heap Scan on bar (cost=240521.25..1462959.87
rows=19270298 width=68) (actual rows=19497800 loops=1)
Recheck Cond: (a > 100)
Rows Removed by Index Recheck: 400281
Heap Blocks: exact=47915 lossy=197741
Buffers: shared hit=1 read=280571
I/O Timings: read=1339.795 wait=0.178
-> Bitmap Index Scan on bar_idx (cost=0.00..235703.67
rows=19270298 width=0) (actual rows=19497800 loops=1)
Index Cond: (a > 100)
Buffers: shared hit=1 read=34915
I/O Timings: read=1339.795
Planning:
Buffers: shared hit=96 read=30
I/O Timings: read=3.488
Planning Time: 4.279 ms
Execution Time: 3434.522 ms
(18 rows)

- Melanie

[1] https://github.com/anarazel/postgres/tree/aio

Attachment Content-Type Size
v2-0002-Add-IO-wait-time-stat-and-add-guc-for-BHS-prefetc.patch text/x-patch 11.6 KB
v2-0001-Use-pgsr-for-AIO-bitmapheapscan.patch text/x-patch 44.6 KB
