Re: Asynchronous and "direct" IO support for PostgreSQL.

From: Andres Freund <andres(at)anarazel(dot)de>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Asynchronous and "direct" IO support for PostgreSQL.
Date: 2021-02-23 21:05:45
Message-ID: 20210223210545.67wihyu47z3nyii5@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2021-02-23 14:58:32 -0500, Greg Stark wrote:
> So firstly this is all just awesome work and I have questions but I
> don't want them to come across in any way as criticism or as a demand
> for more work.

I posted it to get argued with ;).

> The callbacks make me curious about two questions:
>
> 1) Is there a chance that a backend issues i/o, the i/o completes in
> some other backend and by the time this backend gets around to looking
> at the buffer it's already been overwritten again? Do we have to
> initiate I/O again or have you found a way to arrange that this
> backend has the buffer pinned from the time the i/o starts even though
> it doesn't handle the completion?

The initiator of the IO can just keep a pin for the buffer to prevent
that.

There's a lot of complexity around how to handle pinning and locking
around asynchronous buffer IO. I plan to send a separate email with
more details.

In short: I made it so that shared buffer IO holds a separate refcount
for the duration of the IO - that way the issuer can release its own pin
without causing a problem (consider e.g. an error while an IO is in
flight). The pin held by the IO gets released in the completion
callback. There's similar trickery with locks - but more about that
later.

> 2) Have you made (or considered making) things like sequential scans
> (or more likely bitmap index scans) asynchronous at a higher level?
> That is, issue a bunch of asynchronous i/o and then handle the pages
> and return the tuples as the pages arrive. Since sequential scans and
> bitmap scans don't guarantee to read the pages in order they're
> generally free to return tuples from any page in any order. I'm not
> sure how much of a win that would actually be since all the same i/o
> would be getting executed and the savings in shared buffers would be
> small but if there are mostly hot pages you could imagine interleaving
> a lot of in-memory pages with the few i/os instead of sitting idle
> waiting for the async i/o to return.

I have not. Mostly because it'd basically break the entire regression
test suite. And probably a lot of user expectations (yes,
synchronize_seqscans exists, but it pretty rarely triggers).

I'm not sure how big the win would be - the readahead for heapam.c that
is in the patch set tries to keep ahead of the "current scan position"
by a certain amount - so it'll issue the reads for the "missing" pages
before they're needed. Hopefully early enough to avoid unnecessary
stalls. But I'm pretty sure there'll be cases where that'd require a
prohibitively long "readahead distance".

I think there's a lot of interesting things along these lines that we
could tackle, but since they involve changing results and/or larger
changes to avoid those (e.g. detecting when sequential scan order isn't
visible to the user) I think it'd make sense to separate them from the
aio patchset itself.

If we had the infrastructure to detect whether seqscan order matters, we
could also switch the tuple-iteration order to "backwards" in
heapgetpage() - right now iterating forward in "itemid order" causes a
lot of cache misses, because the hardware prefetcher doesn't predict that
the tuples are laid out in "decreasing pointer order".

> > ## Stats
> >
> > There are two new views: pg_stat_aios showing AIOs that are currently
> > in-progress, pg_stat_aio_backends showing per-backend statistics about AIO.
>
> This is impressive. How easy is it to correlate with system aio stats?

Could you say a bit more about what you are trying to correlate?

Here's some example IOs from pg_stat_aios.

┌─[ RECORD 1 ]─┬───────────────────────────────────────────────────────────────────────────────────────────────────┐
│ backend_type │ client backend │
│ id │ 98 │
│ gen │ 13736 │
│ op │ write │
│ scb │ sb │
│ flags │ PGAIOIP_IN_PROGRESS | PGAIOIP_INFLIGHT │
│ ring │ 5 │
│ owner_pid │ 1866051 │
│ merge_with │ (null) │
│ result │ 0 │
│ desc │ fd: 38, offset: 329588736, nbytes: 8192, already_done: 0, release_lock: 1, buffid: 238576 │

├─[ RECORD 24 ]┼───────────────────────────────────────────────────────────────────────────────────────────────┤
│ backend_type │ checkpointer │
│ id │ 1501 │
│ gen │ 15029 │
│ op │ write │
│ scb │ sb │
│ flags │ PGAIOIP_IN_PROGRESS | PGAIOIP_INFLIGHT │
│ ring │ 3 │
│ owner_pid │ 1865275 │
│ merge_with │ 1288 │
│ result │ 0 │
│ desc │ fd: 24, offset: 105136128, nbytes: 8192, already_done: 0, release_lock: 1, buffid: 202705 │

├─[ RECORD 31 ]┼───────────────────────────────────────────────────────────────────────────────────────────────────┤
│ backend_type │ walwriter │
│ id │ 90 │
│ gen │ 26498 │
│ op │ write │
│ scb │ wal │
│ flags │ PGAIOIP_IN_PROGRESS | PGAIOIP_INFLIGHT │
│ ring │ 5 │
│ owner_pid │ 1865281 │
│ merge_with │ 181 │
│ result │ 0 │
│ desc │ write_no: 17, fd: 12, offset: 6307840, nbytes: 1048576, already_done: 0, bufdata: 0x7f94dd670000 │
└──────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────┘

And the per-backend AIO view shows information like this (too wide to
display all cols):

SELECT
    pid, last_context, backend_type,
    executed_total_count, issued_total_count, submissions_total_count, inflight_count
FROM pg_stat_aio_backends sab JOIN pg_stat_activity USING (pid);
┌─────────┬──────────────┬──────────────────────────────┬──────────────────────┬────────────────────┬─────────────────────────┬────────────────┐
│ pid │ last_context │ backend_type │ executed_total_count │ issued_total_count │ submissions_total_count │ inflight_count │
├─────────┼──────────────┼──────────────────────────────┼──────────────────────┼────────────────────┼─────────────────────────┼────────────────┤
│ 1865291 │ 3 │ logical replication launcher │ 0 │ 0 │ 0 │ 0 │
│ 1865296 │ 0 │ client backend │ 85 │ 85 │ 85 │ 0 │
│ 1865341 │ 7 │ client backend │ 9574416 │ 2905321 │ 345642 │ 0 │
...
│ 1873501 │ 3 │ client backend │ 3565 │ 3565 │ 467 │ 10 │
...
│ 1865277 │ 0 │ background writer │ 695402 │ 575906 │ 13513 │ 0 │
│ 1865275 │ 3 │ checkpointer │ 4664110 │ 3488530 │ 1399896 │ 0 │
│ 1865281 │ 3 │ walwriter │ 77203 │ 7759 │ 7747 │ 3 │
└─────────┴──────────────┴──────────────────────────────┴──────────────────────┴────────────────────┴─────────────────────────┴────────────────┘

It's not super obvious from the output above, but executed_total_count /
issued_total_count shows the average rate at which IOs have been merged,
and issued_total_count / submissions_total_count shows how many (already
merged) IOs were submitted together in one "submission batch" (for
io_method=io_uring, one io_uring_enter() call).
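
E.g. something like this - just a sketch using only the columns from the
view above, with NULLIF() guarding against division by zero for idle
backends - computes those ratios per backend:

SELECT pid, backend_type,
       round(executed_total_count::numeric / NULLIF(issued_total_count, 0), 2) AS avg_merge_ratio,
       round(issued_total_count::numeric / NULLIF(submissions_total_count, 0), 2) AS avg_batch_size
FROM pg_stat_aio_backends
JOIN pg_stat_activity USING (pid);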

So pg_stat_aios should make it possible to enrich some system stats -
e.g. by being able to split out WAL writes from data file writes. And -
even though the current callbacks for that aren't great - it should even
allow splitting the IO by relfilenode etc.
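
For instance, grouping the in-flight IOs by their shared callback type
('sb' for shared buffers, 'wal' for WAL, as visible in the records
above) - again only a sketch over the columns shown:

SELECT scb, op, count(*) AS inflight
FROM pg_stat_aios
GROUP BY scb, op
ORDER BY inflight DESC;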

I assume pg_stat_aio_backends can also be helpful, e.g. for seeing which
backends currently have the deep IO queues that cause latency delays in
other backends, and which backends do a lot of sequential IO (high
executed_total_count / issued_total_count) and which only random...
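
Along those lines - same caveats as above - ranking backends by queue
depth and merge ratio:

SELECT pid, backend_type, inflight_count,
       round(executed_total_count::numeric / NULLIF(issued_total_count, 0), 2) AS merge_ratio
FROM pg_stat_aio_backends
JOIN pg_stat_activity USING (pid)
ORDER BY inflight_count DESC;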

Greetings,

Andres Freund
