RE: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

From: Sait Talha Nisanci <Sait(dot)Nisanci(at)microsoft(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: RE: [EXTERNAL] Re: WIP: WAL prefetch (another approach)
Date: 2020-09-01 06:14:52
Message-ID: AM0PR83MB0241FC5D9130502330C2A897912E0@AM0PR83MB0241.EURPRD83.prod.outlook.com
Lists: pgsql-hackers

Hi,

The WAL size for "SSD, full_page_writes=on" was 36GB. I currently don't have the exact size for the other rows because my test VMs got auto-deleted. I can possibly redo the benchmark to get pg_waldump stats for each row.
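
If I do rerun it, I'd expect to collect those stats with something along these lines (the segment names below are just placeholders, not the actual files from the runs):

    pg_waldump --stats=record \
        $PGDATA/pg_wal/000000010000000000000001 \
        $PGDATA/pg_wal/0000000100000000000000FF

The "FPI size" column in that output should make it easy to see how much of each run's WAL was full-page images.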

Best,
Talha.

-----Original Message-----
From: Stephen Frost <sfrost(at)snowman(dot)net>
Sent: Sunday, August 30, 2020 3:24 PM
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>; Andres Freund <andres(at)anarazel(dot)de>; Sait Talha Nisanci <Sait(dot)Nisanci(at)microsoft(dot)com>; Thomas Munro <thomas(dot)munro(at)gmail(dot)com>; Dmitry Dolgov <9erthalion6(at)gmail(dot)com>; David Steele <david(at)pgmasters(dot)net>; Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>; pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

Greetings,

* Tomas Vondra (tomas(dot)vondra(at)2ndquadrant(dot)com) wrote:
> On Thu, Aug 27, 2020 at 04:28:54PM -0400, Stephen Frost wrote:
> >* Robert Haas (robertmhaas(at)gmail(dot)com) wrote:
> >>On Thu, Aug 27, 2020 at 2:51 PM Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> >>> > Hm? At least earlier versions didn't do prefetching for records with an fpw, and only for subsequent records affecting the same page, or if it's not in s_b anymore.
> >>>
> >>> We don't actually read the page when we're replaying an FPW though..?
> >>> If we don't read it, and we entirely write the page from the FPW,
> >>> how is pre-fetching helping..?
> >>
> >>Suppose there is a checkpoint. Then we replay a record with an FPW,
> >>pre-fetching nothing. Then the buffer gets evicted from
> >>shared_buffers, and maybe the OS cache too. Then, before the next
> >>checkpoint, we again replay a record for the same page. At this
> >>point, pre-fetching should be helpful.
> >
> >Sure- but if we're talking about 25GB of WAL, on a server that's got
> >32GB, then why would those pages end up getting evicted from memory
> >entirely? Particularly, enough of them to end up with such a huge
> >difference in replay time..
> >
> >I do agree that if we've got more outstanding WAL between checkpoints
> >than the system's got memory then that certainly changes things, but
> >that wasn't what I understood the case to be here.
>
> I don't think it's very clear how much WAL there actually was in each
> case - the message only said there was more than 25GB, but who knows
> how many checkpoints that covers? In the cases with FPW=on this may
> easily be much less than one checkpoint (because with a 45GB scale, an
> update to every page will log 45GB of full-page images). It'd be
> interesting to see some stats from pg_waldump etc.

Also in the message was this:

--
In order to avoid checkpoints during benchmark, max_wal_size(200GB) and
checkpoint_timeout(200 mins) are set to a high value.
--

Which led me to suspect, at least, that this was much less than a checkpoint, as you suggest. Also, given that the comment was 'run is cancelled when there is a reasonable amount of WAL (>25GB)', it seems likely that it's at least *around* there.
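
For clarity, my understanding is that this amounts to just the following in postgresql.conf (going only by the quoted description):

    max_wal_size = 200GB
    checkpoint_timeout = 200min

which, if so, would make it quite plausible for the entire >25GB run to fit inside a single checkpoint cycle.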

Ultimately though, there just isn't enough information provided to really be able to understand what's going on. I agree, pg_waldump stats would be useful.

> >>Admittedly, I don't quite understand whether that is what is
> >>happening in this test case, or why SDD vs. HDD should make any
> >>difference. But there doesn't seem to be any reason why it doesn't
> >>make sense in theory.
> >
> >I agree that this could be a reason, but it doesn't seem to quite fit
> >in this particular case given the amount of memory and WAL. I'm
> >suspecting that it's something else and I'd very much like to know if
> >it's a general "this applies to all (most? a lot of?) SSDs because
> >the hardware has a larger than 8KB page size and therefore the kernel
> >has to read it", or if it's something odd about this particular
> >system and doesn't apply generally.
>
> Not sure. I doubt it has anything to do with the hardware page size,
> that's mostly transparent to the kernel anyway. But it might be that
> the prefetching on a particular SSD has more overhead than what it saves.

Right - I wouldn't have thought the hardware page size would matter either, but it's entirely possible that assumption is wrong and that it does matter for some reason - perhaps with just some SSDs, or maybe with a lot of them, or maybe something else entirely is going on. About all I can say at the moment is that I'm very interested in ways to make WAL replay go faster, and it'd be great to get more information about what's going on here to see if there's something we can do to generally improve it.

Thanks,

Stephen
