Re: incremental-checkopints

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Thomas wen <Thomas_valentine_365(at)outlook(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: incremental-checkopints
Date: 2023-07-27 10:50:00
Message-ID: e3d657e6-ec84-3089-e458-e7fdc1d71a80@enterprisedb.com
Lists: pgsql-hackers

On 7/26/23 21:53, Matthias van de Meent wrote:
> On Wed, 26 Jul 2023 at 20:58, Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>
>>
>>
>> On 7/26/23 15:16, Matthias van de Meent wrote:
>>> On Wed, 26 Jul 2023 at 14:41, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
>>>>
>>>> Hello
>>>>
>>>> On 2023-Jul-26, Thomas wen wrote:
>>>>
>>>>> Hi Hackers: I found this page:
>>>>> https://pgsql-hackers.postgresql.narkive.com/cMxBwq65/incremental-checkopints
>>>>> PostgreSQL has not implemented incremental checkpoints so far. When a
>>>>> checkpoint is triggered, the performance jitter of PostgreSQL is very
>>>>> noticeable. I think incremental checkpoints should be implemented as
>>>>> soon as possible
>>>>
>>>> I think my first question is why do you think that is necessary; there
>>>> are probably other tools to achieve better performance. For example,
>>>> you may want to try making checkpoint_completion_target closer to 1, and
>>>> the checkpoint interval longer (both checkpoint_timeout and
>>>> max_wal_size). Also, changing shared_buffers may improve things. You
>>>> can try adding more RAM to the machine.
>>>
>>> Even with all those tuning options, a significant portion of a
>>> checkpoint's IO (up to 50%) originates from FPIs in the WAL, which (in
>>> general) will most often appear at the start of each checkpoint due to
>>> each first update to a page after a checkpoint needing an FPI.
>>
>> Yeah, FPIs are certainly expensive and can represent a huge part of the
>> WAL produced. But how would incremental checkpoints make that step
>> unnecessary?
>>
>>> If instead we WAL-logged only the pages we are about to write to disk
>>> (like MySQL's double-write buffer, but in WAL instead of a separate
>>> cyclical buffer file), then a checkpoint_completion_target close to 1
>>> would probably solve the issue, but with "WAL-logged torn page
>>> protection at first update after checkpoint" we'll probably always
>>> have higher-than-average FPI load just after a new checkpoint.
>>>
>>
>> So essentially instead of WAL-logging the FPI on the first change, we'd
>> only do that later when actually writing-out the page (either during a
>> checkpoint or because of memory pressure)? How would you make sure
>> there's enough WAL space until the next checkpoint? I mean, FPIs are a
>> huge write amplification source ...
>
> You don't make sure that there's enough space for the modifications,
> but does it matter from a durability point of view? As long as the
> page isn't written to disk before the FPI, we can replay non-FPI (but
> fsynced) WAL on top of the old version of the page that you read from
> disk, instead of only trusting FPIs from WAL.
>

It does not matter from a durability point of view, I think. But I was
thinking more about how this affects the scheduling of checkpoints - how
would you know when the next checkpoint is likely to happen, when you
don't know how many FPIs you're going to write?

>> Imagine the system has max_wal_size set to 1GB, and does 1M updates
>> before writing 512MB of WAL and thus triggering a checkpoint. Now it
>> needs to write FPIs for 1M updates - easily 8GB of WAL, maybe more with
>> indexes. What then?
>
> Then you ignore the max_wal_size GUC as PostgreSQL so often already
> does. At least, it doesn't do what I expect it to do at face value -
> limit the size of the WAL directory to the given size.
>

I agree the soft-limit nature of max_wal_size (i.e. best effort, not a
strict limit) is not great. But just ignoring the limit altogether seems
like a step in the wrong direction - we should try not to exceed it.
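
To put rough numbers on the example quoted above, here is a
back-of-the-envelope sketch. It assumes the default 8kB block size, that
each of the 1M updates dirties a distinct page, and that FPIs are written
uncompressed and without hole removal. The helper name deferred_fpi_bytes
is made up for illustration.

```python
# Back-of-the-envelope estimate only. The block size and the "one distinct
# page per update" assumption are illustrative, not measurements.
BLOCK_SIZE = 8192             # default PostgreSQL block size, in bytes
MAX_WAL_SIZE = 1 * 1024**3    # max_wal_size = 1GB, as in the example


def deferred_fpi_bytes(dirty_pages, block_size=BLOCK_SIZE):
    """WAL volume if every dirty page needs one uncompressed FPI."""
    return dirty_pages * block_size


fpi_bytes = deferred_fpi_bytes(1_000_000)
print(f"deferred FPI volume: {fpi_bytes / 1024**3:.2f} GiB")
print(f"multiple of max_wal_size: {fpi_bytes / MAX_WAL_SIZE:.1f}x")
```

Even before counting indexes, the deferred FPIs alone come out at several
times max_wal_size, which is why simply ignoring the limit worries me.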

I wonder if we'd actually need / want to write the FPIs into WAL. AFAICS
we only need the FPI until the page is written and flushed - from that
moment on it shouldn't be possible to tear the page. So a small cyclic
buffer separate from WAL would be better ...
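
To illustrate the cyclic-buffer idea, here is a toy simulation in Python.
It is not PostgreSQL code, and all names (DoubleWriteRing, write_page,
recover) are hypothetical. A "page" is a payload plus a CRC32 trailer, the
data file is a dict, fsyncs are not modeled, and the sketch assumes a ring
slot is never reused before the page it protects has been flushed.

```python
import zlib


def make_page(payload: bytes) -> bytes:
    """Build a toy page: payload followed by a 4-byte CRC32 trailer."""
    return payload + zlib.crc32(payload).to_bytes(4, "little")


def is_valid(page: bytes) -> bool:
    """A page is intact if its trailer matches the CRC of its payload."""
    return zlib.crc32(page[:-4]) == int.from_bytes(page[-4:], "little")


class DoubleWriteRing:
    """Small cyclic buffer holding full page images, separate from WAL."""

    def __init__(self, nslots):
        self.slots = [None] * nslots
        self.head = 0  # next slot to overwrite

    def stage(self, page_no, page):
        # A real system would flush the ring here, before the in-place write.
        self.slots[self.head] = (page_no, page)
        self.head = (self.head + 1) % len(self.slots)

    def latest(self, page_no):
        # Scan from the newest slot backwards for this page's image.
        n = len(self.slots)
        for back in range(1, n + 1):
            slot = self.slots[(self.head - back) % n]
            if slot is not None and slot[0] == page_no:
                return slot[1]
        return None


def write_page(datafile, ring, page_no, page, tear_at=None):
    """Stage the image in the ring, then write in place (optionally torn)."""
    ring.stage(page_no, page)
    if tear_at is None:
        datafile[page_no] = page
    else:
        # Simulate a crash mid-write: only the first tear_at bytes landed.
        datafile[page_no] = page[:tear_at] + datafile[page_no][tear_at:]


def recover(datafile, ring):
    """Replace torn pages with the newest intact image from the ring."""
    repaired = []
    for page_no, page in list(datafile.items()):
        if not is_valid(page):
            image = ring.latest(page_no)
            if image is not None and is_valid(image):
                datafile[page_no] = image
                repaired.append(page_no)
    return repaired


datafile = {0: make_page(b"old-data"), 1: make_page(b"page-one")}
ring = DoubleWriteRing(nslots=4)
write_page(datafile, ring, 0, make_page(b"new-data"), tear_at=2)
torn_detected = not is_valid(datafile[0])
repaired = recover(datafile, ring)
print("torn page detected:", torn_detected, "repaired pages:", repaired)
```

Once the in-place write has been flushed, the slot is dead weight, so the
ring only needs to cover pages that are currently "in flight".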

> But more reasonably, you'd keep track of the count of modified pages
> that are yet to be fully WAL-logged, and take that into account as a
> debt that you have to the current WAL insert pointer when considering
> checkpoint distances and max_wal_size.
>

Yeah, that might work. It'd likely be just estimates, but probably good
enough for pacing the writes.
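
A minimal sketch of what such accounting could look like, in Python rather
than C to keep it short. Everything here is hypothetical (FpiDebtTracker,
effective_wal_distance, checkpoint_needed). It tracks a set of page ids
where a real implementation would more likely keep a counter, and it
assumes one uncompressed 8kB FPI per pending page, so the result is an
estimate at best.

```python
BLOCK_SIZE = 8192  # assumed default block size, in bytes


class FpiDebtTracker:
    """Tracks dirty pages whose full-page image is not yet WAL-logged."""

    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.pending = set()  # page ids that still "owe" an FPI

    def page_dirtied(self, page_id):
        # Re-dirtying a page that is already pending adds no new debt.
        self.pending.add(page_id)

    def fpi_logged(self, page_id):
        # Called once the FPI was written while flushing the page.
        self.pending.discard(page_id)

    def debt_bytes(self):
        """Estimated WAL still to be written for the pending FPIs."""
        return len(self.pending) * self.block_size


def effective_wal_distance(wal_bytes_since_checkpoint, tracker):
    """WAL already written plus the estimated FPI debt."""
    return wal_bytes_since_checkpoint + tracker.debt_bytes()


def checkpoint_needed(wal_bytes_since_checkpoint, tracker, trigger_bytes):
    """Decide whether to start a checkpoint, counting the debt as WAL."""
    distance = effective_wal_distance(wal_bytes_since_checkpoint, tracker)
    return distance >= trigger_bytes


tracker = FpiDebtTracker()
for block in range(60_000):
    tracker.page_dirtied(("some_table", block))
tracker.page_dirtied(("some_table", 0))  # same page again, no extra debt

wal_written = 100 * 1024**2   # 100MB of regular WAL so far
trigger = 512 * 1024**2       # checkpoint trigger distance from the example

without_debt = wal_written >= trigger
with_debt = checkpoint_needed(wal_written, tracker, trigger)
print("trigger without debt:", without_debt, "with debt:", with_debt)
```

With the debt included, the checkpoint starts well before the regular WAL
alone would have triggered it, which is the pacing effect we'd want.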

> ---
>
> The main issue that I see with "WAL-logging the FPI only when you
> write the dirty page to disk" is that dirty page flushing also happens
> with buffer eviction in ReadBuffer(). This change in behaviour would
> add a WAL insertion penalty to this write, and make it a very common
> occurrence that we'd have to write WAL + fsync the WAL when we have to
> write the dirty page. It would thus add significant latency to the
> dirty write mechanism, which is probably an unpopular change.

Yeah, it would certainly move the latencies from one place to another.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
