| From: | Andres Freund <andres(at)anarazel(dot)de> | 
|---|---|
| To: | wenhui qiu <qiuwenhuifx(at)gmail(dot)com> | 
| Cc: | Tomas Vondra <tomas(at)vondra(dot)me>, Tony Wayne <anonymouslydark3(at)gmail(dot)com>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> | 
| Subject: | Re: bgwrite process is too lazy | 
| Date: | 2024-10-04 17:49:23 | 
| Message-ID: | cixso3buqeddrsqh3cf4svus3dakho2jwvohstwz64aqttg647@pqd4kwtdcso7 | 
| Lists: | pgsql-hackers | 
Hi,
On 2024-10-04 09:31:45 +0800, wenhui qiu wrote:
> > It's implied, but to make it more explicit: One big efficiency advantage of
> > writes by checkpointer is that they are sorted and can often be combined into
> > larger writes. That's often a lot more efficient: For network attached storage
> > it saves you iops, for local SSDs it's much friendlier to wear leveling.
>
> Thank you for the explanation. I think bgwriter can also merge I/O: it writes
> asynchronously to the file system cache, and the OS schedules the writes.
Because bgwriter writes are just ordered by their buffer id (further made less
sequential due to only writing out not-recently-used buffers), they are often
effectively random. The OS can't do much about that.
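To illustrate why the ordering matters, here is a toy model (not PostgreSQL code). It assumes that only writes to physically adjacent blocks issued back to back can be merged into one larger I/O, which is a simplification of what a real kernel does. The function name `count_write_runs` and the block numbers are hypothetical.

```python
import random


def count_write_runs(blocks):
    """Count I/O requests needed if consecutive, adjacent blocks are merged.

    A new request starts whenever a block is not the direct successor
    of the block written just before it.
    """
    runs = 0
    prev = None
    for block in blocks:
        if prev is None or block != prev + 1:
            runs += 1
        prev = block
    return runs


# Hypothetical workload: 64 dirty buffers whose on-disk block numbers are
# unrelated to their buffer ids, so "buffer id order" is an arbitrary order.
rng = random.Random(0)
dirty_blocks = rng.sample(range(256), 64)

unsorted_ios = count_write_runs(dirty_blocks)        # bgwriter-like order
sorted_ios = count_write_runs(sorted(dirty_blocks))  # checkpointer-like order

print("requests in buffer id order:", unsorted_ios)
print("requests in sorted order:   ", sorted_ios)
```

Sorting never increases the request count in this model, because every mergeable run in any ordering is a subset of a contiguous group that the sorted order writes as a single run.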
> > Another aspect is that checkpointer's writes are much easier to pace over time
> > than e.g. bgwriters, because bgwriter is triggered by a fairly short term
> > signal.  Eventually we'll want to combine writes by bgwriter too, but that's
> > always going to be more expensive than doing it in a large batched fashion
> > like checkpointer does.
>
> > I think we could improve checkpointer's pacing further, fwiw, by taking into
> > account that the WAL volume at the start of a spread-out checkpoint typically
> > is bigger than at the end.
>
> I'm also very keen to improve checkpoints. Whenever I do a stress test,
> bgwriter does not write dirty pages when the data set is smaller than
> shared_buffers.
It *SHOULD NOT* do anything in that situation. There's absolutely nothing to
be gained by bgwriter writing in that case.
> Before the checkpoint, TPS in the stress test was stable and at its highest
> for the entire test. Other databases flush dirty pages at a certain
> frequency, at intervals, and at dirty-page watermarks. They see a much
> smaller performance impact when checkpoints occur.
I doubt that slowdown is caused by bgwriter not being active enough. I suspect
what you're seeing is one or more of:
a) The overhead of doing full page writes (due to increasing the WAL
   volume). You could verify whether that's the case by turning
   full_page_writes off (but note that that's not generally safe!) or by
   checking whether the overhead shrinks if you set wal_compression=zstd or
   wal_compression=lz4 (don't use pglz, it's too slow).
b) The overhead of renaming WAL segments during recycling. You could see if
   this is related by specifying --wal-segsize 512 or such during initdb.
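The effect in (a) can be sketched with a toy model (not PostgreSQL code, and not a measurement). With full_page_writes on, the first change to a page after a checkpoint logs the whole page, so WAL volume is highest right after the checkpoint starts and falls as more pages have already been touched. The record size, page count, and update rate below are assumptions picked only for illustration, and `wal_bytes_per_interval` is a hypothetical helper.

```python
import random

PAGE_SIZE = 8192    # PostgreSQL's default block size in bytes
RECORD_SIZE = 100   # assumed size of an ordinary WAL record (hypothetical)


def wal_bytes_per_interval(n_pages, updates_per_interval, n_intervals,
                           full_page_writes=True, seed=0):
    """Return the modeled WAL bytes written in each interval after a checkpoint."""
    rng = random.Random(seed)
    touched = set()   # pages already modified since the checkpoint
    volumes = []
    for _ in range(n_intervals):
        wal = 0
        for _ in range(updates_per_interval):
            page = rng.randrange(n_pages)
            wal += RECORD_SIZE
            if full_page_writes and page not in touched:
                # First change since the checkpoint: log the full page image.
                wal += PAGE_SIZE
            touched.add(page)
        volumes.append(wal)
    return volumes


with_fpw = wal_bytes_per_interval(1000, 500, 10)
without_fpw = wal_bytes_per_interval(1000, 500, 10, full_page_writes=False)

print("first/last interval with full page writes:", with_fpw[0], with_fpw[-1])
print("every interval without full page writes:  ", without_fpw[0])
```

The model ignores WAL compression and record headers; it only shows the shape of the curve, which is also why WAL volume tends to be larger at the start of a spread-out checkpoint than at the end.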
Greetings,
Andres