| From: | Nathan Bossart <nathandbossart(at)gmail(dot)com> | 
|---|---|
| To: | Andres Freund <andres(at)anarazel(dot)de> | 
| Cc: | Andy Fan <zhihuifan1213(at)163(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Maxim Orlov <orlovmg(at)gmail(dot)com>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, "Bossart, Nathan" <bossartn(at)amazon(dot)com>, Maxim Orlov <m(dot)orlov(at)postgrespro(dot)ru>, pgsql-hackers(at)lists(dot)postgresql(dot)org | 
| Subject: | Re: Pre-allocating WAL files | 
| Date: | 2025-01-22 15:50:59 | 
| Message-ID: | Z5ET4xCZJEIx3bKK@nathan | 
| Lists: | pgsql-hackers | 
On Tue, Jan 21, 2025 at 11:23:06AM -0500, Andres Freund wrote:
> On 2025-01-21 10:13:14 -0600, Nathan Bossart wrote:
>> On Tue, Jan 21, 2025 at 09:52:51AM -0600, Nathan Bossart wrote:
>> > On Tue, Jan 21, 2025 at 03:31:27AM +0000, Andy Fan wrote:
>> >> 3. What is the purpose of the preallocated_segments directory? What I have
>> >> in mind is that we just preallocate the normal filename so that XLogWrite
>> >> could open it directly. This is the same as what wal_recycle does, and we
>> >> can reuse the same strategy to clean them up if they are no longer needed.
>> > 
>> > The purpose is to limit the use of pre-allocated segments to only
>> > situations where WAL recycling is not sufficient.  Basically, if writing a
>> > record would require a new segment to be created, we can quickly pull a
>> > pre-allocated one instead of creating it ourselves.  Besides simplifying
>> > matters, this prevents a lot of unnecessary pre-allocation, since many
>> > workloads will almost never need anything beyond the recycled segments.
> 
> I don't really understand that argument - we should be able to predict rather
> precisely whether we need to preallocate or not. We have the recent WAL "fill
> rate", we know the end of the WAL and we can easily track how far ahead of the
> current point we have allocated.  Why preallocate when we have a large reserve
> of "future" segments? Why preallocate in a separate directory when we have no
> future segments?
If we can indeed reliably predict whether we need pre-allocation, then
sure, let's just create future segments directly in pg_wal.  I'm not sure
we could reliably predict whether WAL will be recycled in time, so we might
pre-allocate a bit more than necessary, but that's not too terrible.  My
"pooling" approach was intended to keep the pre-allocation to a minimum
(IME you really only need a couple at any given time) and to avoid the
guesswork involved in predicting.
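
As a rough illustration of the kind of prediction being discussed (recent
fill rate versus how far ahead we have already allocated), a sketch might
look like the following.  Everything here is hypothetical: the function
name, the parameters, and the idea of a fixed lookahead window are
assumptions for illustration, not anything from PostgreSQL or a posted patch.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not PostgreSQL code): estimate how many segments to
 * pre-allocate, given the recent WAL fill rate and the number of "future"
 * segments (recycled or already pre-allocated) that are available.
 *
 * Assumes future_segs_available >= 0 and max_prealloc >= 0.
 */
int
segments_to_preallocate(uint64_t fill_rate_bytes_per_sec,
                        uint64_t lookahead_secs,
                        uint64_t seg_size_bytes,
                        int future_segs_available,
                        int max_prealloc)
{
    /* WAL we expect to generate within the lookahead window */
    uint64_t    bytes_needed = fill_rate_bytes_per_sec * lookahead_secs;

    /* Round up to whole segments */
    uint64_t    segs_needed =
        (bytes_needed + seg_size_bytes - 1) / seg_size_bytes;
    uint64_t    deficit;

    /* A large enough reserve of future segments means nothing to do */
    if (segs_needed <= (uint64_t) future_segs_available)
        return 0;

    /* Otherwise cover the shortfall, capped to limit over-allocation */
    deficit = segs_needed - (uint64_t) future_segs_available;
    return deficit > (uint64_t) max_prealloc ? max_prealloc : (int) deficit;
}
```

The cap is where the "might pre-allocate a bit more than necessary" concern
shows up: the estimate cannot know whether recycling will deliver segments
in time, so it can only bound how far off it is allowed to be.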
>> That being said, it would be nice to avoid the fsync() overhead to move a
>> pre-allocated WAL into place.  My first instinct is that would be
>> substantially more complicated and may not actually improve matters all
>> that much, but I agree that it's worth exploring.
> 
> FWIW, I've seen the fsyncs around recycling being a rather substantial
> bottleneck. To the point of the main benefit of larger segments being the
> reduction in number of fsyncs at the end of a checkpoint.  I think we should
> be able to make the fsyncs a lot more efficient by batching them: first rename
> a bunch of files, then fsync them and the directory. The current pattern
> basically requires a separate filesystem journal flush for each WAL segment.
+1, these kinds of fsync() patterns should be fixed.
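
To make the batching idea concrete, here is a minimal sketch using plain
POSIX calls: rename everything first, then fsync each file, then fsync the
directory once.  The function names and the from/to array interface are
made up for illustration.  This is not how PostgreSQL's existing code is
structured, and it leaves out error reporting and any fsync of the source
file before the rename.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Open a path, fsync it, and close it.  Returns 0 on success, -1 on error. */
static int
fsync_path(const char *path, int flags)
{
    int         fd = open(path, flags);
    int         rc;
    int         saved_errno;

    if (fd < 0)
        return -1;
    rc = fsync(fd);
    saved_errno = errno;        /* keep fsync's errno across close() */
    close(fd);
    errno = saved_errno;
    return rc;
}

/*
 * Hypothetical batched install: move n files into place, paying for one
 * directory fsync per batch instead of one per file.
 */
int
install_segments_batched(const char *dir,
                         const char *const *from,
                         const char *const *to,
                         int n)
{
    int         i;

    /* Pass 1: rename the whole batch */
    for (i = 0; i < n; i++)
        if (rename(from[i], to[i]) != 0)
            return -1;

    /* Pass 2: fsync each file under its new name */
    for (i = 0; i < n; i++)
        if (fsync_path(to[i], O_RDWR) != 0)
            return -1;

    /* Pass 3: a single directory fsync persists all of the renames */
    return fsync_path(dir, O_RDONLY);
}
```

How much this saves depends on the filesystem; the assumption is that one
directory fsync for the batch costs far fewer journal flushes than one per
segment.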
-- 
nathan