| From: | pg254kl(at)georgiou(dot)vip |
|---|---|
| To: | MentionTheElephant <MentionTheElephant(at)gmail(dot)com> |
| Cc: | pgsql-performance(at)lists(dot)postgresql(dot)org |
| Subject: | Re: Seeking guidance on extremely slow pg_restore despite strong I/O performance |
| Date: | 2025-12-05 20:27:38 |
| Message-ID: | 176496646088.6.4993161944894462260.1049766710@georgiou.vip |
| Lists: | pgsql-performance |
If the dump was taken with pg_dump -Fd and pg_restore -j has no effect on restore time, that’s a good clue.
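That is, timing runs along these lines (database name and dump paths are placeholders) and seeing roughly the same wall-clock time for -j 1 and -j 12:

    pg_restore -j 12 -d mydb /path/to/dumpdir   # directory-format dump
    pg_restore -j 12 -d mydb db.dump            # custom-format dump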
You can start by testing deferred checkpoints: set max_wal_size = 1TB and checkpoint_timeout = 10h, and see how this affects the pg_restore (it should then be limited by WAL write throughput). Perhaps also increase wal_buffers to 128MB. The idea is to identify (by elimination) the write choke point before starting to tune for it.
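Concretely, something like this in postgresql.conf for the duration of the experiment (the values are deliberately extreme, not a production recommendation):

    max_wal_size = 1TB          # effectively disable size-triggered checkpoints
    checkpoint_timeout = 10h    # effectively disable time-triggered checkpoints
    wal_buffers = 128MB         # more headroom for WAL writes
    log_checkpoints = on        # verify that checkpoints actually stop firing

Whatever restore time you get with checkpoints out of the picture becomes your baseline for the rest of the tuning.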
Unrelated to your immediate problem: you should set the *_io_concurrency parameters to 200 since you use SSDs.
pg_restore rebuilds indexes, so also make sure the settings relevant to index building are set appropriately (see max_parallel_* and *_io_concurrency).
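A sketch, assuming the 24 vCPUs you list below (the worker count is a starting guess, not a tested value):

    effective_io_concurrency = 200
    maintenance_io_concurrency = 200
    max_parallel_maintenance_workers = 8    # workers available to each CREATE INDEX

Your maintenance_work_mem = 8GB is the other setting that matters here and already looks sensible.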
Kiriakos Georgiou
> On Dec 5, 2025, at 5:30 AM, MentionTheElephant <MentionTheElephant(at)gmail(dot)com> wrote:
>
> Hello,
>
> I would greatly appreciate your insight into an issue where pg_restore
> runs significantly slower than expected, even though the underlying
> storage shows very high random write throughput. I am trying to
> understand which PostgreSQL mechanisms or system layers I should
> investigate next in order to pinpoint the bottleneck and improve
> restore performance.
>
> The central question is: What should I examine further to understand
> why checkpoint processing becomes the dominant bottleneck during
> restore, despite fsync=off, synchronous_commit=off, and excellent
> random write latency?
>
> Below is a detailed description of the environment, the behavior
> observed, the steps I have already taken, and the research performed
> so far.
>
> During pg_restore, execution time remains extremely long: over 2
> hours using a custom-format dump and over 4 hours using the directory
> format. The machine consistently demonstrates high random write
> performance (median latency ~5 ms, ~45k random write IOPS), yet
> PostgreSQL logs show very long checkpoints where the write phase
> dominates (hundreds to thousands of seconds). Checkpoints appear to
> stall the entire restore process.
>
> I have tested multiple combinations of dump formats (custom and
> directory) and parallel jobs (j = 1, 12, 18). The restore duration
> barely changes. This strongly suggests that the bottleneck is not
> client-side parallelism but internal server behavior—specifically the
> checkpoint write phase.
>
> Example log excerpts show checkpoint write times consistently in the
> range of 600–1100 seconds, with large numbers of buffers written (from
> hundreds of thousands to over 1.6 million). Sync times remain
> negligible because fsync is disabled, reinforcing the suspicion that
> PostgreSQL's internal buffer flushing and write throttling mechanisms
> are the source of the slowdown, not WAL or filesystem sync.
>
> Given that:
>
> * Storage is fast,
> * fsync and synchronous commits are disabled,
> * full_page_writes is off,
> * wal_level is minimal,
> * autovacuum is off,
> * the restore is the only workload,
>
> I am trying to determine what further PostgreSQL internals or Linux
> I/O mechanisms may explain why these checkpoints are taking orders of
> magnitude longer than the device’s raw write characteristics would
> suggest.
>
> I am particularly looking for guidance on:
>
> * Whether backend or checkpointer write throttling may still be
> limiting write concurrency even during bulk restore,
> * Whether XFS on Hyper-V VHDX + LVM + battery-backed SSD could
> introduce any serialization invisible to raw I/O tests,
> * Whether certain parameters (e.g., effective_io_concurrency,
> maintenance_io_concurrency, wal_writer settings, combine limits,
> io_uring behavior) could unintentionally reduce write throughput,
> * Whether parallel pg_restore is inherently constrained by global
> buffer flushing behavior,
> * Any other PostgreSQL mechanisms that could cause prolonged
> checkpoint write durations even with crash-safety disabled.
>
> Below are the configuration values and environment details referenced above.
>
> Machine:
> Hyper-V VM
> 24 vCPU
> 80 GB RAM
> Ubuntu 24.04.3 (kernel 6.8.0-88)
> PostgreSQL 18.1
>
> Database size:
> ~700 GB across two tablespaces on separate disks (freshly restored)
>
> Storage layout:
> Each disk is its own VHDX
> LVM on battery-backed SSD array
> XFS for PGDATA
> Barriers disabled
>
> Random write performance (steady state):
> Median latency: 5.1 ms
> IOPS: ~45.6k
>
> Restore tests:
> pg_restore custom format: ~2h+
> pg_restore directory format: ~4h+
> Parallelism tested with j = 1, 12, 18, 24
>
> Representative checkpoint log entries:
> (write phases ranging 76–1079 seconds, buffer writes up to 1.6M)
>
> postgresql.conf (relevant parts):
> shared_buffers = 20GB
> work_mem = 150MB
> maintenance_work_mem = 8GB
> effective_io_concurrency = 1
> maintenance_io_concurrency = 1
> io_max_combine_limit = 512kB
> io_combine_limit = 1024kB
> io_method = io_uring
>
> fsync = off
> synchronous_commit = off
> wal_sync_method = fdatasync
> full_page_writes = off
> wal_compression = lz4
>
> checkpoint_timeout = 60min
> checkpoint_completion_target = 0.9
> max_wal_size = 80GB
> min_wal_size = 10GB
>
> effective_cache_size = 65GB
> autovacuum = off
> max_locks_per_transaction = 256
>
> If anyone has encountered similar behavior or can recommend specific
> PostgreSQL subsystems, kernel settings, or I/O patterns worth
> investigating, I would be very grateful for advice. My main goal is to
> understand why checkpoint writes are so slow relative to the
> hardware’s demonstrated capabilities, and how to safely accelerate the
> restore workflow.
>
> Thank you in advance for any guidance.
>