| From: | SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com> |
|---|---|
| To: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Lakshmi N <lakshmin(dot)jhs(at)gmail(dot)com> |
| Subject: | Re: Skip prefetch for block references that follow a FPW or WILL_INIT of the same block |
| Date: | 2026-05-07 07:45:54 |
| Message-ID: | CAHg+QDc761t7QpvyMU_ZaRyfnET_9xg0Vvt2kMbG7do6titTTg@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi,
On Tue, Mar 24, 2026 at 9:18 AM SATYANARAYANA NARLAPURAM <
satyanarlapuram(at)gmail(dot)com> wrote:
> Hi Hackers,
>
> While reviewing the patch in the thread [1], I noticed the following:
>
> When the WAL prefetcher encounters a block reference that carries a full
> page image (FPW) or has BKPBLOCK_WILL_INIT set, it correctly skips issuing
> a prefetch for that block: the old on-disk content is irrelevant because
> replay will overwrite or zero the page entirely. However, if a later WAL
> record within the look-ahead window references the same block without an
> FPW, the prefetcher still issues a fadvise64 syscall for it, because the
> block was never recorded in the duplicate-detection window.
>
> Fixed this by marking these blocks as recently seen in the FPW and
> WILL_INIT skip paths. The existing duplicate-check loop then naturally
> suppresses prefetch attempts for subsequent references to the same block,
> counting them under the skip_rep stat. This is particularly effective for
> workloads that produce many sequential writes to the same page (e.g., bulk
> inserts into heap-only tables), where each page's first post-checkpoint
> touch generates an FPW and subsequent inserts to the same page follow
> shortly after in WAL.
>
> To further reduce wasted prefetch calls, we could increase the window
> size by changing XLOGPREFETCHER_SEQ_WINDOW_SIZE according to the maximum
> number of blocks that can be prefetched, or maintain a hash table. I did
> not attempt this in this patch because it can impact redo performance
> (more CPU cycles). In the worst case, the current fix may not help in
> scenarios where the table has more than four indexes, for example.
> However, I still believe it is an improvement over the baseline. If we
> decide to spend more cycles on optimizing the window size, that can be
> done in a separate patch.
>
> Benchmarked recovery with 10 GB of WAL from an insert-only workload into
> a no-index table, replayed from an identical crash snapshot:
>
> Fast disk (NVMe)
> Baseline: redo 37.30s, system CPU 9.38s, 1,204,992 fadvise calls
> Patched: redo 25.78s, system CPU 3.39s, 122,753 fadvise calls
>
> Redo time drops by nearly 31%, with about 90% fewer fadvise syscalls.
>
> *Prefetch Counters*
> Counter                     Baseline     Patched      Delta
> prefetch (fadvise issued)   1,204,992    122,753      −89.8%
> hit                         924,457      911,785      −1.4%
> skip_init                   1,097,536    1,097,536    0
> skip_fpw                    28           28           0
> skip_rep                    80,020,209   81,115,120   +1,094,911
>
> Slower disk (with ~2ms latency)
> Baseline: redo 188.04s, system CPU 6.87s, 1,204,992 fadvise calls
> Patched: redo 60.02s, system CPU 3.39s, 122,753 fadvise calls
>
> Redo time drops by about 68%, a 3.1× overall speedup.
>
>
> *Configuration:*
>
> shared_buffers = '124GB'
> huge_pages = on
> wal_buffers = '512MB'
> max_wal_size = '100GB'
> checkpoint_timeout = '30min'
> full_page_writes = on
> maintenance_io_concurrency = 50
> recovery_prefetch = on
>
> *Workload:*
> CREATE TABLE test_noindex(id bigint, val1 int, val2 int, payload text);
> -- No indexes, no primary key.
>
>
> -- Then insert in batches of 1M rows until WAL reaches 10 GB:
> INSERT INTO test_noindex
> SELECT g, (g*7+13)%100000, (g*31+17)%100000, repeat(chr(65+(g%26)),60)
> FROM generate_series(1, 1000000) g;
>
>
> Thanks,
> Satya
>
> [1]
> https://www.postgresql.org/message-id/flat/CA%2B3i_M8C%2BrK9vhwBm8U%2Bys2hbDifoBb4Xnws5Wmn2f4u7iqOpA%40mail.gmail.com#8eac90e696baf6e4f58f91482af28e07
>
Rebased the patch.
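
For anyone skimming the thread, the idea described in the quoted message can be modeled with a small standalone C sketch. This is not the attached patch and does not use PostgreSQL's real data structures. Every name here (`ToyPrefetcher`, `ToyBlockRef`, `toy_process`, `TOY_WINDOW_SIZE`) is hypothetical, and the window size of 4 is an assumption that stands in for XLOGPREFETCHER_SEQ_WINDOW_SIZE. The sketch only illustrates how recording FPW/WILL_INIT blocks in a small "recently seen" ring lets the existing duplicate check suppress later prefetches of the same block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumption: stands in for XLOGPREFETCHER_SEQ_WINDOW_SIZE. */
#define TOY_WINDOW_SIZE 4

/* Hypothetical, simplified block reference taken from a WAL record. */
typedef struct ToyBlockRef
{
    uint32_t rel;          /* toy relation identifier */
    uint32_t blkno;        /* block number within the relation */
    bool     fpw_or_init;  /* carries an FPW or has WILL_INIT set */
} ToyBlockRef;

/* Hypothetical prefetcher state: a tiny ring of recently seen blocks. */
typedef struct ToyPrefetcher
{
    uint32_t recent_rel[TOY_WINDOW_SIZE];
    uint32_t recent_blk[TOY_WINDOW_SIZE];
    bool     used[TOY_WINDOW_SIZE];
    int      next;              /* next ring slot to overwrite */
    bool     remember_skipped;  /* false: baseline, true: proposed idea */
    long     prefetch;          /* prefetches that would be issued */
    long     skip_full;         /* skipped because of FPW or WILL_INIT */
    long     skip_rep;          /* skipped as a recent repeat */
} ToyPrefetcher;

void toy_init(ToyPrefetcher *p, bool remember_skipped)
{
    assert(p != NULL);
    memset(p, 0, sizeof(*p));
    p->remember_skipped = remember_skipped;
}

/* Record a block in the ring, overwriting the oldest slot. */
void toy_remember(ToyPrefetcher *p, const ToyBlockRef *b)
{
    p->recent_rel[p->next] = b->rel;
    p->recent_blk[p->next] = b->blkno;
    p->used[p->next] = true;
    p->next = (p->next + 1) % TOY_WINDOW_SIZE;
}

/* Linear scan of the ring; cheap because the window is tiny. */
bool toy_seen_recently(const ToyPrefetcher *p, const ToyBlockRef *b)
{
    for (int i = 0; i < TOY_WINDOW_SIZE; i++)
    {
        if (p->used[i] &&
            p->recent_rel[i] == b->rel &&
            p->recent_blk[i] == b->blkno)
            return true;
    }
    return false;
}

/* Decide what to do with one block reference and update counters. */
void toy_process(ToyPrefetcher *p, const ToyBlockRef *b)
{
    if (b->fpw_or_init)
    {
        /* Old page content is irrelevant, so never prefetch. */
        p->skip_full++;
        if (p->remember_skipped)
            toy_remember(p, b);  /* the idea discussed in this thread */
        return;
    }
    if (toy_seen_recently(p, b))
    {
        p->skip_rep++;
        return;
    }
    toy_remember(p, b);
    p->prefetch++;
}
```

In this toy model, feeding one FPW reference followed by three plain references to the same block yields one prefetch in baseline mode (`remember_skipped = false`) and none in the other mode. It also shows the limitation mentioned in the quoted message: once enough other distinct blocks pass through the ring, the remembered block is evicted and a later reference to it is prefetched again.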
| Attachment | Content-Type | Size |
|---|---|---|
| 0001-xlogprefetcher-record-FPW-WILL_INIT-blocks-in-the-re.patch | application/octet-stream | 3.7 KB |