| From: | Mingwei Jia <i(at)nayishan(dot)top> |
|---|---|
| To: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
| Subject: | [RFC PATCH v2 RESEND 01/10] umbra: add patch 0 design notes and repository navigation |
| Date: | 2026-06-01 23:29:20 |
| Message-ID: | 20260601232921.67880-1-i@nayishan.top |
| Lists: | pgsql-hackers |
---
README.md | 261 +++++++++++-
README_ZH.md | 241 +++++++++++
doc/umbra/ARCHITECTURE.md | 437 ++++++++++++++++++++
doc/umbra/ARCHITECTURE_ZH.md | 282 +++++++++++++
doc/umbra/PROTOTYPE.md | 86 ++++
doc/umbra/PROTOTYPE_ZH.md | 74 ++++
doc/umbra/REVIEW_GUIDE.md | 210 ++++++++++
doc/umbra/REVIEW_GUIDE_ZH.md | 133 ++++++
doc/umbra/UMBRA_FPW_STORY.md | 708 ++++++++++++++++++++++++++++++++
doc/umbra/UMBRA_FPW_STORY_ZH.md | 500 ++++++++++++++++++++++
doc/umbra/WAL_AND_REDO.md | 419 +++++++++++++++++++
doc/umbra/WAL_AND_REDO_ZH.md | 248 +++++++++++
12 files changed, 3583 insertions(+), 16 deletions(-)
create mode 100644 README_ZH.md
create mode 100644 doc/umbra/ARCHITECTURE.md
create mode 100644 doc/umbra/ARCHITECTURE_ZH.md
create mode 100644 doc/umbra/PROTOTYPE.md
create mode 100644 doc/umbra/PROTOTYPE_ZH.md
create mode 100644 doc/umbra/REVIEW_GUIDE.md
create mode 100644 doc/umbra/REVIEW_GUIDE_ZH.md
create mode 100644 doc/umbra/UMBRA_FPW_STORY.md
create mode 100644 doc/umbra/UMBRA_FPW_STORY_ZH.md
create mode 100644 doc/umbra/WAL_AND_REDO.md
create mode 100644 doc/umbra/WAL_AND_REDO_ZH.md
diff --git a/README.md b/README.md
index f6104c038b..44bf57c782 100644
--- a/README.md
+++ b/README.md
@@ -1,21 +1,250 @@
-PostgreSQL Database Management System
-=====================================
+# Umbra on PostgreSQL master
-This directory contains the source code distribution of the PostgreSQL
-database management system.
+[English](./README.md) | [中文](./README_ZH.md)
-PostgreSQL is an advanced object-relational database management system
-that supports an extended subset of the SQL standard, including
-transactions, foreign keys, subqueries, triggers, user-defined types
-and functions. This distribution also contains C language bindings.
+This repository hosts the current Umbra prototype on top of PostgreSQL master.
-Copyright and license information can be found in the file COPYRIGHT.
+Umbra is a storage-manager variant in which selected relation forks keep
+ordinary PostgreSQL logical block numbers at the upper layers, but are stored
+through an internal logical-to-physical mapping layer underneath. MAIN, FSM,
+and VM can therefore still be addressed as logical blocks while Umbra
+translates them to physical blocks stored in data-fork files.
-General documentation about this version of PostgreSQL can be found at
-<https://www.postgresql.org/docs/devel/>. In particular, information
-about building PostgreSQL from the source code can be found at
-<https://www.postgresql.org/docs/devel/installation.html>.
+In this model, a remap means moving one logical block from its old physical
+block to a newly published physical block. The purpose of that remap is to
+give ordinary checkpoint-boundary updates a different recovery baseline.
+Instead of overwriting the old physical page and logging a full-page image just
+to protect that overwrite, Umbra can publish a new physical page for the same
+logical block and record the old/new physical mapping in WAL. During redo, a
+remap record is replayed through the mapping view expected by that record; a
+delta-only remap uses the old physical page plus WAL delta instead of treating
+the update as overwrite-in-place on the new physical page. This is the
+mechanism Umbra uses to reduce ordinary full-page-image pressure while
+preserving crash-recovery ordering.
-The latest version of this software, and related software, may be
-obtained at <https://www.postgresql.org/download/>. For more information
-look at our web site located at <https://www.postgresql.org/>.
+This branch family is correctness-first. It is useful for design review,
+implementation reading, and testing. It is not presented as a finished
+production feature.
+
+## Branch Layout
+
+- `umbra-poc-pgmaster`
+  - Umbra PoC based on PostgreSQL master
+  - full implementation branch for full-tree reading and testing
+- `shadow-pg12-archive`
+  - archived PostgreSQL 12.2 shadow prototype
+
+## Current Scope
+
+The current implementation includes:
+
+- a `--with-umbra` build option and Umbra `smgr` integration
+- an internal metadata fork per relation that stores:
+  - the MAP superblock, which records fork-level state such as:
+    - logical EOF
+    - physical capacity
+    - committed allocator frontier
+  - MAP pages, which record per-block mapping facts:
+    - `lblk -> pblk` entries
+    - unmapped versus mapped state for ordinary logical blocks
+- a MAP subsystem that owns:
+  - logical-to-physical lookup
+  - shared superblock state and related runtime state for:
+    - logical EOF
+    - allocator/frontier state
+    - reclaim boundaries
+- two background workers:
+  - `mapwriter`
+    - MAP-page flush
+    - preallocation
+  - `mapcompactor`
+    - reclaim
+    - compaction
+- remap-aware WAL/redo support, including:
+  - remap-aware block headers on ordinary WAL records
+  - redo-side remap interpretation in `xlogutils.c`
+- Umbra recovery TAP coverage in `src/test/recovery`
+
+## Design In One Page
+
+Umbra should be read as a storage-layer split with six distinct pieces.
+
+1. Upper PostgreSQL layers keep ordinary logical addressing.
+   Relations, forks, and block numbers are still presented as logical objects
+   to normal PostgreSQL callers. Umbra changes physical placement underneath
+   `smgr`; it does not ask upper layers to reason in physical block numbers.
+
+2. Persistent truth lives in the metadata fork.
+   Each relation has an internal metadata fork containing:
+   - a MAP superblock for fork-level facts such as logical EOF, physical
+     capacity, and the committed allocator frontier; this superblock is stored
+     as a small 512-byte metadata sector
+   - MAP pages for per-block `lblk -> pblk` mapping facts; these pages are
+     compact fixed-entry metadata pages, closer to CLOG-style metadata than to
+     ordinary PostgreSQL data pages, so they do not use ordinary data-page FPW
+     semantics
+
+3. Runtime access is split from physical file I/O.
+   - `umbra.c` owns mapped-fork runtime semantics.
+   - the MAP subsystem owns lookup and shared runtime state.
+   - `umfile.c` owns physical file and segment operations.
+
+4. WAL is the owner boundary for MAP state changes.
+   - fork-level superblock facts become redo-visible through WAL.
+   - physical-page lifecycle transitions become redo-visible through WAL.
+   - logical-to-physical mapping changes become redo-visible through WAL.
+   Ordinary block references carry page-replay remap metadata; Umbra rmgr
+   records cover explicit MAP lifecycle actions outside ordinary block
+   references.
+
+5. Redo replays remap records through the record's expected mapping view.
+   - redo first restores the old/new mapping view carried by the WAL record.
+   - without an image, replay uses the old physical page plus WAL delta, not
+     overwrite-in-place on the new physical page.
+   - with an image, redo installs the image into the newly published mapping.
+
+6. Background maintenance stays separate from the foreground access path.
+   - `mapwriter` handles MAP-page flush and preallocation
+   - `mapcompactor` handles reclaim and compaction
+   This keeps long-term space convergence out of the hot foreground allocation
+   path.
+
+## Documentation
+
+Detailed design notes live under [doc/umbra/](./doc/umbra/).
+
+Primary English documents:
+
+- [Architecture](./doc/umbra/ARCHITECTURE.md)
+- [WAL and Redo](./doc/umbra/WAL_AND_REDO.md)
+- [Review Guide](./doc/umbra/REVIEW_GUIDE.md)
+- [Prototype and Branch Navigation](./doc/umbra/PROTOTYPE.md)
+- [FPW-to-remap design story](./doc/umbra/UMBRA_FPW_STORY.md)
+
+Chinese companion material:
+
+- [Architecture](./doc/umbra/ARCHITECTURE_ZH.md)
+- [WAL and Redo](./doc/umbra/WAL_AND_REDO_ZH.md)
+- [Review Guide](./doc/umbra/REVIEW_GUIDE_ZH.md)
+- [Prototype and Branch Navigation](./doc/umbra/PROTOTYPE_ZH.md)
+- [FPW-to-remap 设计故事](./doc/umbra/UMBRA_FPW_STORY_ZH.md)
+
+## Testing Baseline
+
+The current correctness baseline is the md/Umbra matrix below. When switching
+between modes in the same source tree, clean the previous build first.
+
+```sh
+make distclean
+./configure
+make
+make check
+make -C src/test/recovery check
+
+make distclean
+./configure --with-umbra
+make
+make check
+make -C src/test/recovery check
+```
+
+One especially important recovery test is:
+
+```sh
+make -C src/test/recovery check PROVE_TESTS=t/074_umbra_torn_page_remap.pl
+```
+
+This test acts as a negative control in md mode and validates torn-page remap
+recovery in Umbra mode.
+
+## Preliminary Performance Indicators
+
+Current performance evidence is directional only. Two early signals are worth
+showing together:
+
+- TPCC-style throughput under the same workload
+- WAL-size ratio under the same workload
+
+The throughput view matters because WAL-size reduction alone does not fully
+describe performance. The fair default baseline is:
+
+- `md + fpw=on`
+
+The `md + fpw=off` numbers are useful as a sensitivity / upper-bound reference,
+not as a correctness-equivalent baseline.
+
+Common settings:
+
+- `checkpoint_timeout = 2min`
+- `max_wal_size = 20GB`
+- `shared_buffers = 50GB`
+- `logging_collector = on`
+- `runMins = 10`
+- `newOrderWeight = 45`
+- `paymentWeight = 43`
+- `deliveryWeight = 4`
+- `stockLevelWeight = 4`
+- `orderStatusWeight = 4`
+
+### TPCC-Style Throughput
+
+#### Checksums Disabled
+
+| clients | `md + fpw=on` | `md + fpw=off` | `Umbra + fpw=on` |
+| ------- | ------------: | -------------: | ---------------: |
+| 10 | 158709 | 154283 | 155781 |
+| 50 | 577005 | 626954 | 656353 |
+| 200 | 641899 | 981436 | 995635 |
+| 500 | 322660 | 943295 | 859058 |
+| 1000 | 275609 | 899631 | 729989 |
+
+#### Checksums Enabled
+
+| clients | `md + fpw=on` | `md + fpw=off` | `Umbra + fpw=on` |
+| ------- | ------------: | -------------: | ---------------: |
+| 10 | 155754 | 152025 | 150606 |
+| 50 | 601974 | 635597 | 650844 |
+| 200 | 621176 | 1015923 | 938311 |
+| 500 | 316950 | 972795 | 729801 |
+| 1000 | 282713 | 891770 | 674865 |
+
+### WAL-Size Ratio
+
+The ratio reported below is
+`md WAL bytes with full_page_writes=on` divided by
+`Umbra WAL bytes with full_page_writes=on`.
+
+Larger values mean Umbra generated less WAL for the same workload.
+
+#### Checksums Disabled
+
+| clients | md WAL / Umbra WAL |
+| ------- | ------------------ |
+| 10 | 2.03 |
+| 50 | 2.51 |
+| 200 | 5.22 |
+| 500 | 6.90 |
+| 1000 | 6.55 |
+
+#### Checksums Enabled
+
+| clients | md WAL / Umbra WAL |
+| ------- | ------------------ |
+| 10 | 1.82 |
+| 50 | 2.11 |
+| 200 | 3.81 |
+| 500 | 4.58 |
+| 1000 | 4.87 |
+
+Taken together, the numbers show that Umbra does more than reduce WAL volume:
+at 50 clients and above it also recovers a large part of the throughput lost
+to ordinary checkpoint-boundary full-page-image pressure.
+
+These numbers should be read as:
+
+- preliminary
+- directional
+- not yet a complete benchmark
+
+They should not be read as a final claim about throughput, latency, or full
+replication/recovery cost.
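The redo rule in item 5 of "Design In One Page" can be sketched as a small model. This is an illustrative Python sketch, not code from the patch: `RemapBlockRef`, `redo_remap`, and the dictionary-based mapping and page store are hypothetical names, and the final step (storing the replayed page at the new physical block and republishing the mapping) is an assumption read from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class RemapBlockRef:
    """Hypothetical stand-in for one remap-carrying block reference."""
    lblk: int                      # logical block number seen by upper layers
    old_pblk: int                  # physical block mapped before the record
    new_pblk: int                  # physical block published by the record
    image: Optional[bytes] = None  # full-page image, if the record has one
    delta: Optional[Callable[[bytes], bytes]] = None  # WAL delta to apply


def redo_remap(ref: RemapBlockRef,
               mapping: Dict[int, int],
               pages: Dict[int, bytes]) -> bytes:
    """Replay one remap record against a toy mapping and page store."""
    if ref.image is not None:
        # Remap with image: the image itself is the recovery baseline.
        page = ref.image
    else:
        # Delta-only remap: start from the OLD physical page, never from
        # whatever happens to be on the new one (it may be torn).
        if ref.delta is None:
            raise ValueError("delta-only remap needs a delta")
        page = ref.delta(pages[ref.old_pblk])
    # Assumption: the replayed page lands on new_pblk and the mapping is
    # republished to point there.
    pages[ref.new_pblk] = page
    mapping[ref.lblk] = ref.new_pblk
    return page


# Usage: logical block 7 moves from physical block 100 to 205; the new page
# was torn by a crash, so its on-disk content must not be used as a base.
mapping = {7: 100}
pages = {100: b"old-page", 205: b"TORN"}
ref = RemapBlockRef(lblk=7, old_pblk=100, new_pblk=205,
                    delta=lambda base: base + b"+delta")
result = redo_remap(ref, mapping, pages)
```

The point of the model is that the torn content at the new physical block never influences the result, which is why a delta-only remap does not need a full-page image to protect it.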
diff --git a/README_ZH.md b/README_ZH.md
new file mode 100644
index 0000000000..dee95115d2
--- /dev/null
+++ b/README_ZH.md
@@ -0,0 +1,241 @@
+# Umbra 在 PostgreSQL master 上的原型说明
+
+[English](./README.md) | [中文](./README_ZH.md)
+
+这个仓库承载了基于 PostgreSQL `master` 的当前 Umbra 原型。
+
+Umbra 可以理解成 PostgreSQL 存储管理层上的一层扩展:上层仍然按普通逻辑
+块号访问数据,而底层通过内部的“逻辑块到物理块”映射,把选定 fork 的内容
+写入实际物理块。这样 `MAIN`、`FSM`、`VM` 这些 fork 在上层看来仍然是普通
+逻辑块,物理布局变化则由 Umbra 在下层负责。
+
+在这个模型里,remap 指的是把同一个逻辑块从旧物理块切换到新发布的物理块。
+它的直接目的,是给 ordinary checkpoint-boundary 更新提供另一种恢复基线。
+传统 `md` 路径会覆盖旧物理页,因此需要 full-page image 来保护这次覆盖;
+Umbra 则可以为同一个逻辑块发布一个新的物理页,并在 WAL 中记录 old/new
+physical mapping。redo 时,remap record 要按该 record 期待的映射视图回放;
+delta-only remap 使用“旧物理页 + WAL delta”,而不是把这次更新理解成在新物理
+页上的原地覆盖。这样 Umbra 才能在保持 crash-recovery 顺序的同时,降低 ordinary
+full-page-image 压力。
+
+这条分支的目标是先把正确性、设计边界和可验证性建立起来。它适合做设计审阅、
+实现阅读和测试,但不应被表述为已经完成的生产特性。
+
+## 分支布局
+
+- `umbra-poc-pgmaster`
+  - 基于 PostgreSQL `master` 的 Umbra 原型分支
+  - 用于完整源码阅读和测试
+- `shadow-pg12-archive`
+  - PostgreSQL 12.2 时代的 `shadow` 原型归档分支
+
+## 当前实现范围
+
+当前实现包含:
+
+- `--with-umbra` 构建选项,以及 Umbra 在 `smgr` 层的接入
+- 每个 relation 的内部 `metadata fork`,用来存放:
+  - MAP superblock,负责记录 fork 级别状态,例如:
+    - 逻辑文件末尾
+    - 已物化的物理容量
+    - 已提交的分配前沿
+  - 普通 MAP page,负责记录逐块映射事实:
+    - `lblk -> pblk` 条目
+    - 普通逻辑块当前是否已经建立映射
+- 一个 MAP 子系统,负责:
+  - 逻辑块到物理块的查找
+  - superblock 共享状态及相关运行时状态管理,包括:
+    - 逻辑文件末尾
+    - 分配前沿
+    - 回收边界
+- 两个后台进程:
+  - `mapwriter`
+    - MAP page 刷盘
+    - 预分配物理空间
+  - `mapcompactor`
+    - 回收
+    - 压缩整理
+- 一组围绕 remap 与 redo 的 WAL/恢复支持,包括:
+  - 普通 WAL record 上的 remap 元数据
+  - `xlogutils.c` 中的 remap 解释与恢复路径
+- `src/test/recovery` 下的 Umbra recovery TAP 测试
+
+## 一页设计摘要
+
+Umbra 可以被理解成一个由六个层次组成的存储层拆分。
+
+1. 上层 PostgreSQL 保持普通逻辑寻址。
+   对普通 PostgreSQL 调用方来说,relation、fork 和块号这些对象仍然按逻辑
+   语义使用。Umbra 改变的是 `smgr` 下方的物理布局,而不是要求上层直接处理
+   物理块号。
+
+2. 持久化真相放在 `metadata fork` 中。
+   每个 relation 都有一个内部 `metadata fork`,里面包含:
+   - 一个 MAP superblock,用来记录 fork 级别事实,例如逻辑文件末尾、
+     已物化的物理容量,以及已提交的分配前沿;这个 superblock 是一个很小的
+     `512B` metadata sector
+   - 一组 MAP page,用来记录逐块的 `lblk -> pblk` 映射事实;这些 page 是由
+     固定大小 entry 组成的紧凑 metadata page,形式上更接近 CLOG 这类
+     metadata,而不是普通 PostgreSQL data page,因此不走普通 data-page FPW
+     语义
+
+3. 运行时访问路径和物理文件 I/O 明确分层。
+   - `umbra.c` 负责 mapped fork 的运行时访问语义。
+   - MAP 子系统负责查找和共享运行时状态。
+   - `umfile.c` 负责真正的物理文件和 segment 操作。
+
+4. WAL 是 MAP 状态变化的 owner 边界。
+   - fork 级 superblock 事实通过 WAL 成为 redo 可见状态。
+   - 物理页生命周期转换通过 WAL 成为 redo 可见状态。
+   - 逻辑页到物理页的映射变化通过 WAL 成为 redo 可见状态。
+   普通 block reference 携带页面回放需要的 remap 元数据;普通 block
+   reference 之外的显式 MAP lifecycle 动作由 Umbra rmgr record 表达。
+
+5. redo 按 WAL record 期待的映射视图回放 remap。
+   - redo 先恢复该 record 携带的 old/new mapping view。
+   - 没有 image 时,回放基线是“旧物理页 + WAL delta”,不是在新物理页上做
+     原地覆盖。
+   - 有 image 时,redo 把 image 安装到新发布的映射上。
+
+6. 后台维护和前台访问路径分开。
+   - `mapwriter` 负责 MAP page 刷盘和预分配
+   - `mapcompactor` 负责回收和压缩整理
+   这样长周期的空间收敛就不会直接挤进前台热路径。
+
+## 文档
+
+更详细的设计说明放在 [doc/umbra/](./doc/umbra/)。
+
+英文主文档:
+
+- [Architecture](./doc/umbra/ARCHITECTURE.md)
+- [WAL and Redo](./doc/umbra/WAL_AND_REDO.md)
+- [Review Guide](./doc/umbra/REVIEW_GUIDE.md)
+- [Prototype and Branch Navigation](./doc/umbra/PROTOTYPE.md)
+- [FPW-to-remap design story](./doc/umbra/UMBRA_FPW_STORY.md)
+
+中文配套材料:
+
+- [Architecture](./doc/umbra/ARCHITECTURE_ZH.md)
+- [WAL and Redo](./doc/umbra/WAL_AND_REDO_ZH.md)
+- [Review Guide](./doc/umbra/REVIEW_GUIDE_ZH.md)
+- [Prototype and Branch Navigation](./doc/umbra/PROTOTYPE_ZH.md)
+- [FPW-to-remap 设计故事](./doc/umbra/UMBRA_FPW_STORY_ZH.md)
+
+## 测试基线
+
+当前正确性基线是 md/Umbra 双模式矩阵。在同一个源码树里切换构建模式时,
+先清理上一次构建。
+
+```sh
+make distclean
+./configure
+make
+make check
+make -C src/test/recovery check
+
+make distclean
+./configure --with-umbra
+make
+make check
+make -C src/test/recovery check
+```
+
+一个特别重要的恢复测试是:
+
+```sh
+make -C src/test/recovery check PROVE_TESTS=t/074_umbra_torn_page_remap.pl
+```
+
+这个测试在 md 模式下是负对照,在 Umbra 模式下验证 torn-page remap
+recovery。
+
+## 初步性能指标
+
+当前性能证据只能视为方向性信号。这里同时给两类早期指标:
+
+- 同一工作负载下的 TPCC 风格吞吐
+- 同一工作负载下的 WAL 大小比值
+
+吞吐视角很重要,因为仅看 WAL 降幅并不能完整描述性能。公平的默认基线是:
+
+- `md + fpw=on`
+
+而 `md + fpw=off` 更适合作为敏感性 / 上界参考,不应被看作与正确性约束等价
+的基线。
+
+公共设置:
+
+- `checkpoint_timeout = 2min`
+- `max_wal_size = 20GB`
+- `shared_buffers = 50GB`
+- `logging_collector = on`
+- `runMins = 10`
+- `newOrderWeight = 45`
+- `paymentWeight = 43`
+- `deliveryWeight = 4`
+- `stockLevelWeight = 4`
+- `orderStatusWeight = 4`
+
+### TPCC 风格吞吐
+
+#### Checksums 关闭
+
+| 并发 | `md + fpw=on` | `md + fpw=off` | `Umbra + fpw=on` |
+| ---- | ------------: | -------------: | ---------------: |
+| 10 | 158709 | 154283 | 155781 |
+| 50 | 577005 | 626954 | 656353 |
+| 200 | 641899 | 981436 | 995635 |
+| 500 | 322660 | 943295 | 859058 |
+| 1000 | 275609 | 899631 | 729989 |
+
+#### Checksums 开启
+
+| 并发 | `md + fpw=on` | `md + fpw=off` | `Umbra + fpw=on` |
+| ---- | ------------: | -------------: | ---------------: |
+| 10 | 155754 | 152025 | 150606 |
+| 50 | 601974 | 635597 | 650844 |
+| 200 | 621176 | 1015923 | 938311 |
+| 500 | 316950 | 972795 | 729801 |
+| 1000 | 282713 | 891770 | 674865 |
+
+### WAL 大小比值
+
+- `md WAL bytes with full_page_writes=on`
+- 除以
+- `Umbra WAL bytes with full_page_writes=on`
+
+比值越大,表示 Umbra 在相同工作负载下生成的 WAL 越少。
+
+#### Checksums 关闭
+
+| 并发 | md WAL / Umbra WAL |
+| ---- | ------------------ |
+| 10 | 2.03 |
+| 50 | 2.51 |
+| 200 | 5.22 |
+| 500 | 6.90 |
+| 1000 | 6.55 |
+
+#### Checksums 开启
+
+| 并发 | md WAL / Umbra WAL |
+| ---- | ------------------ |
+| 10 | 1.82 |
+| 50 | 2.11 |
+| 200 | 3.81 |
+| 500 | 4.58 |
+| 1000 | 4.87 |
+
+把吞吐和 WAL 大小两组数字放在一起看,可以看到 Umbra 不只是降低了 WAL 体积;
+在相同工作负载下,它也回收了 ordinary checkpoint-boundary full-page-image
+压力带走的大部分吞吐。
+
+这些数字应被理解为:
+
+- 初步结果
+- 方向性信号
+- 尚不是完整 benchmark
+
+它们不应被读作关于 throughput、latency 或完整 replication/recovery cost 的
+最终结论。
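The throughput tables in both README variants can be turned into ratios to check the "recovers a large part of the throughput" reading. The sketch below is illustrative Python using only the "Checksums Disabled" figures quoted above; the helper names and the "gap recovered" metric are assumptions introduced here, not measurements or terminology from the patch.

```python
# Throughput figures copied from the "Checksums Disabled" table above:
# clients -> (md + fpw=on, md + fpw=off, Umbra + fpw=on)
CHECKSUMS_DISABLED = {
    10: (158709, 154283, 155781),
    50: (577005, 626954, 656353),
    200: (641899, 981436, 995635),
    500: (322660, 943295, 859058),
    1000: (275609, 899631, 729989),
}


def speedup_over_md(table):
    """Return clients -> Umbra throughput divided by md + fpw=on throughput."""
    return {c: umbra / md_on for c, (md_on, _md_off, umbra) in table.items()}


def gap_recovered(table):
    """Return clients -> share of the fpw=on/fpw=off gap that Umbra closes.

    Only defined where md + fpw=off is actually faster than md + fpw=on.
    """
    out = {}
    for c, (md_on, md_off, umbra) in table.items():
        if md_off > md_on:
            out[c] = (umbra - md_on) / (md_off - md_on)
    return out


speedups = speedup_over_md(CHECKSUMS_DISABLED)
recovered = gap_recovered(CHECKSUMS_DISABLED)
for clients in sorted(speedups):
    share = recovered.get(clients)
    share_text = "n/a" if share is None else f"{share:.2f}"
    print(f"{clients:>5} clients: x{speedups[clients]:.2f}, "
          f"gap recovered {share_text}")
```

At 10 clients `md + fpw=off` is not faster than `md + fpw=on`, so there is no gap to recover and the metric is left undefined there. This is consistent with treating `md + fpw=off` as an upper-bound reference only at higher client counts.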
diff --git a/doc/umbra/ARCHITECTURE.md b/doc/umbra/ARCHITECTURE.md
new file mode 100644
index 0000000000..7218b1a959
--- /dev/null
+++ b/doc/umbra/ARCHITECTURE.md
@@ -0,0 +1,437 @@
+# Umbra Architecture on PostgreSQL Master
+
+This document describes the current module boundaries and ownership rules of
+the PostgreSQL master Umbra PoC.
+
+The main architectural intent is:
+
+- upper PostgreSQL layers continue to speak in logical block numbers
+- Umbra translates mapped forks to physical block numbers underneath
+- the MAP subsystem owns persistent mapping facts
+- `umbra.c` owns runtime interpretation of those facts
+- `umfile.c` owns physical file operations
+
+## 1. Top-Level Layers
+
+Umbra is not a standalone engine next to PostgreSQL. It is a storage-manager
+variant integrated into:
+
+- `smgr`
+- WAL record assembly
+- redo entry points
+- checkpoint/writeback
+- postmaster background workers
+
+The main layers are:
+
+- `src/backend/storage/smgr/umbra.c`
+  - Umbra `smgr` implementation
+- `src/backend/storage/smgr/umfile.c`
+  - low-level physical file and segment manager
+- `src/backend/storage/map/*`
+  - shared MAP metadata, buffer, superblock, and background-maintenance logic
+- `src/backend/access/transam/xloginsert.c`
+  - producer-side remap-aware WAL assembly
+- `src/backend/access/transam/xlogutils.c`
+  - redo-side remap interpretation
+- `src/backend/access/transam/umbra_xlog.c`
+  - Umbra rmgr records for MAP lifecycle operations
+
+## 2. Relation-Local Umbra State
+
+`SMgrRelation` no longer carries a public Umbra-specific struct layout.
+
+Umbra keeps its relation-local state behind `reln->umbra_private`, where
+`umbra.c` stores:
+
+- a borrowed `UmbraFileContext *`
+- an explicit relation-local MAP state
+
+The current MAP state is no longer derived from multiple booleans. It is an
+explicit state machine:
+
+- `UMBRA_MAP_POLICY_BYPASS_MAP`
+- `UMBRA_MAP_POLICY_SKIP_WAL_PENDING_MAP`
+- `UMBRA_MAP_POLICY_REQUIRE_MAP`
+
+That state is seeded by create/open/redo owner points and then consumed by the
+runtime access path.
+
+## 3. Metadata Fork
+
+Umbra uses an internal metadata fork to store:
+
+- block 0: MAP superblock
+- blocks 1..: MAP pages
+
+The metadata fork is:
+
+- internal to Umbra
+- dense, not sparse
+- special-cased in path and sync handling
+
+The metadata fork stores mapping state for all three mapped forks, `MAIN`,
+`FSM`, and `VM`, but it does not store their page contents. `MAIN/FSM/VM`
+remain relation forks addressed by logical block number at upper layers. MAP
+pages in the metadata fork only answer: for this fork and this logical block,
+which physical block is current?
+
+The layout is not three independent map forks. It is one metadata fork with
+fixed repeated groups:
+
+- block 0: MAP superblock
+- blocks 1..: repeated MAP page groups
+- each group starts with 1 FSM map page
+- then 1 VM map page
+- then 8192 MAIN map pages
+
+Each MAP page is a fixed-entry array, and each entry records one `lblk -> pblk`
+mapping for the corresponding fork. In that sense, the metadata fork is one
+internal MAP file with formula-defined `mapfsm`, `mapvm`, and `mapmain` logical
+regions.
+
+Its page format is also not the same as ordinary PostgreSQL data pages. Block
+0 is packed as a 512-byte MAP superblock sector. Blocks 1.. are compact
+fixed-entry MAP metadata pages, closer in spirit to CLOG-style metadata than
+to heap/index data pages. They therefore do not use ordinary data-page
+full-page-image semantics.
+
+This matters because internal metadata forks must not be passed through generic
+core helpers that only understand PostgreSQL's built-in forks. Metadata path
+construction stays inside Umbra-aware helpers such as `UmMetadataRelPathPerm()`
+and related wrappers.
+
+## 4. `umbra.c`: Runtime Storage Semantics
+
+`src/backend/storage/smgr/umbra.c` sits at the `smgr` boundary.
+
+It owns:
+
+- access classification for a relation/fork
+- mapped-vs-bypass decisions
+- logical-block to physical-block translation
+- publish/consume rules for mapped births and remaps
+- metadata-fork lifecycle wrappers
+- `FileTag` conversion for Umbra-managed files
+
+It does not own:
+
+- raw segment file management
+- MAP page layout
+- shared superblock table logic
+
+The important separation in this file is:
+
+1. classify runtime access state
+2. consume MAP facts
+3. issue physical I/O through `umfile`
+
+That separation is why thin metadata wrappers such as `UmMetadataExists()` and
+`UmMetadataRead()` are still useful: they keep the internal metadata fork
+details localized instead of scattering `UMBRA_METADATA_FORKNUM` and dense-fork
+assumptions across the tree.
+
+## 5. `umfile.c`: Physical File Layer
+
+`src/backend/storage/smgr/umfile.c` owns the physical side of Umbra storage.
+
+It is responsible for:
+
+- backend-local file context registry
+- segment open/close management
+- dense versus sparse physical existence semantics
+- physical read/write/extend/zeroextend
+- unlink, sync, delayed-unlink, and write-session helpers
+
+The current writeback architecture uses `UmFileWriteSession`, so callers such
+as MAP flush pass only storage identity and block information. The MAP layer
+no longer needs to manipulate `UmbraFileContext` directly when flushing.
+
+Checkpoint/bgwriter writeback uses `umfile_write_session_begin_uncached()`,
+which intentionally avoids long-lived registry reuse in background processes.
+That prevents stale relation-local file state from being kept across relation
+lifecycle changes.
+
+## 6. MAP Subsystem
+
+The MAP subsystem is now split by functional domain.
+
+### 6.1 `map.c`
+
+This file now mainly owns:
+
+- mapping lookup
+- mapping allocation
+- mapping publication
+- truncate/lifecycle operations
+
+### 6.2 `mapbuf.c`
+
+This file owns MAP buffer-local state:
+
+- buffer state bits
+- pin/unpin
+- MAP buffer I/O ownership
+- using `MapMarkBufferDirty()` to ensure the corresponding metadata-fork block
+ exists before an ordinary MAP page is marked dirty
+
+The important rule here is:
+
+- ordinary MAP page modifications must be dirtied through
+ `MapMarkBufferDirty()`; if the metadata-fork block does not exist yet, that
+ path creates the MAP block first
+- checkpoint/writeback later writes existing blocks only
+
+This mirrors ordinary buffer-pool ownership more closely than the older design
+that allowed flush-time materialization.
+
+### 6.3 `mapflush.c`
+
+This file owns:
+
+- checkpoint flush of MAP buffers
+- checkpoint flush of superblocks
+- mapwriter background flush of ordinary MAP pages
+
+Current ownership rules are:
+
+- mapwriter flushes regular MAP pages only
+- checkpoint owns superblock flush
+- flush writes existing metadata blocks
+- flush no longer zeroextends missing metadata blocks on demand
+
+### 6.4 `mapbgproc.c`
+
+This file owns background maintenance:
+
+- preallocation
+- reclaim enqueue/dequeue work
+- compactor stepping
+- writer/compactor wakeup helpers
+
+`mapwriter` and `mapcompactor` are now driven directly from the MAP layer
+rather than through an `smgr` wrapper layer.
+
+### 6.5 `mapclock.c`
+
+This file owns:
+
+- clock sweep victim selection
+- MAP cache table
+- sync-start reporting
+
+### 6.6 `mapsuper.c`
+
+This file owns:
+
+- MAP superblock read/pack/CRC helpers
+- shared `MapSuperEntry` hash-table management
+- logical frontier, physical frontier, and allocator frontier updates
+- runtime extending state for fork materialization
+
+The current shared-entry model distinguishes:
+
+- logical EOF (`logical_nblocks`) in the on-disk superblock
+- materialized physical frontier (`phys_capacity` / physical nblocks) in the
+ on-disk superblock
+- committed allocator frontier (`next_free_phys_block`) in the on-disk
+ superblock
+- reservation frontier in `MapSuperEntry` runtime state only
+
+That split is important both for correctness and for WAL/redo.
+
+The allocator invariant is:
+
+- committed `next_free_phys_block <= reservation frontier`
+
+That should be asserted while holding `MapSuperEntry.lock`; reservation may run
+ahead in shared memory, but checkpoint-visible superblock state must not.
+
+### 6.7 `mapinit.c`
+
+This file owns:
+
+- shared-memory initialization
+- backend initialization
+- shared statistics
+- GUC-backed globals
+
+### 6.8 `mapinflight.c`
+
+This file owns in-flight remap ownership tracking.
+
+It uses per-MAP-buffer pending bits to serialize ownership of a logical MAP
+entry while a backend is preparing or publishing a new physical mapping. The
+chosen physical block remains backend-local until WAL insertion commits the
+owner state.
+
+This mechanism is about ownership and barriers, not durable publication. The
+durable superblock frontier must not advance here. Runtime reservation state
+may run ahead in shared memory, but committed `next_free_pblkno` is published
+later by WAL-owned commit/redo.
+
+## 7. Checkpoint and Writeback Ownership
+
+One of the largest recent architectural cleanups is the checkpoint/writeback
+contract.
+
+The current contract is:
+
+- synthesized ordinary MAP pages carry `MAPBUF_NOT_MATERIALIZED`
+- if the corresponding metadata-fork block does not exist yet, the first writer
+ that dirties such a page creates that MAP block under the page content lock
+- checkpoint/mapwriter later write the existing block only
+
+That means:
+
+- missing MAP-block creation is no longer hidden inside flush
+- `MapFlushBuffer()` uses write-existing semantics
+- background writeback no longer invents missing metadata blocks
+
+This is the same broad ownership rule as the normal buffer pool:
+
+- extension / MAP-block creation happens before writeback
+- writeback persists existing dirty state
+
+## 8. Background Processes
+
+Umbra currently adds two background workers under `postmaster` when built with
+`--with-umbra`:
+
+- `mapwriter`
+  - sync-start accounting
+  - ordinary MAP page flush
+  - preallocation
+- `mapcompactor`
+  - relocation and reclaim work
+
+These processes call directly into the MAP layer rather than through a generic
+`smgr` forwarding API. That keeps ownership clearer:
+
+- MAP background work belongs to the MAP subsystem
+- `smgr` remains the storage-manager boundary for relation storage calls
+
+## 9. WAL and Redo Boundaries
+
+The current boundary is:
+
+- `xloginsert.c`
+  - decides whether a block record carries remap metadata
+  - fills the remap header payload
+  - commits mapping/frontier publication after WAL insertion succeeds
+- `xlogutils.c`
+  - ensures metadata and MAP state for redo
+  - interprets remap-with-image and remap-without-image
+  - temporarily reconstructs the old mapping view when needed
+- `umbra_xlog.c`
+  - handles Umbra rmgr records such as `MAP_SET`, range remap, and reclaim
+
+The redo-entry layer owns remap interpretation because generic block-read
+helpers do not know enough about:
+
+- `has_remap`
+- `has_image`
+- `old_pblkno`
+- `new_pblkno`
+- frontier payload
+
+The detailed rules are described in [WAL_AND_REDO.md](./WAL_AND_REDO.md).
+
+## 10. Current Invariants
+
+The current code relies on these invariants:
+
+- metadata fork handling stays inside Umbra-aware helpers
+- runtime access state is explicit, not reconstructed from multiple booleans
+- ordinary MAP pages are materialized before flush
+- checkpoint/mapwriter write existing metadata blocks only
+- remap publication after WAL insertion is an owner action
+- redo owns redo-only metadata bootstrap and remap interpretation
+- skip-WAL dense-map WAL describes exact mapping/frontier facts, but does not
+ replace the existing data-file sync protocol
+- full-page images are still kept for explicit image owners and WAL
+ consistency checking
+
+These invariants should be treated as design constraints when reviewing later
+WAL-size optimizations. For example, `next_free_pblkno` is a global allocator
+frontier, not a value that can always be replaced by `new_pblkno + 1`.
+
+## 11. Architectural Choices
+
+The current PoC makes two deliberate architectural choices that are worth
+stating explicitly.
+
+### 11.1 Space Cleanup Policy
+
+Once logical block numbering is decoupled from physical placement, Umbra has at
+least two possible physical-space-management models:
+
+1. immediately reuse freed physical blocks, closer to PostgreSQL's traditional
+ reusable-space style; or
+2. let the physical frontier move forward, while treating reclaim/reuse as a
+ later background concern instead of a foreground allocation requirement.
+
+The current PoC chooses the second model on purpose.
+
+The reason is not that reuse is impossible, but that immediate reuse would push
+substantial allocator complexity back into the foreground path:
+
+- free-space accounting would become part of normal allocation decisions
+- remap publication would need tighter coupling with reuse eligibility
+- WAL/redo would need to preserve more allocator-state invariants
+- in-flight ownership and recovery races would become harder to reason about
+
+After Umbra has already decoupled logical identity from physical placement, the
+main value of that decoupling is simplicity of ownership and correctness. The
+foreground path therefore prefers monotonic physical advancement, while
+compaction/reclaim remain the place where old physical space is cleaned up and
+made reusable later.
+
+In other words:
+
+- immediate physical reuse is not the primary design goal of the PoC
+- correctness and simpler ownership are prioritized over aggressive reuse
+- reclaim exists, but it is intentionally a background policy rather than a
+ synchronous allocator contract
+
+### 11.2 Double-Buffering Boundary
+
+Umbra also deliberately keeps its buffering complexity inside Umbra-specific
+layers instead of trying to collapse everything into PostgreSQL's generic
+buffering model immediately.
+
+This means the PoC tolerates a double-buffering shape:
+
+- PostgreSQL keeps its ordinary upper buffer/cache behavior
+- Umbra keeps its own MAP buffers, superblock shared state, in-flight tracking,
+ and physical-file writeback state
+
+That choice is intentional for three reasons:
+
+1. it keeps Umbra-specific complexity inside Umbra rather than leaking remap,
+ allocator, and metadata-lifecycle rules into generic PostgreSQL buffer
+ ownership;
+2. it allows the project to measure and understand the system-level impact of
+ the extra buffering layer instead of assuming up front that it must be
+ eliminated; and
+3. it keeps the design open to future deployment models, including
+ cloud-oriented environments where storage-side services and local caching
+ boundaries may not match a traditional single-node assumption.
+
+This is a design trade-off, not a claim that double buffering is always ideal.
+The current PoC chooses modular isolation first, and leaves deeper buffer-model
+consolidation as a later optimization/design question.
+
+## 12. Open Architectural Debt
+
+The codebase is much cleaner than the earlier PG18-era branch, but a few
+medium-sized debts remain:
+
+- `mapsuper.c` is now the largest MAP module and could later be split into
+ on-disk-superblock helpers versus shared-super-entry management
+- `mapbgproc.c` still combines preallocation, reclaim, compaction, and wakeup
+ logic
+- `umfile.c` still mixes context/session management with raw segment/file
+ operations
+
+Those are now refactoring opportunities, not immediate ownership bugs.
diff --git a/doc/umbra/ARCHITECTURE_ZH.md b/doc/umbra/ARCHITECTURE_ZH.md
new file mode 100644
index 0000000000..343eb6eb73
--- /dev/null
+++ b/doc/umbra/ARCHITECTURE_ZH.md
@@ -0,0 +1,282 @@
+# Umbra 架构说明
+
+本文档是 `ARCHITECTURE.md` 的中文配套版本,说明当前 PostgreSQL master
+上的 Umbra 原型如何分层,以及各模块各自负责什么。
+
+## 1. 总体目标
+
+Umbra 不是 PostgreSQL 旁边的独立存储引擎,而是接在 PostgreSQL
+`storage manager` 边界上的一个存储管理原型。
+
+它的核心目标是:
+
+- 上层 PostgreSQL 继续只使用逻辑块号;
+- Umbra 在 `smgr` 下方把需要映射的 fork 翻译成物理块;
+- MAP 子系统持久化 `lblk -> pblk` 映射;
+- WAL 明确携带 remap 所需信息;
+- redo 能在恢复阶段确定性地重建映射关系和页面内容。
+
+## 2. 主要模块
+
+主要代码路径如下:
+
+- `src/backend/storage/smgr/umbra.c`
+ - Umbra 的 `smgr` 实现;
+ - 运行时访问策略;
+ - 逻辑块到物理块的翻译。
+- `src/backend/storage/smgr/umfile.c`
+ - 物理文件层;
+ - 段文件管理;
+ - dense/sparse 存在性判断;
+ - 同步、删除和延迟删除。
+- `src/backend/storage/map/`
+ - MAP 页;
+ - MAP buffer;
+ - superblock;
+ - checkpoint / mapwriter 刷盘;
+ - 预分配、回收、压实;
+ - in-flight owner 跟踪。
+- `src/backend/access/transam/xloginsert.c`
+ - WAL 生成端的 remap 判定;
+ - remap header 填充;
+ - WAL insert 成功后的映射发布。
+- `src/backend/access/transam/xlogutils.c`
+ - redo 端对 remap 的解释与执行。
+- `src/backend/access/transam/umbra_xlog.c`
+ - Umbra 自己的 rmgr 生命周期记录。
+
+## 3. relation-local 状态
+
+Umbra 的 relation-local 状态挂在 `SMgrRelation->umbra_private` 后面,不把
+Umbra 的内部结构暴露给普通 `smgr` 调用方。
+
+当前访问策略使用显式状态,而不是由多个布尔值拼装:
+
+- `UMBRA_MAP_POLICY_BYPASS_MAP`
+- `UMBRA_MAP_POLICY_SKIP_WAL_PENDING_MAP`
+- `UMBRA_MAP_POLICY_REQUIRE_MAP`
+
+这些状态由 create/open/redo 对应的 owner 点建立,再由运行时访问路径消费。
+
+## 4. metadata fork
+
+每个 Umbra relation 都有一个内部 metadata fork:
+
+- block 0 是 MAP superblock;
+- block 1.. 是普通 MAP 页。
+
+metadata fork 是 Umbra 自己的内部结构,不是普通 PostgreSQL 用户可见的
+fork。因此 metadata 的路径、同步、删除以及 dense/sparse 语义都必须留在
+Umbra-aware helper 中,不能泄漏到通用 fork helper。
+
+metadata fork 同时保存 `MAIN`、`FSM`、`VM` 三类 mapped fork 的映射状态,但
+不保存这些 fork 的页面内容。也就是说,`MAIN/FSM/VM` 仍然是上层按逻辑块号访问
+的 relation fork;metadata fork 里的 MAP 页只回答“这个 fork 的某个逻辑块现在
+对应哪个物理块”。
+
+具体布局不是三个独立的 map fork,而是同一个 metadata fork 中的固定分组:
+
+- block 0:MAP superblock;
+- block 1..:重复的 MAP page group;
+- 每个 group 先放 1 个 FSM map page;
+- 再放 1 个 VM map page;
+- 再放 8192 个 MAIN map page。
+
+每个 MAP page 都由固定大小 entry 组成,每个 entry 记录对应 fork 中一个逻辑块
+的 `lblk -> pblk` 映射。因此可以把 metadata fork 理解成一个内部 MAP 文件,
+里面按稳定公式切出了 `mapfsm`、`mapvm`、`mapmain` 三类逻辑区域。
+
+它的页面格式也不同于普通 PostgreSQL data page。block 0 是按 `512B`
+sector 打包的 MAP superblock;block 1.. 是由固定大小 entry 组成的紧凑
+MAP metadata page,语义上更接近 CLOG 这类 metadata,而不是 heap/index
+data page。因此它们不使用普通 data-page full-page-image 语义。
+
+## 5. MAP 子系统分工
+
+当前 MAP 子系统按职责拆分如下:
+
+- `map.c`
+ - 查找;
+ - 分配;
+ - 映射发布;
+ - truncate / 生命周期处理。
+- `mapbuf.c`
+ - MAP buffer 状态;
+ - pin / unpin;
+ - buffer I/O 所有权;
+ - 通过 `MapMarkBufferDirty()` 保证普通 MAP 页标脏前,metadata fork 中已有
+ 对应物理 block。
+- `mapflush.c`
+ - checkpoint 刷盘;
+ - mapwriter 刷盘;
+ - superblock 刷盘。
+- `mapbgproc.c`
+ - 预分配;
+ - 回收;
+ - compactor;
+ - writer / compactor 唤醒。
+- `mapclock.c`
+ - 时钟扫描;
+ - MAP 缓存表;
+ - sync-start 统计。
+- `mapsuper.c`
+ - superblock 的打包、解包和 CRC;
+ - 共享 `MapSuperEntry` 表;
+ - 逻辑 EOF、物理容量和分配前沿。
+- `mapinit.c`
+ - 共享内存初始化;
+ - backend 初始化;
+ - 由 GUC 驱动的全局状态。
+- `mapinflight.c`
+ - in-flight remap owner;
+ - 写屏障;
+ - pending 标记。
+
+## 6. superblock 状态拆分
+
+superblock 和共享 entry 里同时存在几类不同状态,这些状态不能混在一起:
+
+- `logical_nblocks`
+ - 逻辑 EOF;
+ - 持久化在 superblock 中。
+- `phys_capacity`
+ - 已经完成物理物化的容量;
+ - 持久化在 superblock 中。
+- `next_free_pblkno`
+ - 已提交的分配前沿;
+ - 持久化在 superblock 中;
+ - 由 WAL-owned commit / redo 发布。
+- reservation frontier
+ - 运行时的预留前沿;
+ - 只存在于 `MapSuperEntry` 的共享状态里;
+ - 不直接落盘。
+
+关键不变量是:
+
+```text
+committed next_free_pblkno <= runtime reservation frontier
+```
+
+也就是说,预留前沿可以在内存里领先,但 checkpoint 可见的已提交前沿不能跑到
+WAL 已经发布的状态前面去。
+
+## 7. checkpoint 和回写
+
+当前回写规则是:
+
+- 修改普通 MAP 页的路径必须通过 `MapMarkBufferDirty()` 标脏;如果 metadata
+ fork 中还没有对应物理 block,这条路径会先创建 MAP block;
+- checkpoint / mapwriter 只写已经存在的脏 MAP 页;
+- 刷盘阶段不创建缺失的 MAP 页;
+- superblock 的刷盘归 checkpoint 所有。
+
+这套规则刻意靠近 PostgreSQL 的普通 buffer pool:
+
+- 创建缺失的 MAP block 属于写回之前的动作;
+- 回写只负责把已经存在的脏状态持久化。
+
+## 8. mapwriter 和 mapcompactor
+
+Umbra 目前有两个后台 worker:
+
+- `mapwriter`
+ - 统计 MAP 分配压力;
+ - 刷普通 MAP 页;
+ - 做预分配。
+- `mapcompactor`
+ - 负责物理迁移;
+ - 负责回收。
+
+`mapwriter` 会扫描 `MapSuperEntry` 判断是否需要预分配,但它不负责把脏
+superblock 持久化。脏 superblock 仍由 checkpoint 负责刷盘。
+
+## 9. WAL / redo 边界
+
+WAL / redo 的边界如下:
+
+- `xloginsert.c`
+ - 生成 remap header;
+ - 在 WAL insert 成功后发布映射和 frontier。
+- `xlogutils.c`
+ - 在 redo 端解释 remap;
+ - 区分 remap-with-image 和 remap-without-image。
+- `umbra_xlog.c`
+ - 记录显式的 MAP 生命周期事件。
+
+redo 端必须理解 remap,因为普通的 block-read helper 并不知道:
+
+- 当前记录是否带 remap;
+- 是否带 image;
+- 旧物理基线是什么;
+- 新物理目标是什么;
+- frontier payload 是什么。
+
+## 10. 架构取舍
+
+当前原型有两项刻意保留的架构选择,需要明确写出来。
+
+### 10.1 空间清理策略
+
+一旦逻辑块号和物理块位置解耦,Umbra 至少有两种物理空间管理方式:
+
+1. 像 PostgreSQL 传统可复用空间那样,尽量立即复用已经释放的物理块;
+2. 让物理前沿持续向前推进,把 reclaim / reuse 作为后续后台清理策略,而不是
+ 前台分配路径的同步要求。
+
+当前原型有意选择第二种。
+
+原因不是“不能复用”,而是如果前台路径立即承担复用,就会把一整套复杂度重新
+拉回分配主路径:
+
+- free-space accounting 会进入正常分配决策;
+- remap 发布会和 reuse eligibility 更紧地耦合;
+- WAL / redo 需要维护更多分配状态不变量;
+- in-flight 所有权与恢复竞争会更难推理。
+
+既然 Umbra 已经把逻辑身份和物理位置解耦,那么这层解耦带来的一个核心收益,
+就是所有权边界和正确性规则更简单。因此当前原型选择:
+
+- 前台路径优先让物理块单调向前推进;
+- reclaim / compaction 负责后续清理旧物理空间;
+- 复用存在,但它属于后台策略,不是前台同步 contract。
+
+换句话说:
+
+- 当前原型的首要目标不是“立即复用物理块”;
+- 当前优先级是正确性和更简单的所有权边界;
+- reclaim 确实存在,但它被刻意放在后台,而不是前台分配主路径上。
+
+### 10.2 双层 buffer 边界
+
+Umbra 还刻意保留了一层内部缓冲复杂度,而不是一开始就试图把所有东西直接压进
+PostgreSQL 的通用 buffer 模型。
+
+这意味着当前原型接受一种双层缓冲形态:
+
+- PostgreSQL 保留上层通用 buffer / cache 行为;
+- Umbra 保留自己的 MAP buffer、superblock 共享状态、in-flight 跟踪,以及
+ 物理文件回写状态。
+
+这样做有三个原因:
+
+1. 尽量把 Umbra 特有的 remap、分配器、metadata 生命周期复杂度封装在
+ Umbra 内部,而不是泄漏到 PostgreSQL 的通用 buffer 所有权模型中;
+2. 先实际观察双层 buffer 对整个系统的影响,而不是预设它一定必须被消除;
+3. 给未来的部署模型留空间,包括云原生场景里可能出现的存储侧服务与本地缓存
+ 边界;这些场景未必符合传统单机、单层 buffer 的假设。
+
+这是一种工程取舍,不是说双层 buffer 永远最优。当前原型的选择是:
+
+- 先保证模块隔离和语义清晰;
+- 更深入的 buffer 模型合并,留作后续优化和设计问题。
+
+## 11. 当前架构债务
+
+当前仍有一些工程债:
+
+- `mapsuper.c` 仍然偏大;
+- `mapbgproc.c` 同时包含预分配、回收、压实和唤醒逻辑;
+- `umfile.c` 同时包含 context / session 以及底层段文件操作;
+- compactor / reclaim 还不是最终的生产级空间管理。
+
+这些目前是后续重构点,不是当前原型最核心的正确性阻塞项。
diff --git a/doc/umbra/PROTOTYPE.md b/doc/umbra/PROTOTYPE.md
new file mode 100644
index 0000000000..fd9ea67654
--- /dev/null
+++ b/doc/umbra/PROTOTYPE.md
@@ -0,0 +1,86 @@
+# Umbra Prototype and Repository Navigation
+
+This document explains how the current PostgreSQL master PoC relates to the
+earlier PostgreSQL 12.2 shadow prototype.
+
+## 1. Repository Layout
+
+The public repository is intended to keep both the old prototype and the
+current master-port implementation in one place, separated by Git branches
+rather than by copying source trees into subdirectories.
+
+Repository:
+
+- `https://github.com/nayishan/postgre_umbra`
+
+Expected branch roles:
+
+- `umbra-poc-pgmaster`
+ - PostgreSQL-master-based Umbra PoC
+ - full implementation branch intended for community reading
+ - includes MAP metadata, WAL/redo integration, mapwriter, compactor, tests,
+ and documentation
+- `shadow-pg12-archive`
+ - archived PostgreSQL 12.2 shadow prototype
+ - useful for understanding the original minimal idea without the full
+ master-port integration burden
+
+The rules to keep are:
+
+- one repository
+- separate branches
+- clear README navigation
+- no mixing PostgreSQL 12.2 prototype files into the PostgreSQL master branch
+
+## 2. Why Keep The Prototype
+
+The PostgreSQL 12.2 shadow prototype is useful because it shows the original
+idea with fewer host-tree integration details.
+
+It is not a substitute for the master PoC, but it helps answer questions such
+as:
+
+- what is the minimal logical-to-physical mapping idea?
+- why does Umbra live below upper PostgreSQL logical block addressing?
+- how did the MAP state-machine idea evolve?
+- which parts are core design and which parts are master-port engineering?
+
+The master PoC is much larger because it must deal with:
+
+- current `smgr` boundaries
+- WAL block registration
+- redo paths
+- checkpoint/writeback
+- relation lifecycle
+- skip-WAL relations
+- background maintenance
+- TAP recovery tests
+
+## 3. How To Read The Two Branches
+
+Read the branches in this order if the goal is to understand the design:
+
+1. Read the repository `README.md` on `umbra-poc-pgmaster`.
+2. Read `doc/umbra/ARCHITECTURE.md` for the current module boundaries.
+3. Read `doc/umbra/WAL_AND_REDO.md` for the WAL and recovery model.
+4. Read `doc/umbra/REVIEW_GUIDE.md` for suggested review entry points.
+5. Read `doc/umbra/UMBRA_FPW_STORY.md` (or `UMBRA_FPW_STORY_ZH.md`, the
+ Chinese version) for the FPW-to-remap design story.
+6. If anything is still unclear, go back to the shadow prototype for the
+ minimal mapping idea.
+
+The prototype should be treated as background material. The master PoC is the
+branch that should be used for current testing and review.
+
+## 4. Development Transparency
+
+The original design direction, boundary choices, and state-machine reasoning
+come from the author. The PostgreSQL 12.2 shadow prototype was used as an
+important reference while building the master PoC.
+
+The master-port implementation also used AI coding assistance extensively for
+repetitive implementation work and for code shaped after both the prototype
+and existing PostgreSQL subsystems. That assistance was not sufficient to
+reason independently about database-kernel concurrency, WAL ordering, or
+recovery correctness. The difficult part was repeatedly checking the logic,
+finding incorrect assumptions, and correcting the implementation.
diff --git a/doc/umbra/PROTOTYPE_ZH.md b/doc/umbra/PROTOTYPE_ZH.md
new file mode 100644
index 0000000000..2dafc78efd
--- /dev/null
+++ b/doc/umbra/PROTOTYPE_ZH.md
@@ -0,0 +1,74 @@
+# Umbra 原型与仓库导航
+
+本文档是 `PROTOTYPE.md` 的中文配套版本,说明当前 PostgreSQL master 上的
+Umbra 原型,与早期 PostgreSQL 12.2 shadow 原型之间的关系。
+
+## 1. 仓库结构
+
+建议把早期原型和当前原型放在同一个 GitHub 仓库里,通过分支隔离。
+
+仓库:
+
+- `https://github.com/nayishan/postgre_umbra`
+
+建议分支:
+
+- `umbra-poc-pgmaster`
+ - 基于 PostgreSQL master 的完整 Umbra 原型;
+ - 面向社区阅读和测试;
+ - 包含 MAP 元数据、WAL / redo、mapwriter、compactor、测试和文档。
+- `shadow-pg12-archive`
+ - PostgreSQL 12.2 的 shadow 原型归档分支;
+ - 适合理解最初的核心映射思路。
+
+原则如下:
+
+- 一个仓库;
+- 不同分支做物理隔离;
+- 不把 PG12 原型文件混入 master 原型分支;
+- 用根 `README` 提供清晰导航。
+
+## 2. 为什么保留原型
+
+PG12 shadow 原型的价值,在于展示最小逻辑:
+
+- 为什么要在 `smgr` 下方做逻辑块到物理块的映射;
+- 最原始的 MAP 状态机是什么;
+- 哪些是核心设计;
+- 哪些是迁移到 PostgreSQL master 之后才出现的工程复杂度。
+
+master 原型会更复杂,因为它必须处理:
+
+- 当前的 `smgr` 边界;
+- WAL block registration;
+- redo;
+- checkpoint / 回写;
+- relation 生命周期;
+- skip-WAL relation;
+- 后台维护;
+- recovery TAP。
+
+## 3. 阅读顺序
+
+建议按下面的顺序阅读:
+
+1. 先看 `umbra-poc-pgmaster` 分支上的仓库根目录 `README.md`;
+2. 后续等文档集导入该分支后,再看 `UMBRA_FPW_STORY_ZH.md`,理解更完整的
+ 设计演化叙事;
+3. 后续等文档集导入该分支后,再看 `ARCHITECTURE.md`,理解模块边界;
+4. 后续等文档集导入该分支后,再看 `WAL_AND_REDO.md`,理解正确性的
+ owner model;
+5. 后续等文档集导入该分支后,再看 `REVIEW_GUIDE.md`,找到代码入口;
+6. 如果仍有不理解的地方,再回头看 shadow 原型,理解最小映射思路。
+
+原型是背景材料,master 原型才是当前测试和审阅的对象。
+
+## 4. 实现透明度
+
+核心架构、边界选择和状态机推演来自作者;PG12 shadow 原型也是 master 原型的
+重要参考。
+
+在 `master-port` 的实现过程中,也大量使用了 AI 编码助手来处理重复实现和迁移
+工作,并参考原型以及 PostgreSQL 现有实现来组织代码。但 AI 并不具备独立理解
+数据库内核并发、WAL 顺序和恢复正确性的能力;真正困难的部分,是持续审查逻辑、
+识别错误假设并修正实现。
diff --git a/doc/umbra/REVIEW_GUIDE.md b/doc/umbra/REVIEW_GUIDE.md
new file mode 100644
index 0000000000..0c5f8bf803
--- /dev/null
+++ b/doc/umbra/REVIEW_GUIDE.md
@@ -0,0 +1,210 @@
+# Umbra Review Guide
+
+This note is for reviewers and maintainers reading the Umbra PostgreSQL master
+PoC patch series. It does not replace the architecture and WAL/redo documents.
+It describes how to read the patch and which invariants should be checked
+first.
+
+## 1. Patch Shape
+
+The patch is not intended to hide subsystem boundaries. The main review units
+are:
+
+- build flag and storage-manager dispatch
+- internal metadata fork and physical file layer
+- MAP buffer, superblock, in-flight owner, and write-barrier subsystem
+- WAL block-header remap encoding
+- redo-time remap interpretation
+- skip-WAL dense-map bootstrap
+- background mapwriter/mapcompactor maintenance
+- recovery and regression tests
+
+For a line-by-line review, it is usually better to read the patch by those
+units rather than by file order.
+
+## 2. What Umbra Changes
+
+Umbra keeps PostgreSQL's upper-layer logical block addressing. The storage
+manager translates mapped forks from logical block numbers to physical block
+numbers underneath.
+
+The mapped forks are:
+
+- `MAIN_FORKNUM`
+- `FSM_FORKNUM`
+- `VISIBILITYMAP_FORKNUM`
+
+The persistent mapping state lives in an internal metadata fork owned by
+Umbra. That metadata fork is not a normal PostgreSQL page fork and must not
+enter generic shared-buffer, full-page-image, checksum, or page-LSN paths.
+
+## 3. Where to Start Reading
+
+Start with:
+
+- `src/backend/storage/smgr/smgr.c`
+ - storage-manager dispatch and the `--with-umbra` boundary
+- `src/backend/storage/smgr/umbra.c`
+ - runtime access policy and logical-to-physical translation
+- `src/backend/storage/smgr/umfile.c`
+ - physical file operations below Umbra
+- `src/backend/storage/map/`
+ - MAP metadata, reservations, writeback, and background work
+- `src/backend/access/transam/xloginsert.c`
+ - producer-side remap decisions and header encoding
+- `src/backend/access/transam/xlogreader.c`
+ - remap header parsing
+- `src/backend/access/transam/xlogutils.c`
+ - redo-time remap interpretation
+- `src/backend/access/transam/umbra_xlog.c`
+ - Umbra rmgr records
+
+## 4. Core Correctness Invariants
+
+Review these invariants before focusing on micro-optimizations:
+
+- WAL publication must happen before committed MAP publication.
+- A pending reservation chooses a physical block but does not publish the
+ logical-to-physical mapping or committed allocator frontier.
+- First-born pages publish logical EOF explicitly through WAL-owned remap or
+ range remap state.
+- Ordinary remap-without-image redo consumes the old physical baseline before
+ publishing the new physical mapping.
+- `next_free_pblkno` is the committed allocator frontier, not necessarily
+ `new_pblkno + 1`.
+- The runtime reservation frontier lives in `MapSuperEntry` shared state, not
+ in the on-disk superblock.
+- The invariant `committed next_free_pblkno <= reservation frontier` must hold
+ under the shared-entry lock and should be asserted in the implementation.
+- MAP superblock logical EOF, materialized physical frontier, and allocator
+ frontier are separate facts.
+- Checkpoint and mapwriter write existing MAP metadata blocks; they do not
+ materialize missing MAP blocks during flush.
+- Redo owns redo-only metadata bootstrap for mapped forks.
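+
+The two-frontier invariants above can be sketched in a few lines of C. This is
+only an illustration, not the patch's code: `FrontierSketch`,
+`sketch_reserve()`, and `sketch_commit()` are hypothetical names, locking is
+omitted, and the "keep the maximum" commit rule is an assumption used to show
+why `next_free_pblkno` is not necessarily `new_pblkno + 1`.
+
+```c
+#include <assert.h>
+#include <stdint.h>
+
+typedef uint32_t BlockNumber;
+
+/* Hypothetical stand-in for the shared-entry state (not MapSuperEntry). */
+typedef struct FrontierSketch
+{
+    BlockNumber committed_next_free;    /* persisted, published via WAL */
+    BlockNumber reservation_frontier;   /* shared memory only */
+} FrontierSketch;
+
+/* Reserve one physical block: only the runtime frontier moves. */
+static BlockNumber
+sketch_reserve(FrontierSketch *f)
+{
+    BlockNumber pblk = f->reservation_frontier++;
+
+    assert(f->committed_next_free <= f->reservation_frontier);
+    return pblk;
+}
+
+/* Publish after WAL insert: the committed frontier only moves forward. */
+static void
+sketch_commit(FrontierSketch *f, BlockNumber new_next_free)
+{
+    if (new_next_free > f->committed_next_free)
+        f->committed_next_free = new_next_free;
+    assert(f->committed_next_free <= f->reservation_frontier);
+}
+```
+
+If two backends reserve blocks 10 and 11 and the second one publishes first,
+the committed frontier becomes 12 and stays there when the first one publishes
+later, so it never regresses and never passes the reservation frontier.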
+
+## 5. WAL Review Checklist
+
+Umbra has two WAL-visible mechanisms.
+
+Block-reference remap metadata is attached to ordinary WAL records with
+`BKPBLOCK_HAS_REMAP`:
+
+- full remap header:
+ - `old_pblkno`
+ - `new_pblkno`
+ - `logical_nblocks`
+ - `next_free_pblkno`
+- compact birth header:
+ - `new_pblkno`
+ - `logical_nblocks`
+ - `next_free_pblkno`
+- ordinary slim header:
+ - `old_pblkno`
+ - `new_pblkno`
+ - `next_free_pblkno`
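+
+As a reading aid, the three header shapes can be pictured as plain C structs.
+The struct names, the 32-bit field widths, and the `sketch_pick_variant()`
+selection rule are assumptions made for illustration; the real encoding is
+whatever `xloginsert.c` and `xlogreader.c` implement.
+
+```c
+#include <assert.h>
+#include <stdint.h>
+
+typedef uint32_t BlockNumber;
+
+/* Hypothetical layouts; field names follow the list above. */
+typedef struct RemapFullSketch
+{
+    BlockNumber old_pblkno;
+    BlockNumber new_pblkno;
+    BlockNumber logical_nblocks;
+    BlockNumber next_free_pblkno;
+} RemapFullSketch;
+
+typedef struct RemapBirthSketch         /* no old physical baseline */
+{
+    BlockNumber new_pblkno;
+    BlockNumber logical_nblocks;
+    BlockNumber next_free_pblkno;
+} RemapBirthSketch;
+
+typedef struct RemapSlimSketch          /* no logical EOF field */
+{
+    BlockNumber old_pblkno;
+    BlockNumber new_pblkno;
+    BlockNumber next_free_pblkno;
+} RemapSlimSketch;
+
+static_assert(sizeof(RemapSlimSketch) < sizeof(RemapFullSketch),
+              "slim header should be smaller than the full header");
+
+typedef enum { REMAP_FULL, REMAP_BIRTH, REMAP_SLIM } RemapVariantSketch;
+
+/* Assumed rule: births have no old page; slim is used when EOF is unchanged. */
+static RemapVariantSketch
+sketch_pick_variant(int has_old_pblk, int eof_changes)
+{
+    if (!has_old_pblk)
+        return REMAP_BIRTH;
+    return eof_changes ? REMAP_FULL : REMAP_SLIM;
+}
+```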
+
+Umbra rmgr records are separate lifecycle records:
+
+- `XLOG_UMBRA_MAP_SET`
+- `XLOG_UMBRA_RANGE_REMAP`
+- `XLOG_UMBRA_RANGE_REMAP_COMPACT`
+- `XLOG_UMBRA_SKIP_WAL_DENSE_MAP`
+- `XLOG_UMBRA_RECLAIM_UNLINK`
+
+The important review point is that these are complementary, not substitutes.
+Block-header remap is for replaying ordinary WAL block content against the
+right physical baseline. Umbra rmgr records are for explicit MAP lifecycle
+events outside the ordinary block-reference owner.
+
+## 6. Full-Page Image Boundaries
+
+Umbra does not globally disable full-page writes.
+
+It replaces the ordinary checkpoint-boundary image path with remap metadata
+when the record is eligible for automatic remap. Images are still kept when
+the caller explicitly owns an image or when consistency checking requires one.
+
+Known conservative cases:
+
+- `REGBUF_FORCE_IMAGE` keeps image semantics.
+- `XLR_CHECK_CONSISTENCY` keeps verification images.
+- `XLOG_FPI_FOR_HINT` keeps the PostgreSQL hint-image rule and does not use
+ Umbra remap today.
+
+That last point is deliberate. Hint-bit FPI optimization would require a
+separate checksum/torn-page protection design; it is not a header encoding
+optimization.
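+
+The image boundaries in this section amount to a small decision function. The
+sketch below is a simplified model under stated assumptions: the struct,
+enum, and function names are hypothetical, and the flags are modeled as
+booleans rather than the real `REGBUF_*` / `XLR_*` bit masks.
+
+```c
+#include <assert.h>
+#include <stdbool.h>
+#include <stddef.h>
+
+typedef enum
+{
+    SKETCH_NO_BASELINE,     /* not the first dirty after a checkpoint */
+    SKETCH_KEEP_IMAGE,      /* log a full-page image as stock md would */
+    SKETCH_USE_REMAP        /* log remap metadata instead of an image */
+} BaselineSketch;
+
+typedef struct BlockRefSketch
+{
+    bool first_dirty_after_checkpoint;
+    bool force_image;           /* caller passed REGBUF_FORCE_IMAGE */
+    bool check_consistency;     /* record carries XLR_CHECK_CONSISTENCY */
+    bool is_hint_fpi;           /* record is XLOG_FPI_FOR_HINT */
+    bool remap_eligible;        /* mapped fork, automatic remap allowed */
+} BlockRefSketch;
+
+static BaselineSketch
+sketch_choose_baseline(const BlockRefSketch *b)
+{
+    assert(b != NULL);
+
+    /* Conservative owners always keep PostgreSQL's image semantics. */
+    if (b->force_image || b->check_consistency || b->is_hint_fpi)
+        return SKETCH_KEEP_IMAGE;
+    if (!b->first_dirty_after_checkpoint)
+        return SKETCH_NO_BASELINE;
+    return b->remap_eligible ? SKETCH_USE_REMAP : SKETCH_KEEP_IMAGE;
+}
+```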
+
+## 7. Skip-WAL Dense Map
+
+Skip-WAL relations are handled as a dense physical build while the relation is
+still in the skip-WAL pending window.
+
+The WAL anchor is `XLOG_UMBRA_SKIP_WAL_DENSE_MAP`. For each encoded fork it
+means:
+
+- `[0, nblocks)` is dense
+- `pblk == lblk` in that range
+- `logical_nblocks = nblocks`
+- `physical_nblocks = nblocks`
+- `next_free_pblkno = nblocks`
+
+The record does not encode empty forks. An entry with `nblocks == 0` has no
+mapping work and should not be produced.
+
+The record is not a data-file fsync replacement and is not a generic
+`MAP_SUPER_INIT`. The existing skip-WAL sync protocol still owns durability;
+the dense-map record gives redo an exact mapping/frontier anchor.
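+
+A minimal sketch of what replaying one encoded fork entry means, assuming a
+hypothetical `DenseForkSketch` struct in place of the real superblock state:
+
+```c
+#include <assert.h>
+#include <stdint.h>
+
+typedef uint32_t BlockNumber;
+
+/* Hypothetical per-fork state; not the real superblock layout. */
+typedef struct DenseForkSketch
+{
+    BlockNumber logical_nblocks;
+    BlockNumber physical_nblocks;
+    BlockNumber next_free_pblkno;
+} DenseForkSketch;
+
+/* Redo of one encoded fork entry: every frontier lands on nblocks. */
+static void
+sketch_redo_dense_map(DenseForkSketch *f, BlockNumber nblocks)
+{
+    assert(nblocks > 0);        /* empty forks are never encoded */
+    f->logical_nblocks = nblocks;
+    f->physical_nblocks = nblocks;
+    f->next_free_pblkno = nblocks;
+}
+
+/* Inside the dense range the mapping is the identity: pblk == lblk. */
+static BlockNumber
+sketch_dense_lookup(const DenseForkSketch *f, BlockNumber lblk)
+{
+    assert(lblk < f->logical_nblocks);
+    return lblk;
+}
+```
+
+The sketch only sets mapping and frontier state. It performs no file sync,
+which matches the point above that durability stays with the existing
+skip-WAL sync protocol.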
+
+## 8. What Is Intentionally Not Solved Here
+
+The current patch does not try to solve every possible WAL byte optimization.
+
+It intentionally does not add:
+
+- a tiny birth header that drops both frontier fields
+- per-block remap variant tags inside mixed records
+- remap optimization for checksum-driven hint FPIs
+- range relocation WAL for compactor moves
+- a default-on storage-manager behavior
+
+Those are separate follow-up designs. The current patch favors deterministic
+ownership and reviewable replay semantics over maximum header compression.
+
+## 9. Test Baseline
+
+The current correctness baseline is the md/Umbra matrix below. When switching
+between modes in the same source tree, clean the previous build first.
+
+```sh
+make distclean
+./configure
+make
+make check
+make -C src/test/recovery check
+
+make distclean
+./configure --with-umbra
+make
+make check
+make -C src/test/recovery check
+```
+
+Umbra-only recovery tests are expected to skip in md mode and run in
+`--with-umbra` mode.
+
+The torn-page remap test is especially important:
+
+- `src/test/recovery/t/074_umbra_torn_page_remap.pl`
+ - md negative control with `full_page_writes=off`
+ - Umbra positive recovery path with `full_page_writes=on`
+ - recovery verification uses an ordered relation digest, not just row count
+
+## 10. Longer Reference Material
+
+Reviewers should also read:
+
+- [ARCHITECTURE.md](./ARCHITECTURE.md)
+- [WAL_AND_REDO.md](./WAL_AND_REDO.md)
+- [PROTOTYPE.md](./PROTOTYPE.md)
+- [UMBRA_FPW_STORY.md](./UMBRA_FPW_STORY.md) ([Chinese](./UMBRA_FPW_STORY_ZH.md))
diff --git a/doc/umbra/REVIEW_GUIDE_ZH.md b/doc/umbra/REVIEW_GUIDE_ZH.md
new file mode 100644
index 0000000000..b1de862a2e
--- /dev/null
+++ b/doc/umbra/REVIEW_GUIDE_ZH.md
@@ -0,0 +1,133 @@
+# Umbra 审阅指南(中文版)
+
+本文档是 `REVIEW_GUIDE.md` 的中文配套版本,用于说明阅读当前 Umbra 原型
+patch 序列时,先看什么、重点看什么。
+
+## 1. patch 的关注点
+
+审阅时最值得关注的是:
+
+- 架构边界是否合理;
+- `smgr` 接入是否清晰;
+- MAP 元数据的所有权边界是否清楚;
+- WAL / remap 的所有权模型是否正确;
+- redo 是否具备确定性;
+- checkpoint / 回写边界是否正确;
+- 测试是否覆盖了核心风险。
+
+## 2. 建议阅读顺序
+
+建议先从这些文件入手:
+
+- `src/backend/storage/smgr/smgr.c`
+ - `storage manager` 的分派逻辑;
+ - `--with-umbra` 的边界。
+- `src/backend/storage/smgr/umbra.c`
+ - 运行时访问策略;
+ - 逻辑块到物理块的翻译。
+- `src/backend/storage/smgr/umfile.c`
+ - 物理文件层;
+ - 段文件、同步、删除、dense/sparse 语义。
+- `src/backend/storage/map/`
+ - MAP 元数据、buffer、superblock、刷盘和后台工作。
+- `src/backend/access/transam/xloginsert.c`
+ - WAL 生成端的 remap 判定。
+- `src/backend/access/transam/xlogreader.c`
+ - remap header 的解析。
+- `src/backend/access/transam/xlogutils.c`
+ - redo 阶段对 remap 的解释。
+- `src/backend/access/transam/umbra_xlog.c`
+ - Umbra 自己的 rmgr 记录。
+
+## 3. 核心正确性不变量
+
+优先审阅这些不变量:
+
+- WAL 的发布必须先于已提交 MAP 的发布;
+- pending 预留只负责选择物理块,不发布已提交映射;
+- first-born 必须显式发布逻辑 EOF;
+- 不带 image 的 remap redo 必须先消费旧物理基线;
+- `next_free_pblkno` 表示已提交的分配前沿;
+- 运行时预留前沿只存在于共享内存中;
+- 已提交的 `next_free_pblkno <= reservation frontier`;
+- 逻辑 EOF、物理容量、分配前沿是不同事实;
+- checkpoint / mapwriter 只写已经存在的 MAP 元数据块;
+- redo 拥有只在恢复阶段需要的 metadata bootstrap。
+
+## 4. Full-Page Image 的边界
+
+Umbra 并没有全局关闭 full-page writes。
+
+它只是在满足条件的 checkpoint 边界普通场景里,用 remap 元数据替代默认的
+image 路径。
+
+保守边界如下:
+
+- `REGBUF_FORCE_IMAGE` 保留 image;
+- `XLR_CHECK_CONSISTENCY` 保留校验 image;
+- `XLOG_FPI_FOR_HINT` 当前不走 Umbra remap。
+
+hint-bit 的 FPI 优化需要单独的 checksum / torn-page 保护设计,不应该混入
+当前的 header 编码优化里。
+
+## 5. Skip-WAL Dense Map
+
+skip-WAL relation 在 pending 窗口内按 dense 物理布局处理。
+
+`XLOG_UMBRA_SKIP_WAL_DENSE_MAP` 表示:
+
+- `[0, nblocks)` 是 dense;
+- `pblk == lblk`;
+- `logical_nblocks = nblocks`;
+- `physical_nblocks = nblocks`;
+- `next_free_pblkno = nblocks`。
+
+这个记录不是 `fsync` 的替代品;它只是 redo 阶段的 mapping / frontier 锚点。
+
+## 6. 当前不解决的问题
+
+当前 patch 不试图解决所有 WAL 字节数优化。
+
+明确不作为当前目标的内容包括:
+
+- 更小的 birth header;
+- mixed record 中每个 block 各自独立的 variant tag;
+- 基于 checksum 的 hint FPI remap 优化;
+- 更激进的 compactor range relocation WAL;
+- 默认开启 storage manager;
+- 完整的生产级空间管理。
+
+当前优先级是确定性的所有权边界,以及可审阅的回放语义。
+
+## 7. 测试基线
+
+完整正确性矩阵如下。在同一个源码树里切换构建模式时,先清理上一次构建。
+
+```sh
+make distclean
+./configure
+make
+make check
+make -C src/test/recovery check
+
+make distclean
+./configure --with-umbra
+make
+make check
+make -C src/test/recovery check
+```
+
+重点测试包括:
+
+- `src/test/recovery/t/074_umbra_torn_page_remap.pl`
+ - 在 md 模式下充当反向对照;
+ - 在 Umbra 模式下验证:即使 remap 后的新物理页被破坏,恢复仍然能够成功;
+ - 恢复后检查的是按顺序计算的 relation 摘要,而不只是 row count。
+
+## 8. 审阅结论应该关注什么
+
+- 架构层面的反馈;
+- WAL / remap 所有权模型的反馈;
+- redo 正确性的反馈;
+- `smgr` 边界的反馈;
+- checkpoint / 回写边界的反馈。
diff --git a/doc/umbra/UMBRA_FPW_STORY.md b/doc/umbra/UMBRA_FPW_STORY.md
new file mode 100644
index 0000000000..d15750cf15
--- /dev/null
+++ b/doc/umbra/UMBRA_FPW_STORY.md
@@ -0,0 +1,708 @@
+# Umbra FPW-to-Remap Design Story
+
+[Chinese](./UMBRA_FPW_STORY_ZH.md)
+
+## 1. Background: Which FPW Cost Umbra Targets
+
+PostgreSQL currently relies on full-page writes (FPW) for crash-recovery
+correctness. The basic rule is that after a checkpoint, the first update to a
+page logs a full-page image into WAL and uses that image as the new recovery
+baseline.
+
+Umbra does not try to remove every full-page image. The current implementation
+targets the ordinary checkpoint-boundary image path. For ordinary data-page WAL
+records that satisfy the automatic remap conditions, Umbra replaces the default
+full-page image with remap-aware recovery metadata. The following conservative
+paths still keep image ownership:
+
+- `REGBUF_FORCE_IMAGE`
+- `XLR_CHECK_CONSISTENCY`
+- `XLOG_FPI_FOR_HINT`
+
+So the more accurate goal is:
+
+- do not claim that all FPIs disappear
+- provide an alternative recovery-baseline representation for the ordinary
+ checkpoint-boundary case
+
+This is worth exploring because the stock `md` ordinary checkpoint-boundary
+image path repeatedly binds several costs together:
+
+- the first dirty of a page after a checkpoint takes an extra code path, though
+ the dominant cost is the I/O behind it rather than the path itself
+- WAL grows substantially because of full-page images, increasing WAL write and
+ sync pressure
+- data-file write amplification and WAL-side write amplification stack together
+- for update-heavy workloads, this I/O pressure repeats in every checkpoint
+ interval
+
+Umbra is not a local tweak to that path. It tries to take over the path with a
+different way to express the recovery baseline.
+
+## 2. Current Scope, Non-Goals, and Terminology
+
+This section fixes the scope, non-goals, and terminology.
+
+Current scope:
+
+- use remap-based recovery metadata to take over the default ordinary
+ checkpoint-boundary FPW image path
+- discuss a PostgreSQL `storage manager` / physical storage-layer prototype,
+ not a new table AM or a general storage engine
+- use `P1-P9` to describe the semantic split of the current PoC branch, not to
+ describe an exact one-to-one mapping between arbitrary working branches and
+ patch numbers
+
+Current non-goals and conservative boundaries:
+
+- do not claim to remove every full-page image; `REGBUF_FORCE_IMAGE`,
+ `XLR_CHECK_CONSISTENCY`, and `XLOG_FPI_FOR_HINT` still keep image ownership
+- do not claim that compactor, AIO, primary/standby physical-page alignment,
+ `CREATE DATABASE` copy strategy, or explicit range-born protocol are fully
+ engineered and closed
+- do not treat `md + fpw=off` as a correctness-equivalent baseline
+
+Terms used throughout this document:
+
+- `birth`: the first durable `lblk -> pblk` relation for a logical page
+- `remap`: moving an existing logical page to a new physical page
+- `mapset`: direct publication of one mapping relation
+- ordinary checkpoint-boundary FPW: the ordinary first-dirty-after-checkpoint
+ path that defaults to a page image
+- reclaim boundary: the physical boundary up to which reclaim / unlink may
+ safely advance
+- `compactor`: the background organizer that scans sparse regions and moves
+ still-live pages away
+- `reclaim`: the lifecycle action that enters the safe unlink path once a
+ physical region is confirmed to have no live mappings
+
+## 3. Design Boundary: Storage Metadata and WAL Own the Recovery Core
+
+Umbra is deliberately narrow in scope. It does not try to rewrite PostgreSQL's
+execution layer or spread changes into large upper-layer abstractions.
+
+More precisely, the current implementation is not a new table AM and not a
+standalone general-purpose storage engine. It is closer to a prototype at the
+PostgreSQL `storage manager` / physical storage layer. Upper layers still use
+logical block numbers; below `smgr`, Umbra provides `lblk -> pblk` translation
+for mapped forks.
+
+From the crash-recovery perspective, the correctness core converges on two
+layers:
+
+- storage metadata
+- WAL
+
+Storage metadata has two object types:
+
+- per-page map entry
+- fork-level superblock
+
+The "crash-recovery core" here does not mean that only these three pieces
+participate in recovery. It means that after a crash, redo needs three kinds of
+minimal durable truth (map entry, superblock, and WAL) to restore correct page
+contents:
+- map entry: which physical block a logical block should currently map to
+- superblock: fork-level boundary state, such as logical EOF, physical capacity,
+ and committed allocator frontier
+- WAL: which map-entry, superblock, and physical-page lifecycle changes were
+ atomically published, and in what order redo must replay them
+
+As long as those three facts stay consistent in redo, an ordinary remap update
+can be recovered as "old physical page plus WAL delta", without using a
+checkpoint-boundary full-page image as the recovery baseline.
+
+Runtime concurrency correctness has one more explicit mechanism:
+
+- inflight claim / barrier
+
+This mechanism is not WAL encoding and not durable truth in the recovery log. It
+serializes publication order among foreground remap, background compactor
+relocation, and physical writes, so that the durable truth later written to WAL
+is itself valid. A more precise split is:
+
+- durable truth for crash recovery mainly comes from `map entry + superblock +
+ WAL`
+- runtime concurrency correctness also explicitly depends on inflight / barrier
+
+## 4. Map Entry: Page-Level Mapping Truth
+
+Umbra's core abstraction is to split logical page identity from physical
+placement.
+
+In this model:
+
+- the logical page is the data-page identity that upper layers care about
+- the physical page is only the on-disk location currently carrying that logical
+ page
+- the map entry records the current `lblk -> pblk` relation
+
+Therefore, the map entry tells us which physical page currently stores one
+logical page. It is the page-level local truth.
+
+## 5. Superblock: Fork-Level Global Truth
+
+A per-page map entry is not enough. Many correctness properties are not
+expressible by one mapping alone; they depend on the boundary state of the
+whole fork. The superblock owns that state.
+
+The superblock is Umbra's fork-level correctness anchor. It maintains at least
+these facts:
+
+- logical boundary of the fork, such as `logical_nblocks`
+- committed physical allocation boundary, such as `next_free_pblkno`
+- materialized physical capacity
+- the safe boundary for reclaim / unlink
+- fork-level frontier facts needed by redo
+
+One easy-to-miss distinction is the runtime reservation frontier. Runtime code
+also needs a reservation frontier in `MapSuperEntry` shared state to allocate
+new `pblk` values concurrently in foreground backends. That frontier is not
+flushed to disk and does not participate in checkpoint. The on-disk
+`next_free_pblkno` in the superblock only represents the committed frontier.
+The implementation should maintain and assert:
+`committed next_free <= reservation frontier`.
+
+In short:
+
+- map entry owns page-level mapping truth
+- superblock owns global fork-boundary truth
+
+Many extend, truncate, reclaim, unlink, and redo correctness properties
+ultimately depend on the superblock.
+
+## 6. WAL: Atomic Publication and Recoverability
+
+Storage metadata describes state, but state alone is not enough. To take over
+the ordinary checkpoint-boundary image path, Umbra must publish and replay the
+following actions atomically:
+
+- birth
+- remap
+- mapset
+- related fork-level frontier changes, such as committed frontier, logical
+ size, and capacity
+
+That is why the WAL layer exists in this design.
+
+For the ordinary checkpoint-boundary path handled by Umbra, recovery
+correctness no longer depends on whether the record carries a full-page image.
+It depends on:
+
+- whether map entry and superblock state are correct
+- whether those state changes were recorded and replayed by WAL as atomic
+ events
+
+This must not be expanded into "Umbra no longer depends on images in any
+recovery path". Conservative image owners still exist, and the three paths
+listed in Section 1 keep PostgreSQL's original image semantics.
+
+### 6.1 Lifecycle of the Old Physical Page Before Remap
+
+The old physical page in an ordinary remap is not a temporary page that can be
+immediately discarded or reused. Before the remap is published, it is the
+committed physical baseline pointed to by the current map entry. It is also
+the old baseline that no-image delta redo may need to read.
+
+The lifecycle is:
+
+- before remap publication, `old_pblk` is still the current durable mapping for
+ `lblk`; even if a backend has chosen `new_pblk`, `old_pblk` cannot be treated
+ as free space
+- after successful WAL insert, the remap record publishes the `old_pblk ->
+ new_pblk` transition as an atomic event; normal runtime state switches the map
+ entry to `new_pblk`
+- during crash recovery, for a no-image remap, redo first reads the old physical
+ page through `old_pblk`, applies the WAL delta on top of that old baseline,
+ and only then publishes `new_pblk`
+- after remap publication, `old_pblk` is no longer the current mapping of that
+ logical page, but it still does not enter foreground reuse; it only becomes a
+ candidate for background space cleanup, constrained by live mappings, reclaim
+ boundary, checkpoint, and redo semantics
+
+So Umbra does not turn "overwrite the old page" into "immediately reuse the old
+page". It changes the recovery baseline: the old physical page remains
+available when a WAL record needs it, while the new physical page becomes the
+current mapping through remap.
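+
+The redo ordering described above (read the old baseline, apply the delta,
+then publish the new mapping) can be sketched with an in-memory toy store.
+Everything here is illustrative: `StoreSketch`, the tiny page size, and the
+single-byte "delta" are assumptions, not the patch's data structures.
+
+```c
+#include <assert.h>
+#include <stdint.h>
+#include <string.h>
+
+#define SKETCH_PAGE_SIZE 16
+#define SKETCH_NPAGES    8
+
+typedef uint32_t BlockNumber;
+
+typedef struct StoreSketch
+{
+    uint8_t     pages[SKETCH_NPAGES][SKETCH_PAGE_SIZE];  /* physical pages */
+    BlockNumber map[SKETCH_NPAGES];                       /* lblk -> pblk */
+} StoreSketch;
+
+/* Replay one no-image remap: old baseline plus delta, then publish. */
+static void
+sketch_redo_remap(StoreSketch *s, BlockNumber lblk,
+                  BlockNumber old_pblk, BlockNumber new_pblk,
+                  uint32_t off, uint8_t value)
+{
+    uint8_t page[SKETCH_PAGE_SIZE];
+
+    assert(old_pblk < SKETCH_NPAGES && new_pblk < SKETCH_NPAGES);
+    assert(old_pblk != new_pblk && off < SKETCH_PAGE_SIZE);
+
+    memcpy(page, s->pages[old_pblk], SKETCH_PAGE_SIZE);  /* 1. old baseline */
+    page[off] = value;                                   /* 2. WAL delta */
+    memcpy(s->pages[new_pblk], page, SKETCH_PAGE_SIZE);  /* 3. new page */
+    s->map[lblk] = new_pblk;                             /* 4. publish last */
+}
+```
+
+Note that the old physical page is only read, never modified, so replaying the
+same record again after a second crash starts from the same baseline.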
+
+## 7. Foreground Policy: Allocate New Pages, Do Not Dispose of Old Pages
+
+If the goal is to make the ordinary checkpoint-boundary first-dirty path
+lighter, then immediate old-page reclaim, reusable-page search, and synchronous
+space cleanup should not be pushed back into the foreground path.
+
+The foreground tradeoff in Umbra is:
+
+- the foreground always takes a new physical page
+- the old physical page is not immediately reused in the foreground path; the
+ foreground only publishes the new mapping and does not dispose of the old page
+- the foreground does not perform immediate space cleanup
+
+That is, foreground allocation is closer to a monotonically advancing frontier.
+Old pages are not rewritten in the hot path, and the foreground does not tidy
+them up opportunistically.
+
+This does not mean the system never processes old pages. It means the
+foreground does not own old-page disposal or long-term space convergence. Later
+cleanup, reclaim, and unlink are background policy decisions. Under high
+capacity pressure, the foreground may still trigger one-shot preallocation, but
+that does not move long-term cleanup back into the hot path.
+
+## 8. MAP Buffer and Mapwriter: Keep New Complexity in MAP Metadata
+
+Once logical pages are split from physical pages, MAP metadata becomes durable
+metadata in its own right. It needs:
+
+- its own cache
+- its own I/O state
+- its own flush and extension maintenance path
+
+Therefore, in addition to PostgreSQL's existing data-page buffer pool, Umbra
+adds a buffer cache dedicated to MAP metadata. This double-buffering shape
+cannot be completely removed, but the new buffering is mostly contained in the
+MAP metadata layer rather than spreading into a second generic data-page cache.
+
+More directly: `mapwriter` can be viewed as a MAP-metadata background writer
+modeled after PostgreSQL `bgwriter`, with one extra duty: physical
+preallocation for mapped forks near the low-water mark. The current contract is
+roughly:
+
+- ordinary MAP pages are materialized on first dirty, not later by mapwriter or
+ checkpoint
+- checkpoint and mapwriter only flush ordinary MAP metadata blocks that already
+ exist
+- superblock flush is still owned by checkpoint
+- mapwriter owns ordinary MAP flush
+- mapwriter also owns background preallocation / physical capacity expansion
+- when low-water pressure becomes severe, the foreground may still perform
+ one-shot preallocation
+
+Therefore, mapwriter can be understood as "`bgwriter` for MAP metadata plus a
+physical-capacity preallocator". It is not the owner of logical EOF, not the
+owner of superblock checkpoint flush, and not a generic data-page writer.
+
+## 9. Compactor: Long-Term Space Convergence
+
+The benefit of monotonic foreground allocation is a simple hot path. The cost
+is that physical layout becomes sparse over time. Opportunistic reclaim alone
+is not enough for long-term space convergence, so a background process is needed
+to move live pages out of sparse extents / segments and eventually create
+conditions for reclaim and segment unlink.
+
+That process is the compactor. It is not the crash-recovery core, but it is the
+background mechanism that makes "foreground only publishes new mappings and does
+not dispose of old pages" sustainable over time.
+
+Compactor and reclaim are not synonyms. Compactor scans, chooses candidate
+extents, relocates live pages, and advances the reclaim boundary when
+conditions allow. Reclaim is the later lifecycle action: only when a segment is
+below the reclaim boundary and has no live mapping references does the physical
+unlink get handed to the sync-request / checkpointer path.
+
+The current compactor's first goal is not "clean as aggressively as possible".
+It is "avoid interfering with the foreground". It is a best-effort, bounded,
+back-off-on-contention background organizer, not an aggressive reclaim sweeper.
+
+At a high level, it works like this:
+
+- scan MAP and count live block density by extent to find low-live-ratio
+ candidate regions
+- only process regions below the reclaim boundary and explicitly avoid the
+ current physical tail
+- relocate live pages in candidate extents by switching their mappings to new
+ physical pages
+- after an extent / segment becomes empty, defer real reclaim / unlink to the
+ later queue
+
+From the non-interference perspective, the important constraints are:
+
+- when foreground allocation pressure rises, compactor skips the round instead
+ of competing with foreground work
+- each round processes only a bounded number of relations and relocation moves
+- hot-path locking uses conditional acquire heavily; if a superblock or MAP
+ buffer is busy, the current implementation tends to skip rather than wait
+- relocation commits only if the old mapping is still the current published
+ truth; if the foreground already won an update, compactor abandons that move
+- real segment unlink is not executed synchronously by compactor; it is deferred
+ through the reclaim / sync-request path
+
+The result is that compactor behaves more like "gently yielding to foreground
+work" than "maximizing background cleanup throughput". Its first job is to
+avoid noticeably slowing foreground allocation and writes.
+
+## 10. Inflight / Barrier: Foreground-Background Migration Concurrency
+
+Once both foreground code and compactor can migrate the same logical page, the
+system needs shared state to describe that a migration for this logical block is
+already in progress. That is the role of inflight / barrier.
+
+This explicit mechanism exists because compactor relocation is currently a raw
+physical copy, not a shared-buffer-aware copy. Without extra serialization,
+foreground remap, background relocation, and physical writes could conflict
+around the same `lblk`.
+
+Inflight / barrier is not a space-management policy. It is the concurrency
+control for migration publication:
+
+- prevent concurrent publication of multiple new mappings for the same `lblk`
+- make the loser wait for stable committed MAP truth instead of borrowing
+ someone else's owner-local target
+- reduce foreground/background conflicts to owner / claim / barrier semantics
+
+The more precise split is:
+
+- `map entry + superblock + WAL` define the durable state truth that crash
+ recovery must restore
+- inflight / barrier defines how runtime code safely publishes those state
+ changes
+
+The former answers "what must redo ultimately restore"; the latter answers "who
+may publish this change during concurrent execution".
+
+## 11. File Deletion and Segment Lifecycle
+
+File deletion in Umbra is not a normal unlink problem. It is about when a
+segment lifecycle reaches a safe deletion boundary.
+
+There are at least two cases:
+
+- truncate / drop driven deletion
+- reclaim deletion triggered after compactor cleanup
+
+The stable external contract is not the complete pending-state rule set. It is
+the following boundary:
+
+- the superblock maintains the reclaim boundary
+- compactor uses published live mappings and live-map scans to decide whether a
+ candidate region still has live pages, and moves those live pages away
+- reclaim registers later physical unlink only when a segment is below the
+ reclaim boundary and has no live mapping references
+- real physical unlink is deferred through PostgreSQL's sync-request /
+ checkpointer path
+- redo must accept this lifecycle boundary instead of deciding only from "is the
+ file empty now"
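+
+The reclaim rule in the list above can be written as a single condition. This
+is only a sketch: the symbols $S$ (the set of physical blocks in a segment),
+$B$ (the reclaim boundary), and $M$ (the published `lblk -> pblk` mapping) are
+notation introduced here, and reading "below the reclaim boundary" as "every
+block of the segment is below $B$" is an assumption, not a statement about the
+implementation:
+
+```latex
+\text{register\_unlink}(S) \;\Longrightarrow\;
+  \bigl(\forall p \in S:\; p < B\bigr)
+  \;\land\;
+  \bigl(S \cap \{\, M(l) : l \text{ is a live logical block} \,\} = \emptyset\bigr)
+```
+
+The implication runs in one direction only: satisfying the right-hand side
+makes a segment eligible, while the physical unlink itself is still deferred
+to the sync-request / checkpointer path.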
+
+Inflight / pending state still affects internal correctness, but it is better
+treated as an implementation detail rather than the main criterion to explain in
+community-facing text. The key external points are:
+
+- unlink is not "delete when empty"
+- unlink is constrained by reclaim boundary, live mappings, checkpoint, and redo
+ semantics
+
+## 12. Current Verification Status
+
+The current PoC verification target is: this owner / recovery model is
+executable on the covered paths. It is not a proof that all boundaries are
+exhaustively covered.
+
+The basic verification already includes:
+
+- `make check` in `md` mode
+- `src/test/recovery check` in `md` mode
+- `make check` in `Umbra` mode
+- `src/test/recovery check` in `Umbra` mode
+
+Umbra-specific recovery TAP tests further cover topics directly related to
+this design story:
+
+- MAP superblock / map fork policy / mapwriter activity
+- truncate / remap / 2PC remap / skip-WAL dense map redo
+- reclaim / internal segment unlink / compactor relocation
+- range remap zeroextend / ordinary slim block remap / compact birth block remap
+
+These tests support this statement:
+
+- the PoC is no longer only a design sketch; on the currently covered paths, it
+ has a minimal compile / regression / recovery loop
+
+But the following points cannot be claimed as fully closed based only on the
+current tests:
+
+- stronger proof and coverage for primary/standby physical-page alignment when
+ checkpoint cadence differs
+- `CREATE DATABASE` copy strategy: `FILE_COPY` is supported; `WAL_LOG` is not
+ supported yet, and with the Umbra storage manager enabled it falls back to
+ `FILE_COPY`
+- explicit `range-born / batch mapping publish` owner model
+- dedicated verification for internal metadata fork / MAP fork crossing
+ `RELSEG_SIZE` segment boundaries, for example beyond `1GB`
+- more complete native AIO paths and more rigorous stress-test methodology
+
+## 13. Performance Observations
+
+This section provides directional performance signals for the current PoC. It
+does not attempt to make a strict benchmark claim. The methodology is still
+thin: it lacks complete hardware details, repeated runs, error / variance
+ranges, and ablation for individual mechanisms. The data below is better read
+as a directional observation, not as a formal community performance conclusion.
+
+The safer performance story should not treat `md + fpw=off` as a
+correctness-equivalent baseline. The fair default baseline is:
+
+- `md + fpw=on`
+
+The following configuration is better treated as a mechanism upper bound /
+sensitivity point:
+
+- `md + fpw=off`
+
+On `master`, under the same workload, we compared three modes:
+
+- `md + fpw=on`
+- `md + fpw=off`
+- `Umbra + fpw=on`
+
+Common settings:
+
+- `checkpoint_timeout = 2min`
+- `max_wal_size = 20GB`
+- `shared_buffers = 50GB`
+- `logging_collector = on`
+- `runMins = 10`
+- `newOrderWeight = 45`
+- `paymentWeight = 43`
+- `deliveryWeight = 4`
+- `stockLevelWeight = 4`
+- `orderStatusWeight = 4`
+
+Raw throughput results are listed first.
+
+`checksum=off`
+
+| clients | `md + fpw=on` | `md + fpw=off` | `Umbra + fpw=on` |
+| --- | ---: | ---: | ---: |
+| 10 | 158709 | 154283 | 155781 |
+| 50 | 577005 | 626954 | 656353 |
+| 200 | 641899 | 981436 | 995635 |
+| 500 | 322660 | 943295 | 859058 |
+| 1000 | 275609 | 899631 | 729989 |
+
+`checksum=on`
+
+| clients | `md + fpw=on` | `md + fpw=off` | `Umbra + fpw=on` |
+| --- | ---: | ---: | ---: |
+| 10 | 155754 | 152025 | 150606 |
+| 50 | 601974 | 635597 | 650844 |
+| 200 | 621176 | 1015923 | 938311 |
+| 500 | 316950 | 972795 | 729801 |
+| 1000 | 282713 | 891770 | 674865 |
+
+For WAL volume, under the same transaction count, the ratio
+`WAL(md + fpw=on) / WAL(Umbra + fpw=on)` is:
+
+`checksum=on`
+
+| clients | `WAL(md + fpw=on) / WAL(Umbra + fpw=on)` |
+| --- | ---: |
+| 10 | 1.82 |
+| 50 | 2.11 |
+| 200 | 3.81 |
+| 500 | 4.58 |
+| 1000 | 4.87 |
+
+`checksum=off`
+
+| clients | `WAL(md + fpw=on) / WAL(Umbra + fpw=on)` |
+| --- | ---: |
+| 10 | 2.03 |
+| 50 | 2.51 |
+| 200 | 5.22 |
+| 500 | 6.90 |
+| 1000 | 6.55 |
+
+These numbers show more directly that under the same transaction count, Umbra
+does not only recover throughput lost to ordinary checkpoint-boundary FPW; it
+also substantially reduces the corresponding WAL-volume pressure. The gap
+generally widens as concurrency rises, although with `checksum=off` the ratio
+dips slightly from 6.90 at 500 clients to 6.55 at 1000 clients.
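+
+As a reading aid, a ratio $r$ from the tables above can be converted into the
+share of WAL that Umbra still writes relative to `md + fpw=on`. The symbols
+$r$ and $W$ are introduced here only for illustration; the inputs are table
+values and the results are rounded:
+
+```latex
+\begin{aligned}
+\frac{W_{\text{Umbra}}}{W_{\text{md}}} &= \frac{1}{r},
+  \qquad \text{reduction} = 1 - \frac{1}{r} \\
+r = 1.82 &\;\Rightarrow\; 1/r \approx 0.549 \quad (\approx 45.1\%\ \text{less WAL}) \\
+r = 4.87 &\;\Rightarrow\; 1/r \approx 0.205 \quad (\approx 79.5\%\ \text{less WAL}) \\
+r = 6.90 &\;\Rightarrow\; 1/r \approx 0.145 \quad (\approx 85.5\%\ \text{less WAL})
+\end{aligned}
+```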
+
+From the raw numbers, `Umbra + fpw=on` shows a clear improvement over
+`md + fpw=on` from 50 clients upward:
+
+- with `checksum=off`, the improvement at 50 / 200 / 500 / 1000 clients is
+ about `+13.8% / +55.1% / +166.2% / +164.9%`
+- with `checksum=on`, the improvement at 50 / 200 / 500 / 1000 clients is about
+ `+8.1% / +51.1% / +130.3% / +138.7%`
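+
+These percentages follow directly from the raw throughput tables. As a worked
+check, with $T$ used here only as shorthand for a table entry, the 500-client
+`checksum=off` figure is:
+
+```latex
+\text{improvement}
+  = \frac{T_{\text{Umbra + fpw=on}}}{T_{\text{md + fpw=on}}} - 1
+  = \frac{859058}{322660} - 1
+  \approx 2.662 - 1
+  = 1.662 \quad (+166.2\%)
+```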
+
+At 10 clients, all three results are within a few percent of each other. That
+looks more like low-concurrency noise or a non-FPW-dominated region. The gap
+opens from 50 clients upward, where ordinary checkpoint-boundary I/O cost starts
+to accumulate repeatedly.
+
+At the same time, `Umbra + fpw=on` is close to the `md + fpw=off` upper bound
+up to 200 clients but falls clearly short of it at 500 and 1000 clients:
+
+- with `checksum=off`, Umbra matches or nominally exceeds `md + fpw=off` at 50
+ and 200 clients, but remains clearly behind it at 500 and 1000 clients
+- with `checksum=on`, the shortfall is more visible and already appears at 200
+ clients, suggesting that Umbra recovers much of the ordinary-FPW-related I/O
+ cost but does not eliminate all remaining system cost
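+
+The distance to the `md + fpw=off` reference can be quantified from the same
+tables as the fraction of that throughput which Umbra reaches. The notation
+$f$ is introduced here for illustration, and the values are rounded:
+
+```latex
+\begin{aligned}
+f &= \frac{T_{\text{Umbra + fpw=on}}}{T_{\text{md + fpw=off}}} \\
+\text{checksum=off:}\quad
+  f_{500} &= \frac{859058}{943295} \approx 0.911, &
+  f_{1000} &= \frac{729989}{899631} \approx 0.811 \\
+\text{checksum=on:}\quad
+  f_{200} &= \frac{938311}{1015923} \approx 0.924, &
+  f_{500} &= \frac{729801}{972795} \approx 0.750, &
+  f_{1000} &= \frac{674865}{891770} \approx 0.757
+\end{aligned}
+```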
+
+The current data supports only a qualitative conclusion:
+
+- Umbra recovers a large part of the throughput lost by `md + fpw=on` on the
+ ordinary FPW path; the safer interpretation is recovery of related I/O cost
+- this benefit is visible with both `checksum=on` and `checksum=off`
+- `md + fpw=off` should only be treated as a sensitivity reference for "where
+ the system upper bound might be if this FPW cost is removed", not as a
+ semantic peer baseline
+
+The data is not enough for fine-grained attribution, such as "which foreground
+hot path contributes exactly how much" or "which mechanism is consistently the
+main source of benefit". It supports only a coarser statement:
+
+- a large part of the benefit is related to ordinary checkpoint-boundary FPW
+ being taken over by remap metadata, thereby recovering related I/O cost
+
+Further attribution would require dedicated ablation and a fuller methodology
+for WAL write / sync pressure, data write amplification, preallocation, and
+other sub-mechanisms.
+
+## 14. Open Engineering Work
+
+The following items should be described as follow-up work, not as completed
+capabilities:
+
+1. `compactor` engineering: the framework exists, but background convergence
+ efficiency and directory-discovery cost control are not fully engineered; the
+ sparse-segment discovery cost problem should not be described as solved; for
+ the PoC, this is engineering follow-up and does not block the minimal
+ remap/recovery loop.
+2. `CREATE DATABASE` copy strategy: PostgreSQL's existing `FILE_COPY` directory
+ / file copy path is supported; `WAL_LOG`, which copies database contents
+ block-by-block and logs each block to WAL, is not supported yet. With the
+ Umbra storage manager enabled, it falls back to `FILE_COPY`. This is an
+ explicit limitation and should not be overclaimed.
+3. `superblock shared-entry replacement`: the current shape is still closer to
+ allocate/free than replacement/eviction. In practice, capacity pressure is
+ mitigated by increasing `map_superblocks`. This remains engineering
+ follow-up.
+4. `AIO` integration: the necessary adaptation exists, but it is not a complete
+ Umbra-native rewrite. The async I/O side should not be described as fully
+ closed.
+5. `range-born / batch mapping publish`: there is no explicit upper-layer
+ interface yet, so the current implementation mainly uses conservative `smgr`
+ fallback. Multi-block extension still depends on compatibility with older
+ AM/WAL ordering.
+6. `primary/standby physical-page alignment`: the issue is identified. The
+ current implementation adds a stronger publication / flush constraint such
+ as `FlushOneBuffer()` on the local no-image remap redo path and has local
+ recovery coverage, but primary/standby physical-page alignment should not be
+ described as systematically closed.
+
+Compressed into one sentence: the current PoC has established the core
+correctness / recovery loop and builds, passes regression, and recovers on the
+covered paths, but it still carries explicitly marked host-tree follow-up work.
+
+## 15. Semantic Layers of the Current PoC
+
+This section only describes the semantic boundaries that each layer should own
+if the current PoC is organized as `P1-P9`. It is not an exact mapping from
+arbitrary working branches to commit numbers, and it does not describe release
+cadence.
+
+The split should not be understood as a mechanical directory split. It is a
+state-machine and owner-boundary split: earlier layers establish the minimal
+recoverable mechanism, while later layers add checkpoint, mapwriter, compactor,
+and other engineering capabilities.
+
+The purpose is for each layer to state which correctness owner or engineering
+boundary it introduces:
+
+- earlier layers should mostly establish base mechanisms, without mixing in
+ later engineering follow-up
+- later layers introduce WAL/redo, checkpoint, mapwriter, compactor, recovery
+ tests, and related capabilities
+- incomplete parts should be explicitly marked as follow-up rather than implied
+ as complete
+
+For the current PoC branch, a natural semantic split is `P1-P9`:
+
+- `P1`: establish the `smgr` implementation boundary, add the `--with-umbra`
+ choice point, and keep the ordinary `md` path unchanged
+- `P2`: introduce the `umfile` physical file layer and metadata storage
+ primitives; cover physical files, segments, create / unlink, read / write /
+ extend / truncate
+- `P3`: introduce the metadata disk format and identity-mapping bootstrap so
+ that metadata fork, superblock layout, and initial mapping state stand on
+ their own
+- `P4`: introduce the shared-memory MAP cache and checkpoint flush foundation,
+ so MAP metadata cache, materialization, dirty, and flush semantics stand on
+ their own
+- `P5`: introduce MAP access policy, logical-to-physical translation, and the
+ materialization contract, so `MAIN/FSM/VM` keep logical block numbers above
+ `smgr` while resolving `lblk -> pblk` below it
+- `P6`: introduce WAL records, mapped birth, and the redo state machine; build
+ the minimal WAL/redo owners for `MAP_SET`, truncate, metadata lifecycle, and
+ skip-WAL pending
+- `P7`: introduce ordinary remap, block-reference remap, and
+ checkpoint-boundary FPW replacement, closing the alternative representation
+ for the ordinary checkpoint-boundary image path
+- `P8`: add checkpoint / mapwriter writeback and physical preallocation, giving
+ clear owners to MAP metadata writeback, background preallocation, and one-shot
+ foreground preallocation under low-water pressure
+- `P9`: introduce the compactor framework and foreground non-interference
+ policy, converging inflight / barrier, reclaim, delayed unlink, and compactor
+ relocation into a background organization framework
+
+The point of this order is that earlier layers build the correctness owner
+model, while later layers handle engineering pressure. Host-tree integration
+points such as `CREATE DATABASE` copy strategy, AIO, and primary/standby
+physical-page alignment should not be hidden as if the core mechanism already
+closed them. They are better described as explicit follow-up.
+
+Thus, `P1-P9` expresses only the semantic boundary of the current PoC: which
+parts belong to the minimal correctness loop, which parts are engineering
+enhancements, and which parts are still compatibility fallback or follow-up.
+Tests and documentation should also belong to their related semantic layer,
+instead of being flattened into a generic "test / documentation layer".
+
+## 16. Summary
+
+Umbra does not claim that PostgreSQL no longer needs full-page images. It also
+should not be broadly described as a new storage engine. More accurately, it is
+a remap-based recovery-baseline representation for the ordinary
+checkpoint-boundary FPW path at the PostgreSQL `storage manager` / physical
+storage layer. Its core is:
+
+- split logical page identity from physical placement in the storage layer
+- use map entries for page-level mapping truth
+- use the superblock for fork-level global truth
+- use WAL to publish state changes atomically and recoverably
+
+On top of that:
+
+- the foreground always allocates a new physical page and does not dispose of old
+ pages in the hot path
+- mapwriter mainly smooths MAP metadata writeback and physical preallocation in
+ the background, rather than owning all expansion
+- compactor owns long-term space convergence
+- inflight / barrier keeps foreground/background migration and physical writes
+ concurrency-safe
+- file deletion and segment lifecycle are constrained by reclaim boundary, live
+ mappings, checkpoint, and redo semantics
+
+The value of this design is not a single local optimization and not simply
+"turning FPW off". It challenges the cost model bound to ordinary
+checkpoint-boundary FPW while trying to keep the crash-recovery semantics
+required by `md + fpw=on`.
+
+## Appendix: Implementation Transparency
+
+The implementation process should be transparent. Umbra's core architecture,
+boundary definitions, and key state-machine reasoning come from the author's own
+design and prototyping work around PostgreSQL storage / WAL / recovery
+semantics. The author also maintains the early `shadow` validation prototype:
+<https://github.com/nayishan/postgre_umbra/tree/shadow-pg12-archive>.
+
+To expand the prototype into the current PoC, the author used AI coding
+assistants such as Codex extensively for concrete implementation, boilerplate
+expansion, and local refactoring. That work depends heavily on the prior logic
+analysis, the `shadow` prototype, and the structure and call order of the
+existing PostgreSQL implementation.
+
+The responsibility boundary is also important: core design, boundary definition,
+and key logic decisions are the author's responsibility; AI mainly accelerates
+tedious implementation details. Current AI systems still cannot independently
+reason about database-kernel concurrency timing, owner models, or
+crash-recovery semantics. Some areas may therefore still show style
+inconsistency or require further engineering convergence. The current status is
+PoC, not a finished product with final host-tree polish.
diff --git a/doc/umbra/UMBRA_FPW_STORY_ZH.md b/doc/umbra/UMBRA_FPW_STORY_ZH.md
new file mode 100644
index 0000000000..f42a42a334
--- /dev/null
+++ b/doc/umbra/UMBRA_FPW_STORY_ZH.md
@@ -0,0 +1,500 @@
+# Umbra 用 remap 替代 ordinary checkpoint-boundary FPW 路径的中文说明
+
+[English](./UMBRA_FPW_STORY.md)
+
+## 1. 背景:Umbra 挑战的是哪一段 FPW 成本
+
+PostgreSQL 当前依赖 full-page writes(FPW)来保证崩溃恢复正确性。其基本做法是:在 checkpoint 之后,某个页面第一次被修改时,把整页镜像写入 WAL,用它作为新的恢复基线。
+
+Umbra 当前实现挑战的,不是“全局取消所有 full-page image”,而是 ordinary checkpoint-boundary 这条默认 image 路径。对满足自动 remap 条件的普通数据页 WAL 记录,Umbra 用 remap-aware recovery metadata 取代默认 full-page image;但以下保守路径仍然保留 image 语义:
+
+- `REGBUF_FORCE_IMAGE`
+- `XLR_CHECK_CONSISTENCY`
+- `XLOG_FPI_FOR_HINT`
+
+因此,Umbra 当前更准确的目标是:
+
+- 不去宣称“所有 FPW 都消失了”
+- 而是为 ordinary checkpoint-boundary case 提供另一种恢复基线表达方式
+
+这套做法之所以值得尝试,是因为 stock md 的 ordinary checkpoint-boundary image path 确实绑定了几类反复出现的成本:
+
+- checkpoint 边界后的 first-dirty 路径会引入一条额外的 owner path,但更值得强调的是它背后绑定的 I/O 成本
+- WAL 会因为 full-page image 明显膨胀,从而带来更高的 WAL 写入与同步压力
+- 数据文件侧的写放大和 WAL 侧的写放大会叠加出现
+- 在更新密集型 workload 下,这类 I/O 压力会在每个 checkpoint 区间反复出现
+
+Umbra 的目标,不是对这条路径做局部微调,而是尝试用不同的恢复基线表达方式来接管它。
+
+## 2. 当前范围、非目标和术语约定
+
+下面先把当前范围、非目标和术语约定说清楚。
+
+当前范围是:
+
+- 目标是用 remap-based recovery metadata 接管 ordinary checkpoint-boundary FPW 的默认 image 路径
+- 讨论对象是 PostgreSQL `storage manager` / 物理存储层原型,而不是新的 table AM 或通用“存储引擎”
+- 本文里的 `P1-P9` 用来描述当前 PoC 分支的语义拆分,不用来描述任意工作分支与 patch 编号的一一对应关系
+
+当前非目标或保守保留边界是:
+
+- 不宣称取消所有 full-page image;`REGBUF_FORCE_IMAGE`、`XLR_CHECK_CONSISTENCY`、`XLOG_FPI_FOR_HINT` 等路径仍保留 image owner
+- 不把 compactor、AIO、主备物理页一致性、`CREATE DATABASE` 复制路径、显式 range-born 协议说成已经工程化收敛的能力
+- 不把 `md + fpw=off` 当成 correctness-equivalent baseline
+
+本文中几个高频术语的约定是:
+
+- `birth`:一个逻辑页第一次获得持久 `lblk -> pblk` 关系
+- `remap`:已有逻辑页切换到新的物理页
+- `mapset`:直接发布一条映射关系
+- ordinary checkpoint-boundary FPW:checkpoint 之后 ordinary first-dirty 默认走 image 的那类路径
+- reclaim boundary:可以安全推进 reclaim / unlink 的物理边界
+- `compactor`:后台整理器,负责扫描稀疏区域并把仍然 live 的页面搬走
+- `reclaim`:生命周期动作,负责在某段物理空间确认没有 live mapping 后进入
+ 安全删除 / unlink 路径
+
+## 3. 设计边界:崩溃恢复核心收敛在 storage metadata 和 WAL
+
+Umbra 的实现范围是刻意克制的。它不试图重写 PostgreSQL 的执行层,也不希望把变更扩散到更高层的大面积抽象中。
+
+更准确地说,当前实现讨论的不是一个新的 table AM,也不是独立于 PostgreSQL 的通用“存储引擎”;它更接近 PostgreSQL `storage manager` / 物理存储层上的一个原型。上层仍然使用逻辑块号,而 Umbra 在 `smgr` 之下为 mapped fork 提供 `lblk -> pblk` 的翻译。
+
+从 crash recovery 的角度看,它的 correctness core 主要收敛在两层:
+
+- storage metadata
+- WAL
+
+其中,storage metadata 又分成两类对象:
+
+- per-page map entry
+- fork-level superblock
+
+这里说的 crash-recovery core,不是指只有这三个模块参与恢复,而是指崩溃后
+redo 要恢复出正确页面内容时,最小的持久事实来自三类信息:
+
+- map entry:说明某个逻辑块当前应该对应哪个物理块
+- superblock:说明这个 fork 的全局边界状态,例如 logical EOF、physical
+ capacity、已提交 allocator frontier
+- WAL:说明哪些 map entry / superblock / 物理页生命周期变化已经被原子发布,
+ 以及 redo 必须按什么顺序重放它们
+
+只要这三类事实在 redo 中保持一致,ordinary remap 更新就可以按“旧物理页 +
+WAL delta”恢复,而不需要 checkpoint-boundary full-page image 作为恢复基线。
+
+但如果讨论的是运行时并发正确性,当前实现还显式依赖一层额外机制:
+
+- inflight claim / barrier
+
+它不属于 WAL 编码本身,也不是恢复日志里的持久真相;它负责把前台 remap、
+后台 compactor relocation、以及物理写入之间的发布顺序串行化,保证之后写入
+WAL 的持久事实本身是成立的。因此更准确的说法是:
+
+- crash-recovery core 的持久事实主要来自 `map entry + superblock + WAL`
+- runtime concurrency correctness 还显式依赖 inflight / barrier
+
+## 4. map entry:负责单页映射真相
+
+Umbra 的核心抽象,是把逻辑页身份和物理放置拆开。
+
+在这套模型下:
+
+- 逻辑页代表上层真正关心的数据页身份
+- 物理页只是当前承载该逻辑页内容的落盘位置
+- map entry 负责记录当前 `lblk -> pblk` 的对应关系
+
+因此,单个页面当前映射到哪个物理页,由 map entry 给出。它解决的是单页级别的局部真相。
+
+## 5. superblock:负责 fork 级全局真相
+
+仅有 per-page map entry 还不够,因为很多正确性并不是单个映射能表达的,而是整个 fork 的边界状态决定的。这部分由 superblock 承担。
+
+superblock 是 Umbra 里的 fork-level correctness anchor,负责维护至少以下几类状态:
+
+- 当前 fork 的逻辑边界,例如 `logical_nblocks`
+- 已经提交的物理分配边界,例如 `next_free_pblkno`
+- 当前已经 materialized 到哪里的物理容量状态
+- reclaim / unlink 可以推进到哪条安全边界
+- redo 需要恢复的 fork-level frontier facts
+
+这里需要额外强调一个容易混淆的点:运行时还需要一个只存在于
+`MapSuperEntry` shared state 里的 reservation frontier,用来给并发前台分配
+新 `pblk`。这个 frontier 不落盘、不参与 checkpoint;落盘进入 superblock 的
+`next_free_pblkno` 只能表示 committed frontier。实现上应满足并显式断言:
+`committed next_free <= reservation frontier`。
+
+换句话说:
+
+- map entry 负责单页映射真相
+- superblock 负责全局边界真相
+
+很多 extend、truncate、reclaim、unlink、redo 相关的正确性,最终都依赖 superblock 才能成立。
+
+## 6. WAL:负责动作的原子发布与可恢复性
+
+storage metadata 负责描述状态本身,但仅有状态还不够。Umbra 要接管 ordinary checkpoint-boundary image path,就必须保证以下动作可以被原子地发布,并在 redo 中被一致地重建:
+
+- birth
+- remap
+- mapset
+- 与之相关的 committed frontier、logical size、capacity 等 fork-level 状态推进
+
+这就是 WAL 层存在的理由。
+
+对被 Umbra 接管的 ordinary checkpoint-boundary 路径来说,恢复正确性不再依赖“这条记录是否携带 full-page image”,而是依赖:
+
+- map entry 和 superblock 描述的状态是否正确
+- 这些状态变化是否被 WAL 作为原子事件记录并重放
+
+但这不应被扩写成“Umbra 的所有恢复路径都不再依赖 image”。保守 image owner 仍然存在,上面列出的几类路径仍保持 PostgreSQL 原有的 image 语义。
+
+### 6.1 remap 前旧物理页的生命周期
+
+ordinary remap 里的旧物理页不是一个可以立刻丢掉或复用的临时页。它在 remap
+发生前,是 map entry 当前指向的 committed physical baseline,也是无 image
+delta redo 可能需要读取的旧基线。
+
+这条生命周期可以按下面几步理解:
+
+- remap 发布前:`old_pblk` 仍然是 `lblk` 的当前持久映射;即使 backend 已经
+ 选出了 `new_pblk`,也不能把 `old_pblk` 当成空闲页处理
+- WAL insert 成功后:remap record 把 `old_pblk -> new_pblk` 的转换作为原子
+ 事件发布;正常运行时 map entry 会切到 `new_pblk`
+- crash recovery 时:如果这是 no-image remap,redo 先通过 `old_pblk` 读取旧
+ 物理页,把 WAL delta 作用在这个旧基线上,然后再发布 `new_pblk`
+- remap 发布后:`old_pblk` 不再是该逻辑页的当前映射,但它也不会进入前台
+ 复用路径;它只能作为后台空间整理的候选对象,受 live mapping、reclaim
+ boundary、checkpoint 和 redo 语义共同约束
+
+所以,Umbra 不是把“旧页覆盖写”改成“旧页立即复用”。它真正改变的是恢复基线:
+旧物理页在 WAL record 需要时仍然保留为可读取基线,新物理页则通过 remap
+成为新的当前映射。
+
+## 7. 前台策略:关键路径只分配新页,不处置旧页
+
+如果目标是把 ordinary checkpoint-boundary 的 first-dirty 路径变轻,就不应该再把“即时回收旧页、寻找可复用页、同步整理空间”这些工作塞回前台。
+
+Umbra 在前台路径上的取舍是:
+
+- 前台总是拿新物理页
+- 旧物理页在前台路径上不会被即时复用;前台只发布新映射,不负责处置旧页
+- 前台不负责即时空间整理
+
+也就是说,前台更接近单调前进的 frontier 分配。旧页不会在热路径上被重新拿来写,前台也不会顺手把旧页整理掉。
+
+这不是说系统永远不处理旧页,而是说前台不承担旧页处置和空间收敛。后续是否整理、何时 reclaim / unlink,由后台策略决定。在容量压力较高时,前台仍可能触发一次 one-shot preallocation,但它不会把长期空间整理重新拉回热路径。
+
+## 8. MAP buffer 和 mapwriter:新增复杂度主要被限制在 MAP 元数据层
+
+逻辑页和物理页拆开之后,MAP 元数据成为一等持久元数据。它需要:
+
+- 自己的缓存
+- 自己的 I/O 状态
+- 自己的刷写和扩张维护路径
+
+因此,在 PostgreSQL 原有的数据页 buffer pool 之外,Umbra 额外增加了一套专门服务于 MAP 元数据的 buffer cache。这个双层 buffer 问题不能完全消除,但新增的 buffering 主要被限制在 MAP 元数据层,而不是扩散成第二套通用数据页缓存。
+
+这里也可以更直白地描述:`mapwriter` 可以看成是仿照 PostgreSQL `bgwriter` 的一套 MAP 后台写回机制,但它比 `bgwriter` 额外多承担了一项工作:在低水位附近为 mapped fork 做后台 physical preallocation。当前代码里的 contract 更接近:
+
+- ordinary MAP page 在 first dirty 时 materialize,而不是等 mapwriter / checkpoint 再去创建物理块
+- checkpoint 和 mapwriter 只刷“已经存在”的 ordinary MAP metadata block
+- superblock flush 仍由 checkpoint 拥有
+- mapwriter 负责 ordinary MAP flush
+- mapwriter 还负责后台 preallocation / physical capacity 扩张
+- 前台在低水位压力过高时,仍可能自己做一次 one-shot preallocation
+
+因此,mapwriter 可以被理解成“MAP 元数据层上的 `bgwriter` + 物理容量预分配器”。但它不是 logical EOF 的 owner,不负责 superblock 的 checkpoint flush,也不是普通数据页写线程。
+
+## 9. compactor:负责长期空间收敛
+
+前台采用单调前进分配的好处是热路径简单,代价是物理布局会随着时间推移逐渐稀疏化。仅靠“随手回收”并不足以让长期空间占用收敛,因此需要一个后台进程负责把 live page 从稀疏 extent / segment 里迁走,最终为 reclaim 和 segment unlink 创造条件。
+
+这个进程就是 compactor。它不是 crash-recovery core 本身,但它是让“前台只发布新映射、不处置旧页”这条策略在长期空间占用上可持续的关键后台机制。
+
+这里的 compactor 和 reclaim 不是同义词。compactor 的主要动作是扫描、选择候选
+extent、relocate live page,并在条件满足时推进 reclaim boundary;reclaim 则是
+后续生命周期动作,只有在某个 segment 已经低于 reclaim boundary 且确认没有 live
+mapping 引用时,才把真正的物理 unlink 交给 sync-request / checkpointer 路径。
+
+不过,当前实现里 compactor 的首要目标并不是“尽可能快地清理干净”,而是“尽量不要干扰前台”。更准确地说,它现在是一个 best-effort、bounded、遇忙就退的后台整理器,而不是一个 aggressively reclaim 的空间清扫器。
+
+当前实现里,它大致按下面的顺序工作:
+
+- 先扫 MAP,按 extent 统计 live block 密度,找出 live 比例很低的候选区域
+- 只处理已经落在 reclaim boundary 之下的区域,并显式避开当前物理尾部
+- 对候选 extent 里的 live page 做 relocation,把映射切到新的物理页
+- 当某个 extent / segment 被搬空后,再把真正的 reclaim / unlink 延后交给后续队列处理
+
+如果从“不要干扰前台”这个角度看,当前实现最重要的约束其实是这些:
+
+- 当前台分配压力升高时,compactor 会直接跳过这一轮,而不是继续和前台抢资源
+- 每轮只处理有限数量的 relation,也只允许有限数量的 relocation move,而不是无限制地清理
+- 关键路径上的锁获取大量使用 conditional acquire;遇到正在被别人使用的 superblock 或 MAP buffer,当前实现更倾向于跳过,而不是等待
+- relocation 只有在旧映射仍然是当前已发布真相时才会提交;如果前台已经赢了更新,compactor 就放弃这次搬迁
+- 真正的 segment unlink 也不是 compactor 当场同步执行,而是通过后续 reclaim / sync-request 路径延后处理
+
+这套取舍意味着:当前 compactor 更像“温和地给前台让路”,而不是“最大化后台清理吞吐”。它首先保证前台的分配和写入不被后台整理明显拖慢。
+
+## 10. inflight / barrier:负责前后台迁移并发的一致性
+
+一旦前台和 compactor 都可能迁移同一个逻辑页,就必须有共享状态描述“这个逻辑页的迁移已经在进行中”。这就是 inflight / barrier 的职责。
+
+当前实现里,这层机制之所以是显式的,是因为 compactor relocation 目前仍是 raw physical copy,而不是 shared-buffer-aware copy。没有额外串行化的话,前台 remap、后台 relocation、以及物理写入就可能围绕同一个 `lblk` 发生冲突。
+
+inflight / barrier 解决的不是空间管理策略,而是迁移动作的并发一致性:
+
+- 防止同一个 `lblk` 被并发发布多个新映射
+- 让 loser 等到稳定的 committed MAP truth,而不是借用别人的 owner-local target
+- 把前后台冲突收敛成 owner / claim / barrier 语义
+
+因此,更准确的划分是:
+
+- `map entry + superblock + WAL` 定义 crash recovery 需要恢复的持久状态真相
+- inflight / barrier 定义运行时如何安全地发布这些状态变化
+
+前者解决“redo 最终要恢复什么”,后者解决“并发执行时谁可以发布这个变化”。
+
+## 11. 文件删除与 segment 生命周期
+
+Umbra 里的文件删除不是普通的 unlink 问题,而是 segment 生命周期何时进入安全删除边界的问题。
+
+这至少涉及两类场景:
+
+- truncate / drop 驱动的删除
+- compactor 整理后触发的 reclaim 删除
+
+当前分支中,对外更适合描述的 contract 不是“pending 规则的全部细节”,而是下面这条更稳定的边界:
+
+- superblock 维护 reclaim boundary
+- compactor 通过已发布的 live mapping 和 live-map scan 判断候选区域是否仍有
+ live page,并把 live page 搬走
+- reclaim 在 segment 已经低于 reclaim boundary 且没有 live mapping 引用时,
+ 才注册后续物理 unlink
+- 真正的物理 unlink 通过 PostgreSQL 的 sync-request / checkpointer 路径延后执行
+- redo 侧要能接受这套生命周期边界,而不是只看“文件现在是不是空的”
+
+inflight / pending 状态当然仍然影响内部正确性,但它们更适合被看作实现细节,而不是对社区信件里要展开的主判据。对外最重要的点仍然是:
+
+- unlink 不是“看空即删”
+- unlink 受 reclaim boundary、live mapping、checkpoint 和 redo 语义共同约束
+
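+## 12. Verification status
+
+Verification of the current PoC focuses on showing that "this owner / recovery model is executable on the covered paths", not that "every boundary has been exhaustively proven".
+
+At least the following baseline checks have been covered so far:
+
+- `make check` in `md` mode
+- `src/test/recovery check` in `md` mode
+- `make check` in `Umbra` mode
+- `src/test/recovery check` in `Umbra` mode
+
+Umbra-specific recovery TAP tests additionally cover several topics directly related to the main line of this document, including:
+
+- MAP superblock / map fork policy / mapwriter activity
+- truncate / remap / 2PC remap / skip-WAL dense map redo
+- reclaim / internal segment unlink / compactor relocation
+- range remap zeroextend / ordinary slim block remap / compact birth block remap
+
+These checks best support the following statement:
+
+- this PoC no longer lives only at the design level; on the currently covered paths it forms a minimal closed loop that compiles, passes regression, and recovers
+
+The following points, however, cannot be declared fully settled on the basis of the existing tests alone:
+
+- a systematic proof, and stronger test coverage, for physical-page alignment when the primary and standby checkpoint at different cadences
+- the `CREATE DATABASE` copy strategy: `FILE_COPY` is supported; `WAL_LOG` is not yet supported, and with the
+  Umbra storage manager enabled it falls back to `FILE_COPY`
+- an explicit `range-born / batch mapping publish` ownership model
+- dedicated verification of the internal metadata fork / MAP fork crossing `RELSEG_SIZE` segment boundaries (for example beyond `1GB`)
+- a more complete native AIO path and more methodologically rigorous stress testing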
+
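+## 13. Performance observations (qualitative; not a rigorous benchmark conclusion)
+
+This section offers only qualitative performance signals for the current PoC and does not attempt a rigorous benchmark conclusion. The methodology is still thin: there is no complete hardware description, no repeated runs, no error/variance range, and no ablation of the individual sub-mechanisms. The data below is therefore better read as "directional observations" than as formal results that could directly back a community performance claim.
+
+A sounder performance narrative should not treat `md + fpw=off` as a correctness-equivalent baseline. The fair, semantically equivalent default comparison target is:
+
+- `md + fpw=on`
+
+The following point is better treated as a "mechanism upper bound / sensitivity point":
+
+- `md + fpw=off`
+
+On `master`, under the same workload, we compared three modes:
+
+- `md + fpw=on`
+- `md + fpw=off`
+- `Umbra + fpw=on`
+
+The shared test conditions were: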
+
+- `checkpoint_timeout = 2min`
+- `max_wal_size = 20GB`
+- `shared_buffers = 50GB`
+- `logging_collector = on`
+- `runMins = 10`
+- `newOrderWeight = 45`
+- `paymentWeight = 43`
+- `deliveryWeight = 4`
+- `stockLevelWeight = 4`
+- `orderStatusWeight = 4`
+
+To avoid looking only at percentages, the raw throughput results are listed first.
+
+`checksum=off`
+
+| Concurrency | `md + fpw=on` | `md + fpw=off` | `Umbra + fpw=on` |
+| --- | ---: | ---: | ---: |
+| 10 | 158709 | 154283 | 155781 |
+| 50 | 577005 | 626954 | 656353 |
+| 200 | 641899 | 981436 | 995635 |
+| 500 | 322660 | 943295 | 859058 |
+| 1000 | 275609 | 899631 | 729989 |
+
+`checksum=on`
+
+| Concurrency | `md + fpw=on` | `md + fpw=off` | `Umbra + fpw=on` |
+| --- | ---: | ---: | ---: |
+| 10 | 155754 | 152025 | 150606 |
+| 50 | 601974 | 635597 | 650844 |
+| 200 | 621176 | 1015923 | 938311 |
+| 500 | 316950 | 972795 | 729801 |
+| 1000 | 282713 | 891770 | 674865 |
+
+Looking only at the WAL volume difference between `md + fpw=on` and `Umbra + fpw=on` at the same transaction count, the ratio `WAL(md + fpw=on) / WAL(Umbra + fpw=on)` is:
+
+`checksum=on`
+
+| Concurrency | `WAL(md + fpw=on) / WAL(Umbra + fpw=on)` |
+| --- | ---: |
+| 10 | 1.82 |
+| 50 | 2.11 |
+| 200 | 3.81 |
+| 500 | 4.58 |
+| 1000 | 4.87 |
+
+`checksum=off`
+
+| Concurrency | `WAL(md + fpw=on) / WAL(Umbra + fpw=on)` |
+| --- | ---: |
+| 10 | 2.03 |
+| 50 | 2.51 |
+| 200 | 5.22 |
+| 500 | 6.90 |
+| 1000 | 6.55 |
+
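+These numbers show more directly that, at the same transaction count, Umbra not only recovers the throughput lost to ordinary checkpoint-boundary FPW but also markedly lowers the corresponding WAL volume pressure, and that the gap generally widens as concurrency rises.
+
+Judging from the raw values, the gain of `Umbra + fpw=on` over `md + fpw=on` is clear and consistent from 50 concurrency upward:
+
+- with `checksum=off`, roughly `+13.8% / +55.1% / +166.2% / +164.9%` at 50 / 200 / 500 / 1000 concurrency
+- with `checksum=on`, roughly `+8.1% / +51.1% / +130.3% / +138.7%` at 50 / 200 / 500 / 1000 concurrency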
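
The percentages above follow directly from the raw throughput tables. As a small worked check, the sketch below recomputes the `checksum=off` gains; the function names are illustrative only, and the arrays simply copy the table rows for 50 / 200 / 500 / 1000 concurrency.

```c
#include <assert.h>

/* Relative gain of `value` over `base`, in percent. */
double
pct_gain(double base, double value)
{
    assert(base > 0.0);
    return (value / base - 1.0) * 100.0;
}

/* Raw throughput copied from the checksum=off table (50 / 200 / 500 / 1000). */
static const double md_fpw_on[4] = {577005, 641899, 322660, 275609};
static const double umbra_fpw_on[4] = {656353, 995635, 859058, 729989};

/* Gain of Umbra + fpw=on over md + fpw=on at table row i (0 = concurrency 50). */
double
checksum_off_gain(int i)
{
    assert(i >= 0 && i < 4);
    return pct_gain(md_fpw_on[i], umbra_fpw_on[i]);
}
```

Substituting the `checksum=on` rows in the same way reproduces the second bullet.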
+
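+At 10 concurrency the three results are very close, which looks more like noise in the low-concurrency range or a region not dominated by FPW; the gap really opens up from 50 concurrency upward, where the I/O cost tied to the ordinary checkpoint-boundary path starts to accumulate repeatedly.
+
+At the same time, `Umbra + fpw=on` tracks the `md + fpw=off` reference closely at low and medium concurrency but falls short of it at high concurrency:
+
+- with `checksum=off`, Umbra is at or slightly above the `md + fpw=off` numbers at 50 and 200 concurrency, but clearly trails `md + fpw=off` at 500 and 1000 concurrency
+- with `checksum=on` the effect is more pronounced (Umbra already trails at 200 concurrency), which suggests that Umbra recovers most of the ordinary FPW-related I/O cost but has not yet absorbed all of the remaining system cost
+
+The current data therefore best supports the following qualitative conclusions:
+
+- Umbra clearly recovers most of the throughput that `md + fpw=on` loses on the ordinary FPW path, which is most safely understood as recovering the associated I/O cost
+- the benefit is observable with both `checksum=on` and `checksum=off`
+- `md + fpw=off` can serve only as a sensitivity reference for "roughly where the system's upper bound lies if this class of FPW cost is removed"; it must not be treated as a semantically equivalent baseline
+
+The data also shows that it is premature to give a fine-grained attribution of the absolute gain, such as "a given foreground hot path contributes a fixed amount" or "the main gain always comes from one item". What the current data can support is:
+
+- a large part of the gain is indeed related to the ordinary checkpoint-boundary FPW path being taken over by remap metadata, which recovers the associated I/O cost
+
+Breaking the gain down further into "how much comes from reduced WAL write and sync pressure, how much from reduced data write amplification, and how much from preallocation" would require dedicated ablation and a more complete methodology.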
+
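+## 14. Engineering work not yet finished
+
+The following items are better stated explicitly as follow-ups than implied to be finished capabilities:
+
+1. `compactor` engineering: the framework exists, but background convergence efficiency and control of directory-discovery cost are not fully engineered; externally, the cost of discovering sparse segments must not be described as solved; for the PoC this is an engineering follow-up and does not block the minimal remap/recovery loop.
+2. `CREATE DATABASE` copy strategy: PostgreSQL's existing `FILE_COPY` directory / file copy path is supported; the `WAL_LOG` path, which copies block by block and writes WAL per block, is not yet supported, and with the Umbra storage manager enabled it falls back to `FILE_COPY`; this is an explicit limitation and should not be overstated.
+3. `superblock shared-entry replacement`: currently closer to allocate/free than to replacement/eviction; in practice capacity pressure has to be absorbed by raising `map_superblocks`; this is an engineering follow-up.
+4. `AIO` integration: the necessary adaptation is done, but it is not a full Umbra-native rework; the asynchronous read/write side cannot be claimed to have fully converged; this remains an engineering follow-up.
+5. `range-born / batch mapping publish`: there is no explicit upper-layer interface, and the `smgr` layer currently provides a conservative fallback; extending multiple blocks at once still relies on an implementation compatible with the existing AM/WAL ordering; this is a design/engineering follow-up.
+6. `primary/standby physical-page alignment`: this class of problem has been explicitly identified; on the local `no-image remap redo` path the current implementation already adds stronger publication / flush constraints such as `FlushOneBuffer()`, with local recovery coverage; but primary/standby physical-page alignment cannot yet be described as systematically settled, so stronger replication/recovery consistency claims should still be limited.
+
+Compressed into one sentence, the current PoC is closer to "the core correctness / recovery loop is established and already compiles, passes regression, and recovers in basic form, while still carrying a set of explicitly labeled host-tree follow-ups".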
+
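+## 15. Semantic layering of the current PoC
+
+This section only describes the semantic boundary each layer should carry if the current PoC is
+organized as `P1-P9`. It is not a mapping between arbitrary working branches and commit numbers,
+and it does not describe a later release cadence.
+
+The layering should not be cut mechanically along file directories; it should be understood along
+state-machine and owner boundaries: the earlier layers first establish the minimal recoverable
+mechanism, and the later layers then add engineering capabilities such as checkpoint, mapwriter,
+and compactor.
+
+The goal of the layering is that each layer can state which correctness owner or engineering boundary it introduces:
+
+- earlier layers establish only the basic mechanisms where possible, without mixing in later engineering follow-ups
+- later layers gradually introduce WAL/redo, checkpoint, mapwriter, compactor, recovery
+  tests, and similar capabilities
+- parts that are not finished yet are written explicitly as follow-ups rather than implied to be finished capabilities
+
+As defined by the current PoC branch, the more natural split order converges on `P1-P9`:
+
+- `P1`: establish the `smgr` implementation boundary and introduce the `--with-umbra` selection
+  point, ensuring the ordinary `md` path is unchanged
+- `P2`: introduce the `umfile` physical file layer and metadata storage primitives, first filling in
+  low-level capabilities such as physical files, segments, create / unlink, and read / write / extend / truncate
+- `P3`: introduce the on-disk metadata format and the identity-mapping startup path, so that the
+  metadata fork, superblock layout, and initial mapping state stand on their own first
+- `P4`: introduce the shared-memory MAP cache and the checkpoint flush foundation, so that caching,
+  materialize, and dirty / flush semantics for MAP metadata stand on their own first
+- `P5`: introduce the MAP access policy, logical-to-physical translation, and the materialization
+  contract, so that `MAIN/FSM/VM` keep using logical block numbers in the upper layers while
+  `lblk -> pblk` resolution happens below `smgr`
+- `P6`: introduce WAL records, mapped birth, and the redo state machine, first establishing the
+  minimal WAL/redo owners such as `MAP_SET`, truncate, metadata lifecycle, and skip-WAL pending
+- `P7`: introduce ordinary remap, block reference remap, and checkpoint-boundary FPW replacement,
+  closing the loop on the alternative representation of the ordinary checkpoint-boundary image path
+- `P8`: complete checkpoint / mapwriter write-back and physical preallocation, so that MAP metadata
+  write-back, background preallocation, and one-shot foreground preallocation under low-watermark
+  pressure have clear owners
+- `P9`: introduce the compactor framework and a policy of not disturbing the foreground,
+  consolidating inflight / barrier, reclaim, delayed unlink, and compactor relocation into a
+  background compaction framework
+
+The point of this order is that the earlier layers establish the correctness owner model first and the later layers gradually handle engineering pressure. In particular, host-tree integration issues such as the `CREATE DATABASE` copy strategy, AIO, and primary/standby physical-page consistency should not be disguised as part of a core mechanism that has already converged; they are better discussed separately as explicit follow-ups.
+
+`P1-P9` therefore expresses only the semantic boundaries of the current PoC: what belongs to the minimal core-correctness loop, what is engineering enhancement, and what is still only a compatibility fallback or a later integration item. Tests and documentation should likewise be assigned to the relevant semantic layer rather than split out into a generic "tests/docs layer".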
+
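+## 16. Summary
+
+Umbra does not claim that "PostgreSQL no longer needs full-page images", and it should not be described broadly as a new "storage engine". More precisely, it provides a remap-based way to express the recovery baseline for the ordinary checkpoint-boundary FPW path, at PostgreSQL's `storage manager` / physical storage layer. Its core is:
+
+- separate logical page identity from physical placement at the storage layer
+- use map entries to describe per-page mapping truth
+- use the superblock to describe fork-level global truth
+- use WAL to guarantee that these state changes are published atomically and recoverably
+
+On that basis:
+
+- the foreground always allocates a new physical page and does not dispose of the old page on the hot path
+- the mapwriter is mainly responsible for background smoothing on the MAP metadata side, rather than owning all extension responsibility
+- the compactor is responsible for long-term space convergence
+- inflight / barrier guarantee concurrency safety between foreground/background relocation and physical writes
+- file deletion and segment lifecycle are jointly constrained by the reclaim boundary, live mappings, checkpoints, and redo semantics
+
+The value of this design is not a single-point micro-optimization, nor can it be summed up as "turn off FPW". What it really challenges is the cost model bound to ordinary checkpoint-boundary FPW, while preserving as far as possible the crash-recovery semantics that `md + fpw=on` requires.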
+
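+## Appendix: implementation process transparency statement
+
+The way this code came about needs to be transparent. Umbra's core architecture, boundary
+definitions, and key state machines come from the author's own design and prototyping work on
+PostgreSQL storage / WAL / recovery semantics. The author also maintains the early validation
+prototype `shadow`:
+<https://github.com/nayishan/postgre_umbra/tree/shadow-pg12-archive>.
+
+To grow the prototype into a PoC of the current size, the author made heavy use of AI coding
+assistants (for example Codex) for concrete implementation, boilerplate expansion, and local
+refactoring. That implementation work relies heavily on the preceding logical analysis, the
+`shadow` prototype, and the code shape and call order of PostgreSQL's existing implementation.
+
+The boundary of responsibility also needs to be clear: the author is responsible for the core
+design, boundary definitions, and key logical judgments; AI was used mainly to speed up tedious
+implementation detail. Current AI still cannot independently master concurrent timing, the owner
+model, and crash-recovery semantics in a database kernel, so the code may still contain areas with
+inconsistent style or areas that need later engineering convergence. The current positioning
+remains a PoC, not a finished product with final host-tree polish.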
diff --git a/doc/umbra/WAL_AND_REDO.md b/doc/umbra/WAL_AND_REDO.md
new file mode 100644
index 0000000000..3feddd66a6
--- /dev/null
+++ b/doc/umbra/WAL_AND_REDO.md
@@ -0,0 +1,419 @@
+# Umbra WAL and Redo Semantics on PostgreSQL Master
+
+This document describes the current WAL payload and redo rules used by the
+PostgreSQL master Umbra PoC.
+
+The design has two WAL-visible pieces:
+
+- remap metadata attached to ordinary block references
+- Umbra rmgr records for MAP lifecycle operations
+
+Those two mechanisms are complementary. They solve different problems and are
+replayed in different layers.
+
+## 1. Ordinary Block Records with Remap Metadata
+
+Umbra extends ordinary WAL block references with an extra block-header payload
+when `BKPBLOCK_HAS_REMAP` is set.
+
+The full payload is:
+
+- `old_pblkno`
+- `new_pblkno`
+- `logical_nblocks`
+- `next_free_pblkno`
+
+The meaning of those fields is:
+
+- `old_pblkno`
+ - the old published physical baseline for this logical block
+ - `InvalidBlockNumber` means first published mapping
+- `new_pblkno`
+ - the physical block that becomes the new published target
+- `logical_nblocks`
+ - logical frontier payload needed when redo is publishing a first-born page
+- `next_free_pblkno`
+ - allocator frontier payload that keeps replay-side physical allocation
+ deterministic
+
+The remap header does not try to encode every superblock fact. It only carries
+the block-local transition plus the frontier state redo needs to keep the MAP
+view deterministic.
+
+The current record-level remap format is encoded in `xl_info`:
+
+- full remap:
+ - `old_pblkno`
+ - `new_pblkno`
+ - `logical_nblocks`
+ - `next_free_pblkno`
+- compact birth:
+ - `new_pblkno`
+ - `logical_nblocks`
+ - `next_free_pblkno`
+- ordinary slim:
+ - `old_pblkno`
+ - `new_pblkno`
+ - `next_free_pblkno`
+
+Compact birth records omit `old_pblkno`, because first-born remaps always use
+`InvalidBlockNumber` for the old physical block.
+
+Ordinary slim records omit `logical_nblocks`, because ordinary remap has a
+valid old physical baseline and does not publish a first-born logical EOF.
+
+The branch deliberately does not use a "tiny birth" header that carries only
+`new_pblkno`. Such a format is only safe when the replay-side frontier can be
+derived locally. That condition is too narrow for the current owner model, so
+compact birth remains the conservative birth fallback.
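
To make the three record-level formats concrete, here is a minimal encode/decode sketch. It assumes each field is a 4-byte `BlockNumber` written in header order; the enum `RemapFormat`, the struct `RemapPayload`, and both functions are hypothetical names for illustration and do not come from the patch, which encodes the format choice in `xl_info`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t BlockNumber;   /* same width as PostgreSQL's BlockNumber */
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Hypothetical tags for the three formats described above. */
typedef enum { REMAP_FULL, REMAP_COMPACT_BIRTH, REMAP_ORDINARY_SLIM } RemapFormat;

typedef struct RemapPayload
{
    BlockNumber old_pblkno;
    BlockNumber new_pblkno;
    BlockNumber logical_nblocks;
    BlockNumber next_free_pblkno;
} RemapPayload;

static size_t
put_blk(unsigned char *dst, size_t off, BlockNumber v)
{
    memcpy(dst + off, &v, sizeof v);
    return off + sizeof v;
}

static size_t
get_blk(const unsigned char *src, size_t off, BlockNumber *v)
{
    memcpy(v, src + off, sizeof *v);
    return off + sizeof *v;
}

/* Write only the fields the chosen format carries; returns bytes written. */
size_t
remap_encode(RemapFormat fmt, const RemapPayload *p, unsigned char *dst)
{
    size_t      off = 0;

    /* compact birth implies no old baseline; ordinary slim requires one */
    assert(fmt != REMAP_COMPACT_BIRTH || p->old_pblkno == InvalidBlockNumber);
    assert(fmt != REMAP_ORDINARY_SLIM || p->old_pblkno != InvalidBlockNumber);

    if (fmt != REMAP_COMPACT_BIRTH)
        off = put_blk(dst, off, p->old_pblkno);
    off = put_blk(dst, off, p->new_pblkno);
    if (fmt != REMAP_ORDINARY_SLIM)
        off = put_blk(dst, off, p->logical_nblocks);
    return put_blk(dst, off, p->next_free_pblkno);
}

/* Rebuild a full payload; omitted fields come back as InvalidBlockNumber. */
size_t
remap_decode(RemapFormat fmt, const unsigned char *src, RemapPayload *p)
{
    size_t      off = 0;

    p->old_pblkno = InvalidBlockNumber;         /* implied for compact birth */
    p->logical_nblocks = InvalidBlockNumber;    /* not carried by ordinary slim */

    if (fmt != REMAP_COMPACT_BIRTH)
        off = get_blk(src, off, &p->old_pblkno);
    off = get_blk(src, off, &p->new_pblkno);
    if (fmt != REMAP_ORDINARY_SLIM)
        off = get_blk(src, off, &p->logical_nblocks);
    return get_blk(src, off, &p->next_free_pblkno);
}
```

Under these assumptions the full header costs 16 bytes per block reference and each reduced form costs 12, which is the saving the compact birth and ordinary slim variants are after.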
+
+## 2. Producer-Side Decisions in `xloginsert.c`
+
+Producer-side logic lives in `XLogRecordAssembleUmbra()`.
+
+The code still starts from PostgreSQL's normal questions:
+
+- does this record need a backup image?
+- does it need data payload?
+
+Umbra then adds a second question:
+
+- does this block record need remap metadata?
+
+Those decisions are related, but they are not collapsed into one boolean.
+
+### 2.1 Automatic checkpoint-boundary remap
+
+For ordinary data-bearing records:
+
+- `REGBUF_FORCE_IMAGE` means:
+ - backup image yes
+ - automatic remap no
+- `REGBUF_NO_IMAGE` means:
+ - backup image no
+ - automatic remap no
+- `!doPageWrites` means:
+ - backup image no
+ - automatic remap no
+- `RM_XLOG_ID / XLOG_FPI_FOR_HINT` keeps md's hint-image rule:
+ - backup image if `page_lsn <= RedoRecPtr`
+ - no remap
+- ordinary checkpoint-boundary case means:
+ - no backup image
+ - remap if `page_lsn <= RedoRecPtr`
+
+That last rule is where Umbra replaces md's ordinary checkpoint-boundary backup
+image path with remap-aware WAL.
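
The decision list above can be written as one small function that returns two independent answers. This is a sketch under assumptions: the struct and function names are hypothetical, the flags are modeled as plain booleans, and the precedence (force-image, then no-image, then `!doPageWrites`) mirrors the order in which the cases are listed rather than quoting `XLogRecordAssembleUmbra()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Hypothetical stand-ins for the per-block inputs named in the list above. */
typedef struct BlockDecisionInput
{
    bool        force_image;    /* REGBUF_FORCE_IMAGE */
    bool        no_image;       /* REGBUF_NO_IMAGE */
    bool        do_page_writes; /* doPageWrites */
    bool        is_hint_fpi;    /* RM_XLOG_ID / XLOG_FPI_FOR_HINT */
    XLogRecPtr  page_lsn;
    XLogRecPtr  redo_rec_ptr;   /* RedoRecPtr */
} BlockDecisionInput;

typedef struct BlockDecision
{
    bool        needs_image;
    bool        needs_remap;    /* automatic checkpoint-boundary remap only */
} BlockDecision;

BlockDecision
decide_block(const BlockDecisionInput *in)
{
    BlockDecision d = {false, false};
    bool        first_touch = in->page_lsn <= in->redo_rec_ptr;

    /* the sketch assumes the two explicit flags are never combined */
    assert(!(in->force_image && in->no_image));

    if (in->force_image)
        d.needs_image = true;           /* image yes, automatic remap no */
    else if (in->no_image || !in->do_page_writes)
        ;                               /* neither image nor automatic remap */
    else if (in->is_hint_fpi)
        d.needs_image = first_touch;    /* md's hint-image rule, no remap */
    else
        d.needs_remap = first_touch;    /* Umbra replaces the image with remap */

    return d;
}
```

Keeping `needs_image` and `needs_remap` as separate outputs reflects the point that the two questions are related but not collapsed into one boolean.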
+
+### 2.2 `REGBUF_LOGICAL_BIRTH`
+
+Umbra also supports an explicit first-born owner path through
+`REGBUF_LOGICAL_BIRTH`.
+
+When a registered buffer is marked logical-birth and does not already carry
+remap metadata, WAL assembly:
+
+1. opens the relation
+2. tries to find an already-published mapping
+3. tries to find a pending reserved mapping
+4. only if both are absent, reserves a fresh physical block
+
+The outcome is then recorded as:
+
+- `old_pblkno = InvalidBlockNumber`
+- `new_pblkno = chosen physical block`
+- `has_remap = true`
+
+This is not "always allocate a new pblk immediately". It is "WAL assembly owns
+the first-born publication if no prior mapping or reservation already exists".
+The chosen `pblk` may come from a runtime reservation frontier, but that
+reservation is not yet committed superblock state.
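
The lookup order for a logical-birth buffer (published mapping, then pending reservation, then fresh reservation) can be sketched with a toy in-memory model. Everything here is an assumption made for illustration: `ToyRelation`, `toy_init()`, and `choose_birth_target()` are hypothetical, and the real code consults MAP state on the `SMgrRelation` rather than flat arrays.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)
#define TOY_NBLOCKS 8

/* Toy stand-in for one fork's mapping state. */
typedef struct ToyRelation
{
    BlockNumber published[TOY_NBLOCKS];     /* committed lblk -> pblk */
    BlockNumber pending[TOY_NBLOCKS];       /* reserved, not yet committed */
    BlockNumber reserve_frontier;           /* runtime reservation frontier */
} ToyRelation;

typedef struct BirthRemap
{
    BlockNumber old_pblkno;
    BlockNumber new_pblkno;
    bool        has_remap;
} BirthRemap;

void
toy_init(ToyRelation *rel, BlockNumber frontier)
{
    for (int i = 0; i < TOY_NBLOCKS; i++)
        rel->published[i] = rel->pending[i] = InvalidBlockNumber;
    rel->reserve_frontier = frontier;
}

/* Pick the physical target for a first-born logical block. */
BirthRemap
choose_birth_target(ToyRelation *rel, BlockNumber lblk)
{
    BlockNumber target;

    assert(lblk < TOY_NBLOCKS);

    target = rel->published[lblk];          /* already-published mapping */
    if (target == InvalidBlockNumber)
        target = rel->pending[lblk];        /* pending reserved mapping */
    if (target == InvalidBlockNumber)
    {
        /* only now reserve a fresh block; this is runtime state, not committed */
        target = rel->reserve_frontier++;
        rel->pending[lblk] = target;
    }

    return (BirthRemap) {InvalidBlockNumber, target, true};
}
```

Calling `choose_birth_target()` twice for the same `lblk` returns the same target and advances the reservation frontier only once, which is the "not always allocate a new pblk immediately" behavior described above.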
+
+### 2.3 When remap metadata is included
+
+The current inclusion rule is:
+
+- if the block already has remap metadata, include it
+- otherwise, if the automatic remap rule says this record needs remap, build
+ and include it
+
+So remap-bearing records come from two sources:
+
+- explicit logical-birth ownership
+- ordinary checkpoint-boundary remap
+
+### 2.4 Interaction with images
+
+When remap metadata is included:
+
+- `BKPBLOCK_HAS_REMAP` is set
+- the remap header is filled
+
+If the remap came from the ordinary checkpoint-boundary path, Umbra suppresses
+the ordinary backup image unless:
+
+- the caller explicitly forced an image, or
+- `XLR_CHECK_CONSISTENCY` requires one
+
+That means the current rule is not:
+
+- "full-page image and remap are always mutually exclusive"
+
+It is:
+
+- "automatic checkpoint-boundary remap replaces the default image path"
+- explicit image owners can still coexist with remap metadata
+
+## 3. Post-Insert Publication
+
+After WAL insertion succeeds, `XLogCommitBlockRemapsUmbra()` publishes the
+winner state for each block record that carried remap metadata.
+
+That commit step:
+
+- installs the new mapping with `UmMapSetMapping()`
+- bumps committed `next_free_pblkno` when needed
+- bumps `logical_nblocks` for first-born publication
+- updates cached relation size state for WAL-owned first-born pages
+- releases pending reservations
+
+This is an important owner boundary:
+
+- the block record is assembled before insert
+- publication becomes durable owner state only after insert succeeds
+- runtime reservation state may run ahead transiently, but checkpoint-visible
+ superblock state is published only at this boundary
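
The owner boundary described above can be illustrated with a toy commit step that does nothing unless the WAL insert succeeded. The types and the function `commit_block_remap()` are hypothetical; "bump" is modeled here as a monotonic maximum, which is an assumption, and the relation-size cache update and failure handling are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)
#define TOY_NBLOCKS 8

/* Toy committed state for one fork. */
typedef struct ToyFork
{
    BlockNumber map[TOY_NBLOCKS];       /* committed lblk -> pblk */
    BlockNumber pending[TOY_NBLOCKS];   /* runtime reservations */
    BlockNumber logical_nblocks;        /* committed logical frontier */
    BlockNumber next_free_pblkno;       /* committed allocator frontier */
} ToyFork;

typedef struct BlockRemap
{
    BlockNumber lblk;
    BlockNumber old_pblkno;
    BlockNumber new_pblkno;
    BlockNumber logical_nblocks;
    BlockNumber next_free_pblkno;
} BlockRemap;

void
toy_fork_init(ToyFork *fork)
{
    for (int i = 0; i < TOY_NBLOCKS; i++)
        fork->map[i] = fork->pending[i] = InvalidBlockNumber;
    fork->logical_nblocks = 0;
    fork->next_free_pblkno = 0;
}

/* Publish one remap-bearing block record after the insert attempt. */
void
commit_block_remap(ToyFork *fork, const BlockRemap *r, bool wal_insert_ok)
{
    assert(r->lblk < TOY_NBLOCKS);

    if (!wal_insert_ok)
        return;                 /* nothing becomes committed state */

    fork->map[r->lblk] = r->new_pblkno;

    if (r->next_free_pblkno != InvalidBlockNumber &&
        r->next_free_pblkno > fork->next_free_pblkno)
        fork->next_free_pblkno = r->next_free_pblkno;

    /* the logical frontier moves only for first-born publication */
    if (r->old_pblkno == InvalidBlockNumber &&
        r->logical_nblocks != InvalidBlockNumber &&
        r->logical_nblocks > fork->logical_nblocks)
        fork->logical_nblocks = r->logical_nblocks;

    fork->pending[r->lblk] = InvalidBlockNumber;    /* release the reservation */
}
```

In this model the pending reservation may exist well before the commit, but the committed map and both frontiers change only inside `commit_block_remap()`.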
+
+## 4. Umbra RMGR Records
+
+Umbra also has a small rmgr (`RM_UMBRA_ID`) for MAP lifecycle operations.
+
+Current records include:
+
+- `XLOG_UMBRA_MAP_SET`
+- `XLOG_UMBRA_RANGE_REMAP`
+- `XLOG_UMBRA_RANGE_REMAP_COMPACT`
+- `XLOG_UMBRA_SKIP_WAL_DENSE_MAP`
+- `XLOG_UMBRA_RECLAIM_UNLINK`
+
+`XLOG_UMBRA_SKIP_WAL_DENSE_MAP` is a redo anchor for skip-WAL relations. It
+does not replace the skip-WAL sync protocol. For each encoded fork it states:
+
+- `[0, nblocks)` is dense
+- `pblk == lblk` in that range
+- `logical_nblocks = nblocks`
+- `physical_nblocks = nblocks`
+- `next_free_pblkno = nblocks`
+
+The producer should not encode empty forks. An `nblocks == 0` entry has no
+mapping work and does not advance any frontier.
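
What a dense-map anchor asserts for one fork can be shown in a few lines. This is a toy model with hypothetical names (`ToyFork`, `apply_dense_map()`), not the redo routine for `XLOG_UMBRA_SKIP_WAL_DENSE_MAP`.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
#define TOY_NBLOCKS 8

/* Toy MAP state for one fork. */
typedef struct ToyFork
{
    BlockNumber map[TOY_NBLOCKS];   /* lblk -> pblk */
    BlockNumber logical_nblocks;
    BlockNumber physical_nblocks;
    BlockNumber next_free_pblkno;
} ToyFork;

/* Apply one per-fork dense-map entry covering [0, nblocks). */
void
apply_dense_map(ToyFork *fork, BlockNumber nblocks)
{
    assert(nblocks <= TOY_NBLOCKS);

    if (nblocks == 0)
        return;                     /* empty fork: no mapping work, no frontier change */

    for (BlockNumber lblk = 0; lblk < nblocks; lblk++)
        fork->map[lblk] = lblk;     /* identity mapping: pblk == lblk */

    fork->logical_nblocks = nblocks;
    fork->physical_nblocks = nblocks;
    fork->next_free_pblkno = nblocks;
}
```

The record only anchors mapping state; the data-file sync required by the skip-WAL protocol is a separate step that this sketch does not model.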
+
+These records are not a replacement for block-header remap metadata.
+
+Their purpose is different:
+
+- block-header remap metadata is for ordinary WAL block replay
+- Umbra rmgr records are for explicit MAP lifecycle actions such as
+ compactor/reclaim/state maintenance
+
+### 4.1 Why this branch does not use an explicit RangeMap / range-born owner model
+
+The current branch deliberately does not make `range-born / batch mapping
+publish` a first-class upper-layer contract.
+
+The reason is not that range publication is impossible. The problem is owner
+clarity.
+
+At the PostgreSQL call sites, Umbra currently has explicit ownership for:
+
+- one logical block being born for the first time
+- one logical block being remapped at a checkpoint-boundary WAL site
+- explicit Umbra-internal lifecycle records such as compactor/reclaim work
+
+What is still missing is a concrete upper-layer use site that already owns
+range publication as one semantic unit. A future example could be something
+like a hash-AM split/redistribution path where a well-defined logical range is
+materialized and published under one owner. The current branch does not wire
+such a caller yet.
+
+The branch does **not** yet have a generic upper-layer interface that says:
+
+- this WAL owner is publishing a whole logical range at once
+- this range has one well-defined ordering point
+- redo can treat that range as a single published unit
+
+Without that interface, a generic RangeMap-style contract would push too much
+ambiguity into WAL assembly and redo:
+
+- which layer owns range extent vs. per-block visibility
+- when logical EOF becomes durable for the whole range
+- how allocator frontier publication is synchronized with older AM/WAL ordering
+- whether a later block in the same range may become visible before an earlier
+ block's WAL ownership is fully established
+
+The current branch therefore stays conservative:
+
+- ordinary upper-layer WAL continues to publish remap state per block
+- first-born publication remains explicit per block
+- range-shaped operations are limited to internal Umbra-controlled lifecycle
+ paths where ownership is already local and bounded
+
+That is why the branch has range remap records for internal lifecycle work, but
+does not yet describe a generic upper-layer RangeMap contract as a settled
+feature.
+
+## 5. Redo Entry in `xlogutils.c`
+
+Redo-side interpretation lives in `XLogReadBufferForRedoExtendedUmbra()`.
+
+The redo entry layer owns:
+
+- metadata bootstrap for mapped-fork redo
+- redo-time MAP state seeding
+- interpretation of `has_remap`
+- distinction between remap-with-image and remap-without-image
+
+It intentionally does not push those semantics down into a generic read helper,
+because the generic helper does not know:
+
+- whether the record has remap metadata
+- whether the record has a block image
+- what `old_pblkno` and `new_pblkno` are
+
+## 6. Redo Bootstrap Before Replay
+
+Before replaying mapped-fork data, redo first ensures:
+
+- the relation is open
+- MAP state is seeded on the `SMgrRelation`
+- the metadata fork exists when the fork requires mapping
+
+That work is currently done by:
+
+- `XLogUmbraMapStateForRedo()`
+- `XLogUmbraEnsureMetadataForRedo()`
+- `XLogUmbraEnsureMappedBlockForRedo()`
+
+This is redo-only bootstrap logic and intentionally belongs at the redo-entry
+layer.
+
+## 7. Redo Cases
+
+### 7.1 No remap metadata
+
+If `has_remap` is false, Umbra first ensures metadata for mapped forks and then
+falls back to the ordinary PostgreSQL-style block restore/read path:
+
+- image -> restore image
+- no image -> read current block view and compare LSN
+
+This is the least interesting Umbra case; it mostly behaves like md plus
+metadata availability checks.
+
+### 7.2 Remap with image
+
+If `has_remap` and `has_image` are both true, redo:
+
+1. installs the new mapping immediately
+2. bumps `next_free_pblkno` if provided
+3. bumps `logical_nblocks` for first-born publication
+4. ensures the mapped block exists
+5. restores the block image into that new mapping view
+
+This is the current "phase-1" remap replay path.
+
+### 7.3 Remap without image, zero/init mode
+
+If `has_remap` is true, `has_image` is false, and redo is in zero/init mode,
+redo:
+
+1. installs the new mapping immediately
+2. bumps frontier payload as needed
+3. ensures the mapped block exists
+4. reads the block in zero/init mode
+
+This covers first-born and initialization-style replay where no old physical
+baseline is needed.
+
+### 7.4 Remap without image, ordinary mode
+
+This is the most Umbra-specific case.
+
+Redo requires a valid old physical baseline. A delta-only remap is therefore
+replayed as old physical page plus WAL delta, not as overwrite-in-place on the
+new physical page. Redo does not publish the new mapping first; instead it:
+
+1. temporarily installs `old_pblkno` as the current mapping
+2. ensures the old mapped block is readable
+3. reads and locks the buffer through that old mapping view
+4. dirties and flushes that buffer state
+5. switches the mapping to `new_pblkno`
+6. bumps `next_free_pblkno` if carried in the record
+
+The important rule is:
+
+- remap-without-image replay first reads through the old mapping view
+- it applies the WAL delta against that old physical baseline
+- it publishes the new mapping only after that baseline has been consumed
+
+That is what makes delta replay deterministic without requiring a full-page
+image in the ordinary checkpoint-boundary case.
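
The four redo cases in this section differ mainly in which mapping view the page is read through. The sketch below classifies a block reference and returns that view; `RedoBlock`, `RedoCase`, `classify_redo()`, and `read_view_pblk()` are hypothetical names used only to restate the rules, not functions from `xlogutils.c`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

typedef enum RedoCase
{
    REDO_NO_REMAP,              /* 7.1 */
    REDO_REMAP_WITH_IMAGE,      /* 7.2 */
    REDO_REMAP_ZERO_INIT,       /* 7.3 */
    REDO_REMAP_DELTA            /* 7.4 */
} RedoCase;

/* Hypothetical summary of one block reference as seen by redo. */
typedef struct RedoBlock
{
    bool        has_remap;
    bool        has_image;
    bool        zero_init_mode;
    BlockNumber old_pblkno;
    BlockNumber new_pblkno;
} RedoBlock;

RedoCase
classify_redo(const RedoBlock *b)
{
    if (!b->has_remap)
        return REDO_NO_REMAP;
    if (b->has_image)
        return REDO_REMAP_WITH_IMAGE;
    if (b->zero_init_mode)
        return REDO_REMAP_ZERO_INIT;

    /* delta-only replay needs a valid old physical baseline */
    assert(b->old_pblkno != InvalidBlockNumber);
    return REDO_REMAP_DELTA;
}

/* Physical block the page is read (or restored) through for this record. */
BlockNumber
read_view_pblk(const RedoBlock *b, BlockNumber current_pblk)
{
    switch (classify_redo(b))
    {
        case REDO_NO_REMAP:
            return current_pblk;    /* mapping is left as it is */
        case REDO_REMAP_DELTA:
            return b->old_pblkno;   /* old view first; new mapping published later */
        default:
            return b->new_pblkno;   /* new mapping is installed before the read */
    }
}
```

Only the delta case returns `old_pblkno`, which is the old-view/new-view split that keeps remap-without-image replay deterministic.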
+
+## 8. First-Born Pages
+
+A first-born page is identified by:
+
+- `old_pblkno == InvalidBlockNumber`
+
+Current first-born handling is split:
+
+- producer side may reserve and publish WAL-owned first-born remap metadata
+- post-insert publication bumps logical frontier
+- redo side uses `logical_nblocks` payload to keep replay-side logical EOF in
+ sync
+
+This avoids depending on generic `smgrextend()` ownership for WAL-owned logical
+births.
+
+## 9. Metadata Fork and Redo
+
+Mapped-fork redo depends on metadata-fork availability.
+
+The current rule is:
+
+- redo creates metadata when mapped replay requires it
+- normal data paths should not repeatedly rediscover metadata existence
+
+This keeps redo-only bootstrap in redo owner code instead of leaking it into
+unrelated access paths.
+
+## 10. Current Conservative Choices
+
+The branch still chooses conservative rules in a few places:
+
+- explicit image owners keep their image semantics
+- first-born and initialization cases carry dedicated frontier payload when it
+ cannot be derived from a stronger WAL anchor
+- checksum-driven hint FPIs still use PostgreSQL's `XLOG_FPI_FOR_HINT` path
+- redo keeps a very explicit old-view/new-view split for remap-without-image
+- remap format is record-level, so mixed birth/ordinary remap records can fall
+ back to the full header rather than using per-block variant tags
+
+Those rules are deliberate. The current branch favors deterministic ownership
+and clear replay state over collapsing every case into a WAL contract that is
+smaller but harder to reason about.
+
+## 11. Summary
+
+The current Umbra WAL/redo design can be summarized as:
+
+- ordinary block records may carry remap metadata
+- remap metadata records physical transition plus frontier state
+- WAL publication of that state is committed only after insert succeeds
+- redo explicitly distinguishes no-remap, remap-with-image, and
+ remap-without-image
+- Umbra rmgr records remain available for MAP lifecycle operations outside the
+ ordinary block-header remap path
+
+That is the basis on which the current master PoC reduces ordinary
+checkpoint-boundary backup-image pressure while keeping replay deterministic.
diff --git a/doc/umbra/WAL_AND_REDO_ZH.md b/doc/umbra/WAL_AND_REDO_ZH.md
new file mode 100644
index 0000000000..f850142bcb
--- /dev/null
+++ b/doc/umbra/WAL_AND_REDO_ZH.md
@@ -0,0 +1,248 @@
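+# Umbra WAL and redo semantics
+
+This document is the Chinese-edition companion to `WAL_AND_REDO.md`; it describes the WAL
+contents and redo rules of the current Umbra prototype.
+
+## 1. Two mechanisms that appear in WAL
+
+Umbra has two mechanisms that appear in WAL:
+
+- remap metadata on ordinary WAL block references;
+- Umbra's own rmgr records.
+
+Neither replaces the other:
+
+- the remap information in the block header lets ordinary WAL block replay find the correct physical baseline;
+- Umbra rmgr records express MAP lifecycle events.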
+
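+## 2. remap header
+
+When an ordinary block reference sets `BKPBLOCK_HAS_REMAP`, it carries the remap fields.
+
+The full set of fields is:
+
+- `old_pblkno`
+- `new_pblkno`
+- `logical_nblocks`
+- `next_free_pblkno`
+
+They mean:
+
+- `old_pblkno`
+  - the old published physical baseline of the current logical block;
+  - `InvalidBlockNumber` means first-born.
+- `new_pblkno`
+  - the new physical block about to be published.
+- `logical_nblocks`
+  - the logical EOF that must be advanced on first-born or range birth.
+- `next_free_pblkno`
+  - the committed allocation frontier;
+  - redo uses it to keep physical allocation deterministic.
+
+Note that `next_free_pblkno` is not necessarily equal to `new_pblkno + 1`. It represents the
+global, committed allocation frontier.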
+
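+## 3. WAL producer-side rules
+
+The WAL producer-side logic lives in `XLogRecordAssembleUmbra()`.
+
+PostgreSQL already decides:
+
+- whether a backup image is needed;
+- whether a data payload is needed.
+
+Umbra adds one more decision:
+
+- whether remap metadata is needed.
+
+These decisions cannot be collapsed into a single boolean.
+
+In the ordinary checkpoint-boundary case, if the page qualifies for automatic remap, Umbra uses
+remap metadata in place of the default full-page image path.
+
+The conservative boundaries are:
+
+- `REGBUF_FORCE_IMAGE` keeps its image semantics;
+- `REGBUF_NO_IMAGE` does not remap automatically;
+- `!doPageWrites` does not remap automatically;
+- `XLOG_FPI_FOR_HINT` continues to follow PostgreSQL's hint-image rule;
+- `XLR_CHECK_CONSISTENCY` keeps the consistency-check image.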
+
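+## 4. first-born
+
+`REGBUF_LOGICAL_BIRTH` is the explicit first-born owner path.
+
+During WAL assembly it:
+
+1. opens the relation;
+2. looks for an already-published mapping;
+3. looks for a pending reserved mapping;
+4. if neither exists, reserves a new physical block.
+
+It then records:
+
+- `old_pblkno = InvalidBlockNumber`
+- `new_pblkno = the chosen physical block`
+- `has_remap = true`
+
+The chosen physical block may come from the runtime reservation frontier, but until the WAL
+insert succeeds it is not yet committed state in the superblock.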
+
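+## 5. Publication after WAL insert
+
+`XLogCommitBlockRemapsUmbra()` publishes remap state after the WAL insert succeeds.
+
+It is responsible for:
+
+- installing the new `lblk -> pblk` mapping;
+- advancing the committed `next_free_pblkno` when needed;
+- advancing `logical_nblocks` on first-born;
+- updating the relation size cache for WAL-owned first-born pages;
+- releasing pending reservations.
+
+This boundary is very important:
+
+- the WAL assembly stage may choose the physical block first;
+- the committed mapping and frontier are published only after the WAL insert succeeds;
+- the runtime reservation frontier may run ahead;
+- the committed frontier in the on-disk superblock must not run ahead of WAL.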
+
+## 6. Umbra rmgr records
+
+The current Umbra rmgr records include:
+
+- `XLOG_UMBRA_MAP_SET`
+- `XLOG_UMBRA_RANGE_REMAP`
+- `XLOG_UMBRA_RANGE_REMAP_COMPACT`
+- `XLOG_UMBRA_SKIP_WAL_DENSE_MAP`
+- `XLOG_UMBRA_RECLAIM_UNLINK`
+
+Of these, `XLOG_UMBRA_SKIP_WAL_DENSE_MAP` is the mapping anchor for skip-WAL relations. It states:
+
+- `[0, nblocks)` is dense;
+- `pblk == lblk`;
+- `logical_nblocks = nblocks`;
+- `physical_nblocks = nblocks`;
+- `next_free_pblkno = nblocks`.
+
+It is not a substitute for `fsync` of the data files; the skip-WAL sync protocol still exists independently.
+
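+### 6.1 Why RangeMap / range-born is not a formal mechanism yet
+
+The current branch deliberately does **not** make `range-born / batch mapping publish` a formal
+upper-layer contract.
+
+The reason is not that "range publication cannot be done"; it is that a sufficiently clear owner boundary is still missing.
+
+At PostgreSQL's existing call sites, the owners Umbra can currently identify unambiguously are mainly:
+
+- first-born of a single logical block;
+- remap of a single logical block at a checkpoint boundary;
+- Umbra-internal lifecycle records such as `compactor` / `reclaim`.
+
+What is really missing is an upper-layer use site that already owns publication semantics by
+"range". If a path such as a hash access method's split / redistribution could extend, WAL-log,
+and publish a logical range as a single owner unit, range remap would have a natural home.
+The current branch has not wired in such a call site.
+
+It also does not yet have a generic upper-layer interface that can state explicitly:
+
+- this WAL owner is publishing a logical range at once;
+- this range has one clear, unique ordering point;
+- redo can treat the whole range as a single published unit.
+
+Forcing in a generic RangeMap-style contract without such an interface would push too much
+ambiguity into WAL assembly and redo:
+
+- who owns the range extent, and who owns per-block visibility;
+- when the logical EOF of the whole range counts as durable;
+- how allocation-frontier publication stays consistent with the existing AM / WAL ordering;
+- whether a later block in the same range could become visible before an earlier block's WAL ownership is established.
+
+The current branch therefore takes the more conservative approach:
+
+- ordinary upper-layer WAL still publishes remap state per block;
+- first-born is still published explicitly per block;
+- range-shaped operations are used only on internal lifecycle paths that Umbra fully controls
+  and whose owner boundary is already tight.
+
+That is why the current branch has range remap records but cannot yet describe a "generic
+upper-layer RangeMap contract" as a settled capability.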
+
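+## 7. Redo entry
+
+The core redo-side entry point is `XLogReadBufferForRedoExtendedUmbra()`.
+
+The redo entry layer is responsible for:
+
+- metadata bootstrap for mapped forks;
+- seeding MAP state during redo;
+- interpreting `has_remap`;
+- distinguishing remap with image from remap without image.
+
+This logic cannot be pushed down into the ordinary read helper, because that helper does not
+understand the ownership semantics of the remap header.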
+
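+## 8. Redo cases
+
+### 8.1 No remap
+
+When there is no remap metadata, Umbra first ensures that metadata exists and then follows a path
+close to PostgreSQL's ordinary redo:
+
+- image present: restore the image;
+- no image: read the current block view and compare LSN.
+
+### 8.2 Remap with image
+
+With both remap and image present:
+
+1. install the new mapping first;
+2. advance the frontier;
+3. advance the logical EOF on first-born;
+4. ensure the mapped block exists;
+5. restore the image into the new mapping view.
+
+### 8.3 Remap without image (zero/init)
+
+With remap present, no image, and redo in zero/init mode:
+
+1. install the new mapping;
+2. advance the frontier;
+3. ensure the mapped block exists;
+4. read in zero/init mode.
+
+### 8.4 Remap without image (ordinary delta replay)
+
+This is the most important Umbra case.
+
+An ordinary delta remap without an image needs the old physical baseline. Its replay semantics are
+"old physical page + WAL delta", not overwrite-in-place on the new physical page. Redo therefore
+cannot publish the new mapping first; it must:
+
+1. temporarily install `old_pblkno`;
+2. read the page through the old mapping view;
+3. lock and modify the buffer;
+4. switch to `new_pblkno` once the old baseline has been consumed;
+5. advance `next_free_pblkno`.
+
+The core rules are:
+
+- remap-without-image redo first reads the page through the old mapping view;
+- the WAL delta is applied against that old physical baseline;
+- redo publishes the new mapping only after the old baseline has been consumed.
+
+As a result, the ordinary checkpoint-boundary case keeps redo deterministic without depending on
+a full-page image.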
+
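+## 9. Summary
+
+The current WAL / redo model can be summarized as:
+
+- ordinary block records may carry remap metadata;
+- remap metadata expresses the physical transition and frontier information;
+- the committed mapping and frontier are published only after the WAL insert succeeds;
+- redo explicitly distinguishes no remap, remap with image, and remap without image;
+- Umbra rmgr records cover MAP lifecycle events outside the ordinary block header.
+
+This model is the correctness basis on which Umbra reduces ordinary FPI pressure at checkpoint boundaries.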
--
2.50.1 (Apple Git-155)