| From: | Haibo Yan <tristan(dot)yim(at)gmail(dot)com> |
|---|---|
| To: | Mohamed ALi <moali(dot)pg(at)gmail(dot)com> |
| Cc: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
| Subject: | Re: [PATCH] Fix: Partitioned parent index remains invalid after child indexes are repaired |
| Date: | 2026-04-11 00:55:29 |
| Message-ID: | CABXr29GbJe7-vbm0Fn=HWx3vQonwFtv_zO1pSc5FZAvjLuUDVA@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Wed, Apr 8, 2026 at 11:17 PM Mohamed ALi <moali(dot)pg(at)gmail(dot)com> wrote:
> Hi hackers,
>
> A partitioned (parent) index in PostgreSQL can become permanently
> stuck with `indisvalid = false` even after all of its child partition
> indexes have been repaired and are valid. There is no built-in
> mechanism to re-validate the parent index after a child is fixed via
> `REINDEX`. This affects all currently supported PostgreSQL versions
> (13 through 18)
> The root cause is that `validatePartitionedIndex()` — the only
> function that can mark a partitioned index as valid is never called
> after `REINDEX` operations, and is skipped when re-running `ALTER
> INDEX ATTACH PARTITION` on an already-attached index.
>
> How the Bug Manifests
>
> Typical Scenario :
> 1. A partitioned table has multiple partitions.
> 2. The user creates indexes on partitions concurrently. One fails (due
> to deadlock, cancellation, timeout, etc.), leaving an invalid
> partition index.
> 3. A parent index is created (or the invalid index is attached to an
> existing parent). The parent is correctly marked `indisvalid = false`
> because at least one child is invalid.
> 4. The user fixes the broken child index with `REINDEX INDEX CONCURRENTLY`.
> 5. The child index becomes valid (`indisvalid = true`).
> 6. The parent index remains `indisvalid = false` permanently. No SQL
> command can fix it.
>
> Reproduction steps:
>
> ```sql
> -- ============================================================
> -- SETUP: Partitioned table with two partitions and sample data
> -- ============================================================
> DROP TABLE IF EXISTS orders;
> CREATE TABLE orders (
> id serial,
> order_date date NOT NULL,
> amount numeric
> ) PARTITION BY RANGE (order_date);
> CREATE TABLE orders_2023 PARTITION OF orders
> FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
> CREATE TABLE orders_2024 PARTITION OF orders
> FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
> INSERT INTO orders (order_date, amount)
> SELECT d, random() * 1000
> FROM generate_series('2023-01-01'::date, '2023-12-31'::date, '1 day') d;
> INSERT INTO orders (order_date, amount)
> SELECT d, random() * 1000
> FROM generate_series('2024-01-01'::date, '2024-12-31'::date, '1 day') d;
> -- ============================================================
> -- STEP 1: Create parent index with ONLY (starts as invalid)
> -- ============================================================
> CREATE INDEX orders_amount_idx ON ONLY orders (amount);
> -- Verify: parent index is invalid (no children attached yet)
> SELECT c.relname, i.indisvalid
> FROM pg_class c
> JOIN pg_index i ON c.oid = i.indexrelid
> WHERE c.relname LIKE 'orders%idx%'
> ORDER BY c.relname;
> -- Expected:
> -- orders_amount_idx | f
> -- ============================================================
> -- STEP 2: Create valid index on first partition
> -- ============================================================
> CREATE INDEX CONCURRENTLY orders_2023_amount_idx ON orders_2023 (amount);
> -- ============================================================
> -- STEP 3: Create an INVALID index on second partition
> -- ============================================================
> -- In a separate session, hold a lock:
> BEGIN; LOCK TABLE orders_2024 IN SHARE MODE;
> -- Then in the main session:
> SET statement_timeout = '1ms';
> CREATE INDEX CONCURRENTLY orders_2024_amount_idx ON orders_2024 (amount);
> RESET statement_timeout;
> -- it will fail/timeout, leaving an invalid index.
> -- Verify state:
> SELECT c.relname, i.indisvalid
> FROM pg_class c
> JOIN pg_index i ON c.oid = i.indexrelid
> WHERE c.relname LIKE 'orders%idx%'
> ORDER BY c.relname;
> -- Expected:
> -- orders_2023_amount_idx | t (valid)
> -- orders_2024_amount_idx | f (invalid)
> -- orders_amount_idx | f (invalid, created with ONLY)
> -- ============================================================
> -- STEP 4: Attach both partition indexes to the parent
> -- ============================================================
> -- Attach the invalid one first
> ALTER INDEX orders_amount_idx ATTACH PARTITION orders_2024_amount_idx;
> -- Succeeds. Parent stays invalid (correct — child is invalid).
> -- Attach the valid one
> ALTER INDEX orders_amount_idx ATTACH PARTITION orders_2023_amount_idx;
> -- Succeeds. Parent still invalid (correct — one child still invalid).
> -- Verify attachment and validity:
> SELECT c.relname, i.indisvalid,
> pg_get_indexdef(i.indexrelid) AS indexdef
> FROM pg_class c
> JOIN pg_index i ON c.oid = i.indexrelid
> WHERE c.relname LIKE 'orders%amount%'
> ORDER BY c.relname;
> -- Expected:
> -- orders_2023_amount_idx | t
> -- orders_2024_amount_idx | f
> -- orders_amount_idx | f
> -- ============================================================
> -- STEP 5: Fix the invalid child index via REINDEX
> -- ============================================================
> REINDEX INDEX CONCURRENTLY orders_2024_amount_idx;
> -- Verify: child is now valid
> SELECT c.relname, i.indisvalid
> FROM pg_class c
> JOIN pg_index i ON c.oid = i.indexrelid
> WHERE c.relname LIKE 'orders%amount%'
> ORDER BY c.relname;
> -- ACTUAL (buggy) result:
> -- orders_2023_amount_idx | t (valid)
> -- orders_2024_amount_idx | t (valid — fixed by REINDEX)
> -- orders_amount_idx | f (STILL INVALID — this is the bug!)
> --
> -- EXPECTED result (if bug were fixed):
> -- orders_2023_amount_idx | t
> -- orders_2024_amount_idx | t
> -- orders_amount_idx | t (should be valid now)
> -- ============================================================
> -- STEP 6: Demonstrate that re-running ATTACH does not help
> -- ============================================================
> ALTER INDEX orders_amount_idx ATTACH PARTITION orders_2024_amount_idx;
> -- Returns "ALTER INDEX" (succeeds silently, does nothing)
> SELECT c.relname, i.indisvalid
> FROM pg_class c
> JOIN pg_index i ON c.oid = i.indexrelid
> WHERE c.relname LIKE 'orders%amount%'
> ORDER BY c.relname;
> -- Parent is STILL invalid. The "silently do nothing" path
> -- skips validatePartitionedIndex() entirely.
> -- ============================================================
> -- CLEANUP
> -- ============================================================
> DROP TABLE orders;
> ```
>
>
> Root Cause Analysis:
>
> Where `validatePartitionedIndex()` Is Called
>
> The function is called in exactly these code paths:
> 1. During `ALTER INDEX ... ATTACH PARTITION` — inside
> `ATExecAttachPartitionIdx()`
> 2. During `ALTER TABLE ... ATTACH PARTITION` — via
> `AttachPartitionEnsureIndexes()`
> 3. During `CREATE INDEX` on partitioned tables — via `DefineIndex()`
> It is NOT called:
> - After `REINDEX` of a partitioned index
> - During any maintenance operation
> - As any periodic validation check
>
> Bug 1: REINDEX Does Not Validate Parent
>
>
> When `reindex_index()` in `src/backend/catalog/index.c` marks a
> partition index as valid (setting `indisvalid = true`), it does not
> check whether the parent partitioned index should also become valid.
> The function simply updates the child's `pg_index` entry and returns.
>
> Bug 2: Re-running ATTACH Skips Validation
>
>
> In `ATExecAttachPartitionIdx()` (tablecmds.c, around line 21923 in PG
> 16 / line ~22900 in HEAD):
>
> https://github.com/postgres/postgres/blob/master/src/backend/commands/tablecmds.c#L21923
>
> ```c
> /* Silently do nothing if already in the right state */
> currParent = partIdx->rd_rel->relispartition ?
> get_partition_parent(partIdxId, false) : InvalidOid;
> if (currParent != RelationGetRelid(parentIdx))
> {
> // ... all validation checks and attachment logic ...
> validatePartitionedIndex(parentIdx, parentTbl); // ONLY called here
> }
> // If already attached, entire block is skipped — no validation!
> ```
>
> When the child is already attached (`currParent == parentIdx`), the
> condition is false, the entire if-block is skipped, and
> `validatePartitionedIndex()` is never called. The comment "Silently do
> nothing if already in the right state" is misleading "already
> attached" does not mean "parent validity is correct."
>
> Proposed Fixes:
>
> Fix 1 : Always Validate Parent Index in ALTER INDEX ATTACH PARTITION
>
> Patch File : 0001-Always-validate-parent-index-in-ALTER-INDEX-ATTACH.patch
>
> Move the validatePartitionedIndex() call outside the if-block so it runs
> unconditionally — both when a new attachment is made and when the
> partition is
> already attached. This provides a user-accessible recovery path: after
> fixing a
> child index with REINDEX, re-running ALTER INDEX ATTACH PARTITION triggers
> parent validation.
>
> When the partition is already attached, a NOTICE is emitted:
>
> NOTICE: partition index "child_idx" is already attached to
> "parent_idx", validating parent index
>
>
> This follows PostgreSQL's existing convention of using NOTICE for
> informational messages about no-op or reduced-scope operations (e.g.,
> DROP TABLE IF EXISTS, CREATE INDEX IF NOT EXISTS). It tells the user:
>
> 1- Nothing went wrong
> 2- The index was already attached (so they know the state)
> 3- Validation still happened (so they know the fix path works)
>
>
> Fix 2: Validate Parent Partitioned Index After REINDEX of Child
>
> Patch File : 0001-Validate-parent-partitioned-index-after-REINDEX.patch
>
> Same underlying bug but this patch addresses it from the
> REINDEX side. When a partition index is repaired via REINDEX or
> REINDEX CONCURRENTLY, the parent partitioned index remains permanently
> stuck with indisvalid = false even though all children are now valid.
>
> This is because validatePartitionedIndex() — the only function that can
> mark a partitioned index as valid is never called from any REINDEX code
> path.
>
>
> validatePartitionedIndex() is only called during:
>
> 1- ALTER INDEX ... ATTACH PARTITION (tablecmds.c)
> 2- ALTER TABLE ... ATTACH PARTITION (tablecmds.c)
> 3- CREATE INDEX on partitioned tables (indexcmds.c)
>
> It is NOT called after:
>
> 1- REINDEX INDEX (regular) — handled by reindex_index() in index.c
> 2- REINDEX INDEX CONCURRENTLY — handled by ReindexRelationConcurrently()
>
> in indexcmds.c, which uses index_concurrently_swap() in index.c
>
> Three changes are made:
>
> 1. Make validatePartitionedIndex() public
> The function was static in tablecmds.c. It is now exported via
> tablecmds.h so it can be called from index.c and indexcmds.c.
>
> Files changed:
>
> src/backend/commands/tablecmds.c — remove static, update comment
> src/include/commands/tablecmds.h — add extern declaration
>
> 2. Call from reindex_index() (regular REINDEX)
> After reindex_index() marks a partition index as valid (indisvalid = true),
> check if the index is a partition (iRel->rd_rel->relispartition) and if so,
> look up the parent and call validatePartitionedIndex().
>
> A CommandCounterIncrement() is required before the call so that the child's
> updated indisvalid is visible to the syscache lookup that
> validatePartitionedIndex() performs internally.
>
> File changed: src/backend/catalog/index.c
>
> 3. Call from ReindexRelationConcurrently() (REINDEX CONCURRENTLY)
> REINDEX CONCURRENTLY uses a completely different code path: it creates a
> new
> index, builds it concurrently, then swaps it with the old one via
> index_concurrently_swap(). The new index inherits the old index's partition
> status during the swap.
>
> After the swap and the existing CommandCounterIncrement() (which makes the
> swap visible), check if the new index is a partition and call
> validatePartitionedIndex() on the parent.
>
> File changed: src/backend/commands/indexcmds.c
>
> Multi-level Hierarchy Support
> validatePartitionedIndex() already handles multi-level partition
> hierarchies.
> When it marks a mid-level parent valid, it checks if that parent is itself
> a
> partition and recursively validates the grandparent. No additional
> recursion
> logic is needed in the REINDEX patches.
>
>
> Thanks,
> Mohamed Ali
> Senior DBE
> AWS RDS
>
Hi, Mohamed
Thanks for working on this. I went through the problem statement and the
two proposed fixes. I agree that the underlying issue looks real: after
repairing an invalid child index with REINDEX, the parent partitioned index
can remain stuck in an invalid state because validatePartitionedIndex() is
not reached from the relevant REINDEX paths. The analysis around
ATExecAttachPartitionIdx() also looks correct: when the child index is
already attached, the current code takes the no-op path and skips the
validation call entirely.
Overall, I think this is worth fixing, but I do not view the two patches
equally.
I think patch 2 is the real fix. The state transition that matters here is
that a child index goes from invalid to valid, and the natural place to
trigger parent revalidation is where that transition actually happens,
namely in REINDEX. By contrast, patch 1 feels more like a secondary
hardening/workaround path: it makes ALTER INDEX ... ATTACH PARTITION retry
parent validation even in the already-attached case, but that is not really
where the underlying state change happens.
My main comments are below.
1. Patch 2 should be treated as the primary fix
This seems like the correct place to repair the catalog state. If REINDEX
repairs a partition index that was previously invalid, and that index is
attached to a partitioned parent index, then rechecking the parent with
validatePartitionedIndex() is a reasonable and direct solution.
I also think it is good that the patch covers both the regular REINDEX path
and the REINDEX CONCURRENTLY path. Those are distinct enough that both need
explicit attention, and the extra CommandCounterIncrement() before
revalidation also seems necessary.
So at a high level, this patch makes sense to me.
2. I am less convinced by patch 1 in its current form
The main issue here is not correctness, but design and placement.
Once the child is already attached, ALTER INDEX ... ATTACH PARTITION is
conceptually supposed to be a no-op. With this patch, it becomes a generic
“retry parent validation” hook. That means users can run an attach command
that does not change attachment state at all, yet still triggers a full
parent validation attempt.
That is especially questionable if there are still other invalid child
indexes elsewhere under the same parent. In that case, each
already-attached ATTACH command will do the validation work again, but it
is already known in advance that the parent still cannot become valid. So
this is not “reinvalidating” anything, but it is repeated checking with no
state change, which feels misplaced on a no-op path.
Because of that, I do not think patch 1 should be the main bug fix. At
most, I would see it as an optional hardening patch.
3. The NOTICE added by patch 1 does not seem like a good fit
The existing code explicitly says “Silently do nothing if already in the
right state”. Changing that into a NOTICE every time we hit the
already-attached case seems noisy to me.
If the community decides that the validation call should stay in this path,
I would still suggest dropping the NOTICE. The command is syntactically an
ATTACH command, not a repair command, and emitting a message for an
otherwise no-op case does not feel very PostgreSQL-like.
4. Please double-check coverage of all REINDEX entry points
I agree with the direction in patch 2, but I think reviewers will want
confidence that all relevant REINDEX flows are covered consistently.
For example, it would be good to confirm that the fix behaves correctly not
just for REINDEX INDEX, but also for the broader forms that eventually
reach the same logic, such as REINDEX TABLE, and that there is no alternate
path where a repaired child index can still leave the parent stale.
5. Multi-level partition trees need explicit testing
One especially important point here is recursion. If a repaired child index
causes its immediate parent partitioned index to become valid, and that
parent is itself a child of another partitioned index, we need to be sure
that validity propagates all the way up as intended.
The current reasoning suggests that validatePartitionedIndex() already
handles that, but this is important enough that it should be demonstrated
with a regression test, not just assumed.
Minor comments:
-
In patch 1, I would avoid turning the already-attached path into a
behavioral special case unless there is a strong reason to keep it. It
would be cleaner if the fix relied primarily on the actual state-changing
paths.
-
If patch 1 remains, I would remove the NOTICE.
-
The commit message for patch 2 should clearly explain why REINDEX is the
right place to do this, rather than making ATTACH PARTITION the repair
mechanism.
-
It may also help to mention explicitly whether the extra validation call
is only intended for indexes that were previously invalid and have just
been repaired, since that is an important part of why the patch is
reasonably narrow.
I do not think the current patches are complete without regression
coverage. At minimum, I think the following tests should be added:
1.
Regular REINDEX case
Create a partitioned table with a parent index left invalid initially,
then attach child indexes such that one child index is invalid. Repair that
child with plain REINDEX INDEX, and verify that the child becomes valid
and the parent also becomes valid.
2.
REINDEX CONCURRENTLY case
Same setup, but use REINDEX INDEX CONCURRENTLY. This should be tested
separately because the code path is different.
3.
Multi-level partition hierarchy
Use at least a three-level hierarchy and verify that repairing a
leaf-level invalid child index can cause validity to propagate upward
through intermediate partitioned indexes.
4.
Negative case
Repair one child index while another child index under the same parent
remains invalid, and verify that the parent remains invalid.
5.
Already-attached ATTACH PARTITION case
Only if patch 1 is kept: exercise ALTER INDEX ... ATTACH PARTITION on a
child index that is already attached, and verify both behaviors:
-
parent becomes valid if all children are now valid
-
parent stays invalid if some other child is still invalid
My overall view is:
- The bug itself looks real.
- The diagnosis looks sound.
- Patch 2 is the right direction and should be the primary fix.
- Patch 1 is much less compelling as written, and I would not take it as
the main solution.
- The series needs regression tests before it is ready.
Regards,
Haibo
| From | Date | Subject | |
|---|---|---|---|
| Next Message | CharSyam | 2026-04-11 01:17:07 | Re: [PATCH] Use cached hash to skip unnecessary key comparisons in dshash |
| Previous Message | Haibo Yan | 2026-04-10 23:34:21 | Re: Eliminating SPI / SQL from some RI triggers - take 3 |