| From: | Antonin Houska <ah(at)cybertec(dot)at> |
|---|---|
| To: | Michail Nikolaev <michail(dot)nikolaev(at)gmail(dot)com> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
| Subject: | Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements |
| Date: | 2025-11-27 16:56:25 |
| Message-ID: | 17483.1764262585@localhost |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Michail Nikolaev <michail(dot)nikolaev(at)gmail(dot)com> wrote:
> I think about revisiting (1) ({CREATE INDEX, REINDEX} CONCURRENTLY
> improvements) in some lighter way.
I haven't read the whole thread yet, but the effort to minimize the impact of
C/RIC on VACUUM seems to prevail. Following is one more proposal. The core
idea is that C/RIC should avoid indexing dead tuples, however snapshot is not
necessary to distinguish dead tuple from a live one. And w/o snapshot, the
backend executing C/RIC does not restrict VACUUM on other tables.
Concurrent (re)build of unique index appears to be another topic of this
thread, but I think this approach should handle the problem too. The workflow
is:
1. Create an empty index.
2. Wait until all transactions are aware of the index, so they take the new
index into account when deciding on new HOT chains. (This is already
implemented.)
3. Set the 'indisready' flag so the index is ready for insertions.
4. While other transactions can insert their tuples into the index now,
process the table one page at a time this way:
4.1 Acquire (shared) content lock on the buffer.
4.3 Collect the root tuples of HOT chains - these and only these need to be
inserted into the index.
4.4 Unlock the buffer.
5. Once the whole table is processed, insert the collected tuples into the
index.
To avoid insertions of tuples that concurrent transactions have just
inserted, we'd need something like index.c:validate_index() (i.e. insert
into the index only the tuples that it does not contain yet), but w/o
snapshot because we already have the heap tuples collected.
Also it'd make sense to wait for completion of all the transactions that
currently have the table locked for INSERT/UPDATE: some of these might have
inserted their tuples into the heap, but not yet into the index. If we
included some of those tuples into our collection and insert them into the
index first, the other transactions could end up with ERROR when inserting
those tuples again.
6. Set the 'indisvalid' flag so that the index can be used by queries.
Note on pruning: As we only deal with the root tuples of HOT chains (4.3),
page pruning triggered by queries (heap_page_prune_opt) should not be
disruptive. Actually C/RIC can do the pruning itself it it appears to be
useful. For example, if whole HOT chain should be considered DEAD by the next
VACUUM, pruning is likely (depending on the OldestXid) to remove it so that we
do not insert TID of the root tuple into the index unnecessarily.
I can even think of letting VACUUM run on the same table that C/RIC is
processing. In that case, interlocking would take place at page level: either
C/RIC or VACUUM can acquire lock for particular page, but not both. This would
be useful in cases C/RIC takes very long time.
In this case, C/RIC *must not* insert TIDs of dead tuples into the index at
all. Otherwise there could be race conditions such that VACUUM removes dead
tuples from the index and marks the corresponding heap items as UNUSED, but
C/RIC then re-inserts the index tuples.
To avoid this problem, C/RIC needs to re-check each TID before it inserts it
into the index and skip the insertion if the tuple (or the whole HOT-chain
starting at this tuple) it points to is DEAD according to the OldestXmin that
the most recent VACUUM used. (VACUUM could perhaps advertise its OldestXmin
for C/RIC via shared memory.)
Also, before this re-checking starts, it must be ensured that VACUUM does not
start again, until the index creation is complete: a new run of VACUUM implies
a new value of OldestXmin, i.e. need for more stringent re-checking of the
heap tuples.
Related question is which OldestXmin to use in the step 4.3. One option is to
use *exactly* the OldestXmin shared VACUUM. However that wouldn't work if
VACUUM starts while C/RIC is already in progress. (Which seems like a
significant restriction.)
Another option is to get the OldestXmin in the same way as VACUUM
does. However, the value can thus be different from the one used by VACUUM:
older if retrieved before VACUUM started and newer if retrieved while VACUUM
was already running. The first case can be handled by the heap tuple
re-checking (see above). The latter implies that, before setting 'indisvalid',
C/RIC has to wait until all snapshots have their xmin >= this (more recent)
OldestXmin. Otherwise some snapshots could miss data they should see.
(An implication of the rule that C/RIC must not insert TIDs of dead tuples
into the index is that VACUUM does not have to call the index AM bulk delete
while C/RIC is running for that index. This would be just an optimization.)
Of course, I could have missed some important point, so please explain why
this concept is broken :-) Or let me know if something needs to be explained
more in detail. Thanks.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Tom Lane | 2025-11-27 17:00:37 | Re: Partial hash index is not used for implied qual. |
| Previous Message | Tom Lane | 2025-11-27 16:55:21 | Re: Second RewriteQuery complains about first RewriteQuery in edge case |