From 47bd703c89f60233763c9dd5ac179e45fe41b551 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan
Date: Wed, 25 Mar 2026 16:58:09 -0400
Subject: [PATCH v28 08/11] heapam: Add index scan I/O prefetching.

This commit implements I/O prefetching for index scans (and for
index-only scans that require heap fetches). This was made possible by
the recent addition of batching interfaces to both the table AM and
index AM APIs.

The amgetbatch index AM interface returns batches of matching TIDs
(rather than one tuple at a time). Each batch is taken from index
tuples that appear together on a single index page, and multiple
batches can be held open simultaneously. Giving the table AM an
explicit understanding of index AM concepts such as index page
boundaries allows it to weigh all of the relevant costs and benefits.

Prefetching is implemented using a prefetching position under the
control of the table AM. This is closely related to the scan position
added by commit FIXME, which introduced the amgetbatch interface. A
read stream callback advances the prefetching position as needed to
supply enough heap block numbers to maintain the read stream's target
prefetch distance.

Testing has shown that index prefetching can make index scans much
faster. Large range scans that return many tuples can be as much as
30x faster with local SSDs when buffered I/O is used, and 50x faster or
more with higher-latency storage such as network-attached block
devices, where the benefit of hiding I/O latency through prefetching is
even greater.

An important goal of the amgetbatch design is to enable the table AM's
read stream callback to advance its prefetching position using TIDs
that appear on a leaf page that's ahead of the current scan position's
leaf page. This is crucial with scans of indexes where each leaf page
has relatively few distinct heap blocks among its matching TIDs (and
with scans whose leaf pages have relatively few matching items in
total).
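To make the previous paragraph concrete, here is a hypothetical standalone C model. It is not code from this patch, and ToyBatch, ToyScan, and toy_next_prefetch_block are made-up names. The model makes three assumptions:

- each batch is just the array of heap block numbers taken from one leaf page's matching TIDs;
- the scan only moves forward;
- visibility checks, batch guards, and the ring buffer limit can be ignored.

heapam_index_heap_fetch only asks the read stream for a buffer when the scan switches heap blocks. The model's callback therefore skips runs of TIDs that point to the block it returned last, and steps onto a later leaf page's batch whenever the current one is exhausted.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_INVALID_BLOCK UINT32_MAX	/* stand-in for InvalidBlockNumber */

/* One batch: heap block numbers of the matching TIDs from one leaf page */
typedef struct ToyBatch
{
	const uint32_t *blocks;
	int			nitems;
} ToyBatch;

/* A (batch, item) position, loosely analogous to BatchRingItemPos */
typedef struct ToyPos
{
	int			batch;
	int			item;
} ToyPos;

typedef struct ToyScan
{
	const ToyBatch *batches;	/* every batch the index scan will return */
	int			nbatches;		/* must be at least 1 */
	ToyPos		prefetchPos;	/* start at {0, -1}: just before first item */
	uint32_t	lastBlock;		/* last block handed to the "read stream" */
} ToyScan;

/*
 * Read stream callback stand-in: advance prefetchPos (forward scans only) to
 * the next TID whose heap block differs from the last one returned, moving
 * on to later batches (leaf pages) as needed.  Returns TOY_INVALID_BLOCK
 * once every batch has been exhausted.
 */
static uint32_t
toy_next_prefetch_block(ToyScan *s)
{
	ToyPos	   *pos = &s->prefetchPos;

	assert(s->nbatches > 0);
	for (;;)
	{
		uint32_t	blk;

		pos->item++;
		while (pos->item >= s->batches[pos->batch].nitems)
		{
			if (pos->batch + 1 >= s->nbatches)
			{
				/* park prefetchPos "just after end" of the last batch */
				pos->item = s->batches[pos->batch].nitems;
				return TOY_INVALID_BLOCK;
			}
			pos->batch++;		/* step onto the next leaf page's batch */
			pos->item = 0;
		}

		blk = s->batches[pos->batch].blocks[pos->item];
		if (blk != s->lastBlock)
		{
			s->lastBlock = blk;
			return blk;
		}
		/* same heap block as last time: the scan won't need a new buffer */
	}
}
```

Take the batches {10, 10, 11}, {11, 12}, and {12, 12, 13}. Successive calls return 10, 11, 12, and 13. By then the prefetching position is on the third batch, even though a scan could still be consuming the first. No single leaf page here contributes more than two distinct heap blocks, so a prefetching position confined to the scan's current leaf page could never look more than one block ahead.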
Index scans can have as many as 64 open batches, which testing has
shown to be about the maximum number that can ever be useful. Batches
are maintained in scan order using a simple ring buffer. In the rare
cases where the scan would exceed this quasi-arbitrary limit of 64, the
read stream is temporarily paused using the read stream pausing
mechanism added by commit 38229cb9. Prefetching (via the read stream)
resumes only after the scan position has advanced past its current
batch, and that batch has been freed and removed from the scan's batch
ring buffer. Testing has shown that scans rarely need to hold open more
than about 10 batches to reach the desired I/O prefetch distance.

The heuristic that decides when to begin prefetching delays
initialization of the scan's read stream until the scan must read a
fourth heap page. The same rule applies to index-only scans, so an
index-only scan never creates a read stream when it requires no heap
fetches (or only very few). This happens when all (or almost all) of
the relevant visibility map bits are set.

A new GUC (enable_indexscan_prefetch) controls the use of index
prefetching. The default setting is 'on', so every amgetbatch index
scan that uses an MVCC snapshot is a candidate for prefetching.
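The ring buffer and pause/resume protocol described above can be modeled the same way. This is again a hypothetical sketch rather than the patch's code, and it makes these assumptions:

- The 64-batch limit corresponds to INDEX_SCAN_MAX_BATCHES.
- Batch offsets are tracked in wrapping uint8_t counters. Because 256 is a multiple of 64, slot numbers stay contiguous across the wrap.
- A bool stands in for read_stream_pause() and read_stream_resume().

toy_blkswitch() models the fourth-heap-page rule. As in the patch, it is only meant to be consulted while the scan has no read stream yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_BATCHES 64			/* assumed INDEX_SCAN_MAX_BATCHES */
#define TOY_BLKSWITCH_THRESHOLD 4	/* the "fourth heap page" rule */

typedef struct ToyRing
{
	int			slots[TOY_MAX_BATCHES];	/* payload: batch sequence numbers */
	uint8_t		headBatch;		/* oldest batch (the scan's current batch) */
	uint8_t		nextBatch;		/* offset the next loaded batch will get */
	bool		paused;			/* stand-in for a paused read stream */
} ToyRing;

/* Number of batches currently held open; uint8_t math handles wraparound */
static int
toy_ring_count(const ToyRing *r)
{
	return (uint8_t) (r->nextBatch - r->headBatch);
}

static bool
toy_ring_full(const ToyRing *r)
{
	return toy_ring_count(r) == TOY_MAX_BATCHES;
}

/*
 * Prefetch side: try to append another batch.  When every slot is in use,
 * pause instead (the patch calls read_stream_pause() at this point).
 */
static bool
toy_prefetch_load_batch(ToyRing *r, int batchno)
{
	assert(!r->paused);			/* no prefetching while paused */
	if (toy_ring_full(r))
	{
		r->paused = true;
		return false;
	}
	r->slots[r->nextBatch % TOY_MAX_BATCHES] = batchno;
	r->nextBatch++;				/* wraps from 255 back to 0 */
	return true;
}

/*
 * Scan side: the scan has moved past its head batch, so free that slot and
 * resume prefetching if it was paused (cf. read_stream_resume()).
 */
static int
toy_scan_release_head(ToyRing *r)
{
	int			batchno;

	assert(toy_ring_count(r) > 0);
	batchno = r->slots[r->headBatch % TOY_MAX_BATCHES];
	r->headBatch++;
	if (r->paused)
		r->paused = false;
	return batchno;
}

/* Returns true exactly when the scan switches to its fourth heap block */
static bool
toy_blkswitch(uint16_t *count)
{
	return ++*count == TOY_BLKSWITCH_THRESHOLD;
}
```

In this model only the scan side frees slots. Prefetching therefore stays paused until the scan releases its head batch, which matches the description above of when prefetching resumes.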
Author: Tomas Vondra
Author: Peter Geoghegan
Reviewed-by: Andres Freund
Reviewed-by: Thomas Munro
Discussion: https://postgr.es/m/cf85f46f-b02f-05b2-5248-5000b894ebab@enterprisedb.com
---
 src/include/access/heapam.h                   |  16 +-
 src/include/access/indexbatch.h               | 228 +++++++++-
 src/include/access/relscan.h                  |   7 +
 src/include/optimizer/cost.h                  |   1 +
 src/backend/access/heap/heapam_indexscan.c    | 417 +++++++++++++++++-
 src/backend/access/index/indexbatch.c         |  35 +-
 src/backend/optimizer/path/costsize.c         |   1 +
 src/backend/utils/misc/guc_parameters.dat     |   7 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 doc/src/sgml/config.sgml                      |  16 +
 doc/src/sgml/indexam.sgml                     | 107 ++++-
 doc/src/sgml/tableam.sgml                     |   7 +
 src/test/regress/expected/sysviews.out        |   3 +-
 src/tools/pgindent/typedefs.list              |   1 +
 14 files changed, 825 insertions(+), 22 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 71b6420c9..986b5dbe9 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -135,7 +135,21 @@ typedef struct IndexScanHeapData
/* Plain index scan xs_lastinblock optimization */ bool xs_lastinblock; /* last TID on this block in current batch? */ - uint16 xs_blkswitch_count; /* number of heap blocks fetched */ + /* + * Read stream state for prefetching (only used during amgetbatch scans). + * + * The read stream moves ahead of the scan's current position using its + * own prefetching position (per the tableam_util_prefetchpos_* + * conventions from indexbatch.h). The read stream is allocated early in + * the scan, and reset on rescan (and when the scan direction changes). + */ + bool xs_paused; /* paused until next batch is read? */ + bool xs_prefetching_safe; /* prefetching is safe?
*/ + uint16 xs_blkswitch_count; /* determines when to prefetch */ + + ScanDirection xs_read_stream_dir; /* index scan direction */ + BlockNumber xs_prefetch_block; /* last block returned to xs_read_stream */ + ReadStream *xs_read_stream; /* prefetching read stream */ /* Per-tuple context for padding "name" columns during index-only scans */ MemoryContext xs_itup_cxt; diff --git a/src/include/access/indexbatch.h b/src/include/access/indexbatch.h index 24b531705..d765059e9 100644 --- a/src/include/access/indexbatch.h +++ b/src/include/access/indexbatch.h @@ -195,6 +195,41 @@ index_scan_batch_index_opaque_dyn(IndexScanDesc scan, IndexScanBatch batch) * ---------------------------------------------------------------------------- */ +/* + * Compare two batch ring positions in the given scan direction. + * + * Returns negative if pos1 is behind pos2, 0 if equal, positive if pos1 is + * ahead of pos2. + */ +static inline int +index_scan_pos_cmp(BatchRingItemPos *pos1, BatchRingItemPos *pos2, + ScanDirection direction) +{ + int8 batchdiff; + + Assert(pos1->valid && pos2->valid); + + batchdiff = (int8) (pos1->batch - pos2->batch); + + Assert(batchdiff > -INDEX_SCAN_MAX_BATCHES && + batchdiff < INDEX_SCAN_MAX_BATCHES); + + if (batchdiff != 0) + { + /* Resolve comparison using differing batch offsets */ + return batchdiff; + } + + /* + * Resolve comparison using items[]-wise indexes from caller's positions, + * since both positions point to the same ring buffer batch + */ + if (ScanDirectionIsForward(direction)) + return pos1->item - pos2->item; + else + return pos2->item - pos1->item; +} + /* * Advance position to its next item in the batch. 
* @@ -296,6 +331,7 @@ tableam_util_batchscan_init(IndexScanDesc scan) Assert(scan->indexRelation->rd_indam->amgetbatch != NULL); scan->batchringbuf.scanPos.valid = false; + scan->batchringbuf.prefetchPos.valid = false; scan->batchringbuf.markPos.valid = false; scan->batchringbuf.markBatch = NULL; @@ -345,7 +381,7 @@ tableam_util_scanpos_advance(IndexScanDesc scan, ScanDirection direction, /* * scanPos is valid, so scanBatch must already be loaded in batch ring - * buffer. We rely on that here. + * buffer. We rely on that here (can't do this with prefetchBatch). */ pg_assume(batchringbuf->headBatch == scanPos->batch); @@ -357,9 +393,9 @@ tableam_util_scanpos_advance(IndexScanDesc scan, ScanDirection direction, /* * Fetch the next batch of matching items for the scan (or the first). * - * Called when caller's current batch (passed to us as priorBatch) has no more - * matching items in the given scan direction. Caller passes a NULL - * priorBatch on the first call here for the scan. + * Called when caller's current scanBatch/prefetchBatch (passed to us as + * priorBatch) has no more matching items in the given scan direction. Caller + * passes a NULL priorBatch on the first call here for the scan. * * Returns the next batch to be processed by caller in the given scan * direction, or NULL when there are no more matches in that direction. @@ -368,7 +404,7 @@ tableam_util_scanpos_advance(IndexScanDesc scan, ScanDirection direction, * * We don't free any batches here; that is a separate step performed by * tableam_util_scanpos_nextbatch. Caller also needs to advance their - * position to the start of the returned batch. + * scanPos/prefetchPos position to the start of the returned batch. 
*/ static pg_attribute_always_inline IndexScanBatch tableam_util_fetch_next_batch(IndexScanDesc scan, ScanDirection direction, @@ -482,13 +518,19 @@ tableam_util_fetch_next_batch(IndexScanDesc scan, ScanDirection direction, * now-obsolescent old scanBatch (the ring buffer's head batch), freeing up * its ring buffer slot. (When newScanBatch is the scan's first batch, there * is no old scanBatch for us to release.) + * + * Return value indicates if a previously occupied ring buffer slot was freed. + * A table AM that paused its prefetch mechanism because the ring buffer was + * full (see tableam_util_prefetchpos_advance) can resume it when we return + * true (to indicate to caller that there's now space to store another batch). */ -static pg_attribute_always_inline void +static pg_attribute_always_inline bool tableam_util_scanpos_nextbatch(IndexScanDesc scan, ScanDirection direction, IndexScanBatch newScanBatch) { BatchRingBuffer *batchringbuf = &scan->batchringbuf; BatchRingItemPos *scanPos = &batchringbuf->scanPos; + BatchRingItemPos *prefetchPos = &batchringbuf->prefetchPos; bool releaseOldHeadBatch = scanPos->valid; IndexScanBatch headBatch; @@ -500,7 +542,7 @@ tableam_util_scanpos_nextbatch(IndexScanDesc scan, ScanDirection direction, { /* newScanBatch is the scan's first and only batch */ Assert(batchringbuf->headBatch == scanPos->batch); - return; + return false; } headBatch = index_scan_batch(scan, batchringbuf->headBatch); @@ -511,12 +553,184 @@ tableam_util_scanpos_nextbatch(IndexScanDesc scan, ScanDirection direction, /* free obsolescent head batch (unless it is scan's markBatch) */ tableam_util_release_batch(scan, headBatch); + /* + * If we're about to release the batch that prefetchPos currently points + * to, just invalidate prefetchPos. This keeps prefetchPos from ever + * falling behind scanPos at the batch granularity, which + * tableam_util_prefetchpos_catchup relies on. 
+ */ + if (prefetchPos->valid && + prefetchPos->batch == batchringbuf->headBatch) + prefetchPos->valid = false; + /* Remove the batch from the ring buffer (even if it's markBatch) */ batchringbuf->headBatch++; /* Postconditions for having freed up a ring buffer slot */ + Assert(!prefetchPos->valid || + index_scan_batch_loaded(scan, prefetchPos->batch)); Assert(!index_scan_batch_full(scan)); Assert(batchringbuf->headBatch == scanPos->batch); + + return true; +} + +/* + * Handle initialization of the scan's prefetchPos, when prefetchPos isn't + * yet valid (also handles the prefetchPos < scanPos edge case). + * + * Called at the start of each table AM prefetch callback call. Returns true + * after setting prefetchPos to the scan's current scanPos. That's a special + * case: the prefetch callback should process the very item that the scan is + * on directly (e.g., by returning that item's table block to its read + * stream), rather than reading ahead of the scan. Returns false when + * prefetchPos is ahead of (or equal to) scanPos, in which case the prefetch + * callback picks up from where its last call left off. + */ +static inline bool +tableam_util_prefetchpos_catchup(IndexScanDesc scan, ScanDirection direction) +{ + BatchRingBuffer *batchringbuf = &scan->batchringbuf; + BatchRingItemPos *scanPos = &batchringbuf->scanPos; + BatchRingItemPos *prefetchPos = &batchringbuf->prefetchPos; + + /* + * scanPos must always be valid when prefetching takes place. There has + * to be at least one batch, loaded as the scan's scanBatch. + */ + Assert(index_scan_batch_count(scan) > 0); + Assert(scanPos->valid && index_scan_batch_loaded(scan, scanPos->batch)); + + /* + * prefetchPos can "fall behind" scanPos at the item granularity: the + * prefetch callback only runs on demand, so scanPos can overtake + * prefetchPos whenever the scan consumes items without the callback being + * called (e.g., runs of adjacent matching items whose TIDs all point to + * the same table block). 
We handle that case using exactly the same + * steps as initialization. + * + * prefetchPos can never fall behind scanPos at the batch granularity, + * since tableam_util_scanpos_nextbatch invalidates prefetchPos before + * releasing the batch that prefetchPos points to. There is therefore no + * danger of prefetchPos.batch falling so far behind scanPos.batch that it + * wraps around (and appears to be ahead of scanPos instead of behind it). + */ + if (!prefetchPos->valid || + index_scan_pos_cmp(scanPos, prefetchPos, direction) > 0) + { + *prefetchPos = *scanPos; + return true; + } + + /* Picking up prefetching from where the last callback call left off */ + Assert(index_scan_pos_cmp(scanPos, prefetchPos, direction) <= 0); + return false; +} + +/* + * Result of a tableam_util_prefetchpos_advance call + */ +typedef enum BatchPosAdvanceResult +{ + BATCH_POS_ADVANCED, /* advanced to next item in current batch */ + BATCH_POS_BATCH_ADVANCED, /* advanced to first item of new batch */ + BATCH_POS_DONE, /* no further matching items in direction */ + BATCH_POS_RING_FULL, /* couldn't advance; ring buffer full */ +} BatchPosAdvanceResult; + +/* + * Advance the scan's prefetchPos to the next item that the table AM's + * prefetch callback should consider reading ahead, moving in the given scan + * direction. + * + * On entry, *prefetchBatch must be the batch that prefetchPos points to. + * Advances prefetchPos to the next item within *prefetchBatch when possible + * (returns BATCH_POS_ADVANCED). Otherwise tries to advance to the scan's + * next batch, setting *prefetchBatch to the new batch and positioning + * prefetchPos at its first item in the scan direction (returns + * BATCH_POS_BATCH_ADVANCED). Callers must use the returned result (never + * compare *prefetchBatch against its earlier value) to detect this case; + * batch recycling can reuse the memory of a recently released batch. 
+ * + * Returns BATCH_POS_DONE when there are no further matching items in the + * given scan direction (*prefetchBatch is set to NULL). + * + * Returns BATCH_POS_RING_FULL when the next batch couldn't be loaded because + * all available ring buffer batch slots are currently in use (prefetchPos + * and *prefetchBatch are left unchanged). Caller responds by momentarily + * pausing its read-ahead mechanism; it can be resumed once + * tableam_util_scanpos_nextbatch reports that the scan freed up a slot + * (which'll happen only after scanPos has consumed all remaining items from + * the scan's current scanBatch). + * + * When caller passes throttle=true we likewise decline to advance to the next + * batch and return BATCH_POS_RING_FULL instead. Caller uses this to cap how + * many batches a single read-ahead callback invocation can advance by. + * Advancing within the current batch (BATCH_POS_ADVANCED) ignores throttle, + * so throttling only takes effect at a batch boundary. + */ +static inline BatchPosAdvanceResult +tableam_util_prefetchpos_advance(IndexScanDesc scan, ScanDirection direction, + IndexScanBatch *prefetchBatch, + BatchRingItemPos *prefetchPos, + bool throttle) +{ + if (!index_scan_pos_advance(direction, *prefetchBatch, prefetchPos)) + { + /* + * Ran out of items from prefetchBatch. Try to advance to the scan's + * next batch. + */ + if (unlikely(index_scan_batch_full(scan)) || unlikely(throttle)) + { + /* + * Can't advance prefetchBatch because all available ring buffer + * batch slots are currently in use (or because caller wants us to + * throttle instead of returning another batch). Undo the changes + * we've already made to prefetchPos before returning, leaving it + * in a state that's consistent with the work actually performed + * (various positional state assertions expect this). 
+ */ + if (ScanDirectionIsForward(direction)) + { + Assert(prefetchPos->item == (*prefetchBatch)->lastItem + 1); + prefetchPos->item--; + } + else /* ScanDirectionIsBackward */ + { + Assert(prefetchPos->item == (*prefetchBatch)->firstItem - 1); + prefetchPos->item++; + } + + return BATCH_POS_RING_FULL; + } + + /* We have a free ring buffer slot to fit another batch */ + *prefetchBatch = tableam_util_fetch_next_batch(scan, direction, + *prefetchBatch, + prefetchPos); + if (*prefetchBatch == NULL) + { + /* + * Deliberately leave prefetchPos in "just-before-start" or + * "just-after-end" position + */ + return BATCH_POS_DONE; + } + + /* + * Have a new prefetchBatch. + * + * tableam_util_fetch_next_batch already appended the new batch to the + * ring buffer for us, but we must advance prefetchPos ourselves. + * Position prefetchPos to the start of the new batch. + */ + index_scan_pos_startbatch(direction, *prefetchBatch, prefetchPos); + + return BATCH_POS_BATCH_ADVANCED; + } + + return BATCH_POS_ADVANCED; } /* diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h index 3a1e616d3..18c35a6f4 100644 --- a/src/include/access/relscan.h +++ b/src/include/access/relscan.h @@ -186,6 +186,10 @@ typedef struct IndexScanBatchData * This allows table AMs to avoid redundant amgetbatch calls with the same * priorbatch -- the index AM might need to read additional index pages to * determine there are no more matching items beyond caller's priorbatch. + * In particular, during prefetching the read stream callback discovers + * the end-of-scan via prefetchBatch. tableam_util_fetch_next_batch() + * checks these flags so that the scan side doesn't repeat the same + * amgetbatch call when it later reaches that batch as scanBatch. */ bool knownEndBackward; bool knownEndForward; @@ -236,11 +240,14 @@ typedef struct IndexScanBatchData *IndexScanBatch; * current read position by _multiple_ batches/index pages. 
The further out * the table AM reads ahead like this, the further it can see into the future. * That way the table AM is able to reorder work as aggressively as desired. + * Index scans sometimes need to read ahead by several dozen batches in order + * to maintain an optimal I/O prefetch distance (for reading table blocks). */ typedef struct BatchRingBuffer { /* current positions in IndexScanDescData.batchbuf[] for scan */ BatchRingItemPos scanPos; /* scan's read position */ + BatchRingItemPos prefetchPos; /* prefetching position */ BatchRingItemPos markPos; /* mark/restore position */ /* markPos's batch (not in ring buffer when markBatch != scanBatch) */ diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h index f2fd5d315..419300a6b 100644 --- a/src/include/optimizer/cost.h +++ b/src/include/optimizer/cost.h @@ -52,6 +52,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather; extern PGDLLIMPORT bool enable_seqscan; extern PGDLLIMPORT bool enable_indexscan; extern PGDLLIMPORT bool enable_indexonlyscan; +extern PGDLLIMPORT bool enable_indexscan_prefetch; extern PGDLLIMPORT bool enable_bitmapscan; extern PGDLLIMPORT bool enable_tidscan; extern PGDLLIMPORT bool enable_sort; diff --git a/src/backend/access/heap/heapam_indexscan.c b/src/backend/access/heap/heapam_indexscan.c index e9b1ea851..5ea3c0cca 100644 --- a/src/backend/access/heap/heapam_indexscan.c +++ b/src/backend/access/heap/heapam_indexscan.c @@ -19,11 +19,24 @@ #include "access/indexbatch.h" #include "access/relscan.h" #include "access/visibilitymap.h" +#include "optimizer/cost.h" #include "storage/predicate.h" #include "utils/builtins.h" #include "utils/memutils.h" #include "utils/pgstat_internal.h" +/* + * We avoid creating a read stream during very selective scans that require + * few heap fetches, where the overhead of creating a read stream is unlikely + * to pay for itself + */ +#define INDEX_PREFETCH_BLKSWITCH_THRESHOLD 4 + +/* + * Maximum number of batches that a single
heapam_index_prefetch_next_block + * call may advance prefetchBatch by without returning a heap block + */ +#define INDEX_PREFETCH_MAX_BATCH_ADVANCES 1 /* * heapam's per-batch private opaque area (only used during index-only scans). @@ -91,6 +104,14 @@ static void heapam_index_batch_pos_visibility(IndexScanDesc scan, IndexScanBatch batch, HeapBatchData *hbatch, BatchRingItemPos *pos); +static pg_noinline void heapam_index_dirchange_reset(IndexScanDesc scan, + IndexScanHeapData *hscan, + ScanDirection direction); +static pg_attribute_always_inline void heapam_index_consider_prefetching(IndexScanDesc scan, + IndexScanHeapData *hscan); +static BlockNumber heapam_index_prefetch_next_block(ReadStream *stream, + void *callback_private_data, + void *per_buffer_data); /* * Simple, single-shot TID lookup for constraint enforcement code (unique @@ -157,6 +178,10 @@ heapam_index_scan_begin(IndexScanDesc scan, uint32 flags) /* xs_lastinblock optimization state */ Assert(!hscan->xs_lastinblock); + /* Read stream state (other fields initialized by callback) */ + Assert(hscan->xs_read_stream_dir == NoMovementScanDirection); + Assert(hscan->xs_read_stream == NULL); + /* Resolve which xs_getnext_slot implementation to use for this scan */ if (scan->indexRelation->rd_indam->amgetbatch != NULL) { @@ -180,6 +205,16 @@ heapam_index_scan_begin(IndexScanDesc scan, uint32 flags) /* Set up scan's batch ring buffer */ tableam_util_batchscan_init(scan); + + /* + * We can only safely prefetch during scans where we're able to + * unguard (unpin) each batch's buffers right away (MVCC scans). We + * are not prepared to sensibly limit the total number of buffer pins + * held. The read stream handles all pin resource management for us, + * and knows nothing about pins held on index pages/within batches. + * (It's also convenient for enable_indexscan_prefetch to gate prefetching here.)
+ */ + hscan->xs_prefetching_safe = scan->MVCCScan && enable_indexscan_prefetch; } else { @@ -188,6 +223,9 @@ heapam_index_scan_begin(IndexScanDesc scan, uint32 flags) scan->xs_getnext_slot = heapam_index_only_tuple_getnext_slot; else scan->xs_getnext_slot = heapam_index_plain_tuple_getnext_slot; + + /* Prefetching isn't supported during amgettuple scans */ + hscan->xs_prefetching_safe = false; } /* @@ -239,6 +277,15 @@ heapam_index_scan_rescan(IndexScanDesc scan) /* Heap fetches from the last rescan don't count towards this limit */ hscan->xs_blkswitch_count = 0; + /* Defensively do an unconditional read stream direction reset */ + hscan->xs_read_stream_dir = NoMovementScanDirection; + + if (hscan->xs_read_stream) + { + hscan->xs_paused = false; + read_stream_reset(hscan->xs_read_stream); + } + /* Reset batch ring buffer state */ if (scan->usebatchring) tableam_util_batchscan_reset(scan, false); @@ -263,6 +310,9 @@ heapam_index_scan_end(IndexScanDesc scan) if (BufferIsValid(hscan->xs_vmbuffer)) ReleaseBuffer(hscan->xs_vmbuffer); + if (hscan->xs_read_stream) + read_stream_end(hscan->xs_read_stream); + /* Free the index-only scan name-column context, if any */ if (hscan->xs_itup_cxt) MemoryContextDelete(hscan->xs_itup_cxt); @@ -292,9 +342,17 @@ heapam_index_scan_markpos(IndexScanDesc scan) void heapam_index_scan_restrpos(IndexScanDesc scan) { + IndexScanHeapData *hscan = (IndexScanHeapData *) scan->xs_table_opaque; + Assert(scan->usebatchring); Assert(scan->indexRelation->rd_indam->amcanmarkpos); + if (hscan->xs_read_stream) + { + hscan->xs_paused = false; + read_stream_reset(hscan->xs_read_stream); + } + tableam_util_batchscan_restore_pos(scan); } @@ -627,6 +685,15 @@ heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction, bool all_visible = false; ItemPointer tid = NULL; + /* + * Changing the scan direction mid-scan requires an MVCC snapshot: with + * any other snapshot type, more than one member of a HOT chain can be + * visible, and resuming a
partially-returned chain only works in the + * forward direction. All non-MVCC callers scan in one fixed direction. + */ + Assert(scan->MVCCScan || !amgetbatch || + hscan->xs_read_stream_dir == NoMovementScanDirection || + hscan->xs_read_stream_dir == direction); Assert(TransactionIdIsValid(RecentXmin)); Assert(index_only || scan->xs_visited_pages_limit == 0); @@ -787,9 +854,13 @@ heapam_index_heap_fetch(IndexScanDesc scan, IndexScanHeapData *hscan, hscan->xs_blk = ItemPointerGetBlockNumber(tid); /* - * We're switching to a new heap block, so count it + * We're switching to a new heap block, so count it; once enough + * distinct blocks are fetched, start prefetching (though only if we + * haven't already) */ - hscan->xs_blkswitch_count++; + if (hscan->xs_read_stream == NULL && + ++hscan->xs_blkswitch_count == INDEX_PREFETCH_BLKSWITCH_THRESHOLD) + heapam_index_consider_prefetching(scan, hscan); /* * Drop the xs_blk pin independently held on by slot (if any) now, @@ -803,7 +874,14 @@ heapam_index_heap_fetch(IndexScanDesc scan, IndexScanHeapData *hscan, if (BufferIsValid(hscan->xs_cbuf)) ReleaseBuffer(hscan->xs_cbuf); - hscan->xs_cbuf = ReadBuffer(rel, hscan->xs_blk); + /* + * When using a read stream, the stream will already know which block + * number comes next (though an assertion will verify a match below) + */ + if (hscan->xs_read_stream) + hscan->xs_cbuf = read_stream_next_buffer(hscan->xs_read_stream, NULL); + else + hscan->xs_cbuf = ReadBuffer(rel, hscan->xs_blk); /* * Prune page when it is pinned for the first time @@ -930,6 +1008,12 @@ heapam_index_getnext_scanbatch_pos(IndexScanDesc scan, IndexScanHeapData *hscan, Assert(all_visible == NULL || scan->xs_want_itup); + /* Handle resetting the read stream when scan direction changes */ + if (hscan->xs_read_stream_dir == NoMovementScanDirection) + hscan->xs_read_stream_dir = direction; /* first call */ + else if (unlikely(hscan->xs_read_stream_dir != direction)) + heapam_index_dirchange_reset(scan, hscan, 
direction); + /* * Attempt to increment the position of any existing loaded scanBatch * (always fails on first call here for the scan) @@ -973,7 +1057,25 @@ heapam_index_getnext_scanbatch_pos(IndexScanDesc scan, IndexScanHeapData *hscan, * also remove the old head batch/scanBatch from the batch ring buffer, * and release the underlying batch storage. */ - tableam_util_scanpos_nextbatch(scan, direction, scanBatch); + if (tableam_util_scanpos_nextbatch(scan, direction, scanBatch)) + { + /* A previously occupied ring buffer slot was freed */ + if (unlikely(hscan->xs_paused)) + { + /* + * heapam_index_prefetch_next_block paused the scan's read stream + * due to our running out of batch slots (or it "throttled" the + * read stream to avoid reading too far ahead in the index). + * + * Now that the scanBatch that was current when we paused has been + * removed from the batch ring buffer, we must resume prefetching. + */ + read_stream_resume(hscan->xs_read_stream); + hscan->xs_paused = false; + } + } + + Assert(!hscan->xs_paused); /* * Set scanPos to first item for newly loaded scanBatch; return the new @@ -1099,6 +1201,13 @@ heapam_index_return_scanpos_tid(IndexScanDesc scan, IndexScanHeapData *hscan, * (important for inner index scans of anti-joins and semi-joins), and the * need to unguard batches promptly. * + * In no event will the scan be allowed to guard more than one batch at a + * time. The primary reason for this restriction is to avoid unintended + * interactions with the read stream, which has its own strategy for keeping + * the number of pins held by the backend under control. (Unguarding via + * the amunguardbatch callback often means releasing a buffer pin on an + * index page, which counts against the same shared pin limit.) + * * Once we've resolved visibility for all items in a batch, we can safely * unguard it by calling amunguardbatch. 
This is safe with respect to * concurrent VACUUM because the batch's guard (typically a buffer pin on the @@ -1247,3 +1356,303 @@ heapam_index_batch_pos_visibility(IndexScanDesc scan, ScanDirection direction, else hscan->xs_vm_items = scan->maxitemsbatch; } + +/* + * Handle a change in index scan direction (at the tuple granularity). + * + * Resets the read stream, since we can't rely on scanPos continuing to agree + * with the blocks that read stream already consumed using prefetchPos. + * + * Note: iff the scan _continues_ in this new direction, and actually steps + * off scanBatch to an earlier index page, tableam_util_fetch_next_batch will + * deal with it. But that might never happen; the scan might yet change + * direction again (or just end before returning more items). + */ +static pg_noinline void +heapam_index_dirchange_reset(IndexScanDesc scan, IndexScanHeapData *hscan, + ScanDirection direction) +{ + /* Reset read stream state */ + scan->batchringbuf.prefetchPos.valid = false; + hscan->xs_paused = false; + hscan->xs_read_stream_dir = direction; + hscan->xs_blkswitch_count = 0; + + /* Reset read stream itself */ + if (hscan->xs_read_stream) + read_stream_reset(hscan->xs_read_stream); +} + +/* + * Start a read stream for heap block prefetching during an index scan + */ +static pg_attribute_always_inline void +heapam_index_consider_prefetching(IndexScanDesc scan, + IndexScanHeapData *hscan) +{ + Assert(hscan->xs_blk != InvalidBlockNumber); + Assert(!hscan->xs_read_stream); + Assert(!scan->batchringbuf.prefetchPos.valid); + + if (!hscan->xs_prefetching_safe) + return; + + hscan->xs_read_stream = + read_stream_begin_relation(READ_STREAM_DEFAULT, NULL, + scan->heapRelation, MAIN_FORKNUM, + heapam_index_prefetch_next_block, scan, 0); +} + +/* + * Return the next block to the read stream when performing index prefetching. + * + * The initial batch is always loaded by heapam_index_getnext_scanbatch_pos. 
+ * We don't get called until the first read_stream_next_buffer call, when a + * heap block is requested from the scan's stream for the first time. + * + * The position of the read stream is stored in prefetchPos, which typically + * stays ahead of scanPos (the scan's read position). When we return, we + * always leave scanPos <= prefetchPos. + */ +static BlockNumber +heapam_index_prefetch_next_block(ReadStream *stream, + void *callback_private_data, + void *per_buffer_data) +{ + IndexScanDesc scan = (IndexScanDesc) callback_private_data; + IndexScanHeapData *hscan = (IndexScanHeapData *) scan->xs_table_opaque; + BatchRingBuffer *batchringbuf = &scan->batchringbuf; + BatchRingItemPos *scanPos PG_USED_FOR_ASSERTS_ONLY = &batchringbuf->scanPos; + BatchRingItemPos *prefetchPos = &batchringbuf->prefetchPos; + ScanDirection direction = hscan->xs_read_stream_dir; + IndexScanBatch prefetchBatch; + HeapBatchData *hbatch = NULL; + int nbatchadvances_this_call = 0; + + Assert(!hscan->xs_paused && hscan->xs_prefetching_safe); + Assert(direction != NoMovementScanDirection); + + /* + * Handle initialization of prefetchPos: set it from the scan's current + * scanPos when it isn't already (validly) ahead of scanPos. This is + * required during the first call here for the scan (and in certain edge + * cases). See tableam_util_prefetchpos_catchup for full details. + */ + if (tableam_util_prefetchpos_catchup(scan, direction)) + { + BatchMatchingItem *item; + + /* prefetchPos has been initialized from scanPos for us */ + prefetchBatch = index_scan_batch(scan, prefetchPos->batch); + + /* + * We must avoid keeping any batch guarded for more than an instant, + * to avoid undesirable interactions with the scan's read stream. See + * comment and assertion at the top of the loop below. + */ + if (scan->xs_want_itup) + { + /* + * Index-only scan batches aren't unguarded immediately. Deal + * with that. 
+ */ + hbatch = index_scan_batch_table_area(scan, prefetchBatch); + + /* + * The requested item can't be all-visible according to its + * batch's cached visibility information; if it were, we'd never + * have been called in the first place + */ + Assert(HEAP_BATCH_VIS_CACHED(hbatch, prefetchPos->item) && + !hbatch->batchvis[prefetchPos->item]); + + /* + * Load any visibility info not already set through scanBatch, so + * that scanBatch/prefetchBatch is unguarded right away + */ + hscan->xs_vm_items = scan->maxitemsbatch; /* must unguard */ + if (prefetchBatch->isGuarded) + heapam_index_batch_pos_visibility(scan, direction, + prefetchBatch, hbatch, + prefetchPos); + + /* + * Later calls to heapam_index_batch_pos_visibility will always + * unguard batches right away, which we rely on in the loop below + */ + } + + Assert(!prefetchBatch->isGuarded); + + item = &prefetchBatch->items[prefetchPos->item]; + hscan->xs_prefetch_block = ItemPointerGetBlockNumber(&item->tableTid); + + /* + * Special case: when we return, prefetchPos won't be ahead of scanPos + * (it'll just be equal to scanPos). We're merely fetching through a + * read stream; true prefetching hasn't really started yet. + */ + Assert(index_scan_pos_cmp(scanPos, prefetchPos, direction) == 0); + + return hscan->xs_prefetch_block; + } + + /* + * We're picking up prefetching from where the last call here left off + */ + Assert(index_scan_pos_cmp(scanPos, prefetchPos, direction) <= 0); + prefetchBatch = index_scan_batch(scan, prefetchPos->batch); + if (scan->xs_want_itup) + hbatch = index_scan_batch_table_area(scan, prefetchBatch); + + /* + * Assert in passing that xs_prefetch_block matches the last item we + * returned. + * + * Note: we don't actually need a xs_prefetch_block field at all; we could + * just take the last block we returned from prefetchPos directly instead. + * But maintaining xs_prefetch_block explicitly is slightly more robust. 
+ * It gives us a way to make sure that the last call here left prefetchPos + * in a consistent state (e.g., when the read stream had to be paused). + */ +#ifdef USE_ASSERT_CHECKING + { + BatchMatchingItem *lastitem = &prefetchBatch->items[prefetchPos->item]; + BlockNumber last_block = ItemPointerGetBlockNumber(&lastitem->tableTid); + + /* + * Note: when a previous call paused the read stream, prefetchPos + * might point to an item whose block (last_block) doesn't match + * xs_prefetch_block. This can only happen when the item was never + * returned because it is all-visible. + */ + Assert(last_block == hscan->xs_prefetch_block || + (hbatch && HEAP_BATCH_VIS_CACHED(hbatch, prefetchPos->item) && + hbatch->batchvis[prefetchPos->item])); + } +#endif + + for (;;) + { + BatchMatchingItem *item; + BlockNumber prefetch_block; + bool throttle; + + /* + * We never call amgetbatch without immediately unguarding the batch + * once prefetching begins. That way index AMs won't hold onto any + * "extra" index page pins needed as TID recycling interlock guards. + * + * This is defensive. The read stream tries to be careful about not + * pinning too many buffers, and that's harder to do reliably if there + * are variable numbers of pins taken without such care. + */ + Assert(!prefetchBatch->isGuarded); + + /* + * Before advancing prefetchPos, consider how many times this call has + * already advanced prefetchBatch. Repeated advances are possible during + * index-only scans with long runs of batches containing only items + * that are all-visible (they're also possible during plain index scans + * with unusual batch layouts, though that's much less common). + * + * Once the count reaches INDEX_PREFETCH_MAX_BATCH_ADVANCES, we + * throttle prefetching, which pauses the read stream and gives scanPos + * the opportunity to return the next item to the scan. This caps how + * many batches a single call here can advance prefetchBatch by without + * producing even one additional heap block for the read stream.
+ */ + throttle = nbatchadvances_this_call >= INDEX_PREFETCH_MAX_BATCH_ADVANCES; + + /* Increment prefetchPos to determine the next item to prefetch */ + switch (tableam_util_prefetchpos_advance(scan, direction, + &prefetchBatch, prefetchPos, + throttle)) + { + case BATCH_POS_ADVANCED: + /* Advanced to next item in current/previous prefetchBatch */ + break; + case BATCH_POS_BATCH_ADVANCED: + /* Advanced to first item in new prefetchBatch */ + nbatchadvances_this_call++; + if (hbatch) + { + /* + * Extra heapam-specific step: bulk-load visibility info + * up front to unguard batch immediately + */ + Assert(scan->xs_want_itup); + + hbatch = index_scan_batch_table_area(scan, prefetchBatch); + + Assert(hscan->xs_vm_items == scan->maxitemsbatch); + if (prefetchBatch->isGuarded) + heapam_index_batch_pos_visibility(scan, direction, + prefetchBatch, + hbatch, prefetchPos); + } + break; + case BATCH_POS_DONE: + /* No more batches in this scan direction */ + return InvalidBlockNumber; + case BATCH_POS_RING_FULL: + + /* + * Edge case: Ran out of items from prefetchBatch, but can't + * advance to the scan's next batch right now (all available + * batchringbuf batch slots are currently in use). This also + * happens when we deliberately throttle prefetching. + * + * Deal with this by momentarily pausing the read stream. + * heapam_index_getnext_scanbatch_pos will resume the read + * stream later, though only after scanPos has consumed all + * remaining items from scanBatch (at which point the current + * head batch will be freed, making a slot available for + * reuse). + */ + hscan->xs_paused = true; + return read_stream_pause(stream); + } + + /* + * prefetchPos now points to the next item whose TID's heap block + * number might need to be prefetched. + * + * scanPos must be < prefetchPos when we return from this loop path.
+ */ + Assert(index_scan_pos_cmp(scanPos, prefetchPos, direction) < 0); + + if (hbatch) + { + Assert(scan->xs_want_itup); + Assert(HEAP_BATCH_VIS_CACHED(hbatch, prefetchPos->item)); + + if (hbatch->batchvis[prefetchPos->item]) + { + /* item is known to be all-visible -- don't prefetch */ + continue; + } + } + + item = &prefetchBatch->items[prefetchPos->item]; + prefetch_block = ItemPointerGetBlockNumber(&item->tableTid); + + if (prefetch_block == hscan->xs_prefetch_block) + { + /* + * prefetch_block matches the heap block number we returned last + * (xs_prefetch_block); we must never return the same block twice + * in succession. + */ + continue; + } + + /* We have a new heap block number to return to the read stream */ + hscan->xs_prefetch_block = prefetch_block; + return prefetch_block; + } + + pg_unreachable(); + + return InvalidBlockNumber; +} diff --git a/src/backend/access/index/indexbatch.c b/src/backend/access/index/indexbatch.c index 2e2ccf6a9..dce2b2a55 100644 --- a/src/backend/access/index/indexbatch.c +++ b/src/backend/access/index/indexbatch.c @@ -5,15 +5,21 @@ * * This module provides the core infrastructure for batch-based index scans, * which allow index AMs to return multiple matching TIDs per page in a single - * call. The batch ring buffer is owned by the table AM. + * call. The batch ring buffer is owned by the table AM, typically maintained + * alongside a read stream used for prefetching table blocks. * - * The ring buffer loads batches in index key space/index scan order. + * The ring buffer loads batches in index key space/index scan order. This + * allows the table AM to maintain an adequate prefetch distance: prefetching + * is thereby able to request table blocks referenced by index pages that are + * well ahead of the current scan position's index page. * * Most functions here are table AM utilities (tableam_util_*), called by * table AMs during amgetbatch index scans.
These manage the batch ring * buffer's lifecycle and positional state, and help with certain aspects of * resource management. The table AM uses scanPos to return items from - * batches returned by amgetbatch. + * batches returned by amgetbatch. Table AMs that support I/O prefetching of + * table blocks during index scans use prefetchPos to request table blocks + * well ahead of those that are of immediate interest to scanPos. * * There are also some index AM utilities (indexam_util_*), called by index * AMs that implement the amgetbatch interface, to help manage resources like @@ -104,6 +110,7 @@ tableam_util_batchscan_reset(IndexScanDesc scan, bool endscan) bool markBatchFreed = false; batchringbuf->scanPos.valid = false; + batchringbuf->prefetchPos.valid = false; batchringbuf->markPos.valid = false; for (uint8 i = batchringbuf->headBatch; i != batchringbuf->nextBatch; i++) @@ -215,7 +222,12 @@ tableam_util_batchscan_mark_pos(IndexScanDesc scan) * the current scanBatch when needed. * * We just discard all batches (other than markBatch/restored scanBatch), - * except when markBatch is already the scan's current scanBatch. + * except when markBatch is already the scan's current scanBatch. We always + * invalidate prefetchPos. The table AM's prefetching state (e.g., its read + * stream) is reset by the caller (which calls this function as it resets that + * state). This approach keeps things simple for table AMs: most code that + * deals with batches is thereby able to assume that the common case where + * scan direction never changes is the only case. * * Note: This relies on the assumption that we already have a valid scanPos. * Table AMs should only call tableam_util_batchscan_reset from within their @@ -242,6 +254,14 @@ tableam_util_batchscan_restore_pos(IndexScanDesc scan) Assert(markPos->item >= markBatch->firstItem && markPos->item <= markBatch->lastItem); + /* + * Restoring a mark always requires stopping prefetching. 
This is similar + * to the handling table AMs implement to deal with a tuple-level change + * in the scan's direction. The read stream must have already been reset + * by the table AM caller. + */ + batchringbuf->prefetchPos.valid = false; + + if (scanBatch == markBatch) { /* markBatch is already scanBatch; needn't change batchringbuf */ @@ -312,6 +332,13 @@ tableam_util_batchscan_restore_pos(IndexScanDesc scan) * to determine which batch comes next in the new scan direction. This * approach isn't particularly efficient, but it works well enough for what * ought to be a relatively rare occurrence. + * + * Caller must have reset the scan's read stream before calling here. That + * needs to happen as soon as the scan requests a tuple in the direction + * opposite to its current one. We only deal with the case where the scan + * backs up by enough items to cross a batch boundary. There's no need to + * call here when the scan resumes in its original direction (or ends) + * before crossing a boundary.
*/ void tableam_util_scanbatch_dirchange(IndexScanDesc scan) diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c index 1c575e56f..6fcb815f7 100644 --- a/src/backend/optimizer/path/costsize.c +++ b/src/backend/optimizer/path/costsize.c @@ -146,6 +146,7 @@ int max_parallel_workers_per_gather = 2; bool enable_seqscan = true; bool enable_indexscan = true; bool enable_indexonlyscan = true; +bool enable_indexscan_prefetch = true; bool enable_bitmapscan = true; bool enable_tidscan = true; bool enable_sort = true; diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat index afaa058b0..ace56f7a8 100644 --- a/src/backend/utils/misc/guc_parameters.dat +++ b/src/backend/utils/misc/guc_parameters.dat @@ -941,6 +941,13 @@ boot_val => 'true', }, +{ name => 'enable_indexscan_prefetch', type => 'bool', context => 'PGC_USERSET', group => 'QUERY_TUNING_METHOD', + short_desc => 'Enables prefetching for index scans and index-only scans.', + flags => 'GUC_EXPLAIN', + variable => 'enable_indexscan_prefetch', + boot_val => 'true', +}, + { name => 'enable_material', type => 'bool', context => 'PGC_USERSET', group => 'QUERY_TUNING_METHOD', short_desc => 'Enables the planner\'s use of materialization.', flags => 'GUC_EXPLAIN', diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index ac38cddaa..8705dd5f3 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -431,6 +431,7 @@ #enable_incremental_sort = on #enable_indexscan = on #enable_indexonlyscan = on +#enable_indexscan_prefetch = on #enable_material = on #enable_memoize = on #enable_mergejoin = on diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index fa566c9e5..f7dc013a2 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -5938,6 +5938,22 @@ ANY num_sync ( + enable_indexscan_prefetch (boolean) + + 
enable_indexscan_prefetch configuration parameter + + + + + Enables or disables I/O prefetching during the execution of index + scans and index-only scans. Prefetching can improve performance by + reading table pages before they are needed during these + scans. The default is on. + + + + enable_material (boolean) diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml index 490431f70..7da80d16a 100644 --- a/doc/src/sgml/indexam.sgml +++ b/doc/src/sgml/indexam.sgml @@ -883,9 +883,11 @@ amgetbatch (IndexScanDesc scan, time (so they can drive a cursor), as opposed to a bitmap scan (amgetbitmap), which returns all matches at once. Where amgettuple returns one matching entry per call, - amgetbatch returns them in batches. By returning all - matching index entries from a single index page together, the table AM gains - visibility into which table blocks will be needed in the near future. + amgetbatch returns them in batches. This enables the + table access method to optimize table block access patterns and perform + I/O prefetching: by returning matching index entries in batches (typically + all matches from a single index page), the table AM can read ahead through + the index, identify which table blocks will be needed, and prefetch them. @@ -1052,7 +1054,9 @@ amunguardbatch (IndexScanDesc scan, be sure to free the pins at an opportune point (at a minimum whenever amendscan is called, and typically when amrescan is called). It must also keep the number of - retained pins fixed and small. + retained pins fixed and small, to avoid exhausting the backend's buffer + pin limit (which is shared with the table AM's read stream for index scan + prefetching). @@ -1597,6 +1601,66 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype); or vice versa, if its internal implementation is unsuited to one API or the other. + + Table AM Considerations for Batch Scanning + + + This section is primarily relevant to table access + method authors.
+ + + + When an index scan uses the amgetbatch interface, the + table AM has sole control over the IndexScanDesc's + batchringbuf, including creating, resetting, + and ending the batch ring buffer within the appropriate table AM + callbacks, and managing positional state and TID recycling interlocking + (that is, determining when to unguard each batch, which will typically + release an index page buffer pin associated with the batch). Index access + methods should not access or manipulate these fields. + src/include/access/indexbatch.h provides the + tableam_util_* utility functions that table AMs use + to manage the ring buffer and its positional state. See the + src/backend/access/heap/heapam_indexscan.c + implementation for a reference example. + + + + The scanPos field within + batchringbuf tracks which batch, and which item + within that batch, will next be returned to the executor. The table AM must advance + scanPos as tuples are returned by + table_index_getnext_slot (using + tableam_util_scanpos_advance, plus + tableam_util_scanpos_nextbatch when crossing batch + boundaries), and must also modify this field when restoring a saved mark. + + + + The prefetchPos field tracks the position used + for I/O prefetching. It is managed within a read stream callback (using + tableam_util_prefetchpos_catchup and + tableam_util_prefetchpos_advance), allowing + the table AM to prefetch table blocks pointed to by items that are well + ahead of the current scan position. Initially + prefetchPos is set to + scanPos, but as the read stream ramps up it can + get far ahead, spanning multiple index pages if necessary to + maintain an optimal I/O prefetch distance for table block reads. A major + goal of the amgetbatch interface is to allow the + table AM to prefetch without being limited to items from the current + scanPos batch's index leaf page.
+ + + + For details on the TID recycling interlock during batch scans, including + the batchImmediateUnguard policy and the + amunguardbatch callback, see + . + + + + @@ -1702,7 +1766,40 @@ amtranslatecmptype (CompareType cmptype, Oid opfamily, Oid opcintype); immediately after scanning the corresponding index entry. This is expensive for a number of reasons. The amgetbatch interface, by contrast, was designed to - allow scans to be asynchronous. + allow scans to be asynchronous: by collecting batches of + TIDs from multiple index pages, the table AM can prefetch the corresponding + table blocks well ahead of the current scan position (using asynchronous + I/O when available), allowing a more efficient heap access pattern. Not + all scans end up being asynchronous in practice, but the interface is + designed to allow it. Per the above analysis, we must use the synchronous + approach for non-MVCC-compliant snapshots (even when using the + amgetbatch interface), but an asynchronous scan is + workable for plain index scans that use an MVCC snapshot. + + + + Because the table AM reads multiple index leaf pages ahead via + amgetbatch to facilitate this prefetching, a non-MVCC + scan would have to hold the TID recycling interlock across the entire + read-ahead window, since it has no heap-visibility backstop to fall back on. + That is impractical, so I/O prefetching with + amgetbatch is only possible when an MVCC-compliant + snapshot is in use. + + + + With an MVCC snapshot, a plain index scan drops each batch's interlock + immediately, since it always visits the heap page, where the snapshot + rejects any recycled TID's new occupant. An index-only scan may instead + skip the heap and consult the visibility map, so the table AM holds the + batch's interlock pin until it has copied that batch's visibility + information out of the visibility map, and then drops it. 
Either way, the + scan never holds more than one such interlock pin at a time, whether or not + prefetching is active — so in terms of pins held, an index-only scan + behaves much like a plain index scan. That single + extra pin is taken and released by the scan itself, outside the prefetching + read stream's own pin management; bounding it to one pin is what keeps it + from disturbing how the read stream budgets its buffer pins. diff --git a/doc/src/sgml/tableam.sgml b/doc/src/sgml/tableam.sgml index 9ccf5b739..54b5ba2dc 100644 --- a/doc/src/sgml/tableam.sgml +++ b/doc/src/sgml/tableam.sgml @@ -129,6 +129,13 @@ my_tableam_handler(PG_FUNCTION_ARGS) optional), the block number needs to provide locality. + + Table access methods must support index scans that are driven by index + access methods implementing the amgetbatch interface. + See for details on consuming + amgetbatch batches and managing the scan's position. + + For crash safety, an AM can use postgres' WAL, or a custom implementation. diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out index 132b56a58..32bc3dd3e 100644 --- a/src/test/regress/expected/sysviews.out +++ b/src/test/regress/expected/sysviews.out @@ -166,6 +166,7 @@ select name, setting from pg_settings where name like 'enable%'; enable_incremental_sort | on enable_indexonlyscan | on enable_indexscan | on + enable_indexscan_prefetch | on enable_material | on enable_memoize | on enable_mergejoin | on @@ -180,7 +181,7 @@ select name, setting from pg_settings where name like 'enable%'; enable_seqscan | on enable_sort | on enable_tidscan | on -(25 rows) +(26 rows) -- There are always wait event descriptions for various types. InjectionPoint -- may be present or absent, depending on history since last postmaster start. 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list index cbfcde303..191ce1d7c 100644 --- a/src/tools/pgindent/typedefs.list +++ b/src/tools/pgindent/typedefs.list @@ -266,6 +266,7 @@ BaseBackupTargetHandle BaseBackupTargetType BatchMVCCState BatchMatchingItem +BatchPosAdvanceResult BatchRingBuffer BatchRingItemPos BeginDirectModify_function -- 2.53.0
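The prefetch callback's core rules can be illustrated with a small standalone C model: skip items whose cached visibility bit says they are all-visible, never hand the read stream the same heap block twice in succession, and pause (rather than fail) when every batch ring slot is in use or when a single call has already advanced through too many batches. This is a hypothetical sketch under simplifying assumptions, not the patch's code. Every name in it (SketchScan, sketch_next_block, SKETCH_PAUSE and so on) is invented; batches come from a fixed in-memory array instead of amgetbatch; the ring holds 4 batches rather than 64; the throttle limit of 2 is an arbitrary stand-in for INDEX_PREFETCH_MAX_BATCH_ADVANCES; and returning SKETCH_PAUSE stands in for returning read_stream_pause(). The initial catch-up of prefetchPos from scanPos is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model; none of these names exist in PostgreSQL */
#define SKETCH_RING_SIZE 4          /* the patch allows up to 64 open batches */
#define SKETCH_MAX_ITEMS 8
#define SKETCH_MAX_BATCH_ADVANCES 2 /* stand-in for INDEX_PREFETCH_MAX_BATCH_ADVANCES */

typedef uint32_t SketchBlock;
#define SKETCH_DONE  ((SketchBlock) 0xFFFFFFFF) /* like InvalidBlockNumber */
#define SKETCH_PAUSE ((SketchBlock) 0xFFFFFFFE) /* like read_stream_pause() */

typedef struct SketchBatch
{
    int         nitems;
    SketchBlock blocks[SKETCH_MAX_ITEMS];     /* heap block of each matching TID */
    bool        allvisible[SKETCH_MAX_ITEMS]; /* cached VM bit for each item */
} SketchBatch;

typedef struct SketchScan
{
    const SketchBatch *pages;   /* one batch per index leaf page (assumed) */
    int         npages;
    int         nloaded;        /* batches loaded into the ring so far */
    SketchBatch ring[SKETCH_RING_SIZE];
    unsigned    head;           /* oldest batch still held open */
    unsigned    next;           /* one past the newest loaded batch */
    unsigned    pf_batch;       /* prefetch position: batch */
    int         pf_item;        /* prefetch position: item (-1 = before first) */
    SketchBlock last_block;     /* last block handed to the read stream */
} SketchScan;

static void
sketch_init(SketchScan *scan, const SketchBatch *pages, int npages)
{
    assert(npages > 0);
    scan->pages = pages;
    scan->npages = npages;
    scan->ring[0] = pages[0];   /* first batch, assumed already loaded */
    scan->nloaded = 1;
    scan->head = 0;
    scan->next = 1;
    scan->pf_batch = 0;
    scan->pf_item = -1;
    scan->last_block = SKETCH_DONE;     /* nothing returned yet */
}

/* Read stream callback model: next block to prefetch, SKETCH_PAUSE or SKETCH_DONE */
static SketchBlock
sketch_next_block(SketchScan *scan)
{
    int         advances = 0;   /* batch advances made by this call */

    for (;;)
    {
        const SketchBatch *b = &scan->ring[scan->pf_batch % SKETCH_RING_SIZE];

        if (scan->pf_item + 1 >= b->nitems)
        {
            /* Out of items: step to the next batch, loading it if needed */
            if (advances >= SKETCH_MAX_BATCH_ADVANCES)
                return SKETCH_PAUSE;    /* throttle: let scanPos catch up */
            if (scan->pf_batch + 1 == scan->next)
            {
                if (scan->nloaded == scan->npages)
                    return SKETCH_DONE; /* no more index pages */
                if (scan->next - scan->head == SKETCH_RING_SIZE)
                    return SKETCH_PAUSE;    /* every ring slot is in use */
                scan->ring[scan->next % SKETCH_RING_SIZE] =
                    scan->pages[scan->nloaded++];
                scan->next++;
            }
            scan->pf_batch++;
            scan->pf_item = -1;
            advances++;
            continue;
        }

        scan->pf_item++;
        if (b->allvisible[scan->pf_item])
            continue;           /* all-visible: no heap fetch to prefetch */
        if (b->blocks[scan->pf_item] == scan->last_block)
            continue;           /* never return the same block twice in a row */
        scan->last_block = b->blocks[scan->pf_item];
        return scan->last_block;
    }
}

/* Model scanPos finishing the oldest batch, which frees its ring slot */
static void
sketch_free_head(SketchScan *scan)
{
    /* simplification: the prefetch position must already be past it */
    assert(scan->head < scan->pf_batch);
    scan->head++;
}
```

After a SKETCH_PAUSE caused by a full ring, a caller would invoke sketch_free_head() once scanPos has consumed the oldest batch; the next sketch_next_block() call then resumes from the saved prefetch position. A pause caused by throttling needs no such step, since the per-call advance counter starts over on the next call.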