Re: Seq scans status update

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Patches <pgsql-patches(at)postgresql(dot)org>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Seq scans status update
Date: 2007-05-21 07:07:26
Message-ID: 4651452E.4060805@enterprisedb.com
Lists: pgsql-patches
I forgot to attach the program used to generate test data. Here it is.

Heikki Linnakangas wrote:
> Attached is a new version of Simon's "scan-resistant buffer manager" 
> patch. It's not ready for committing yet because of a small issue I 
> found this morning (* see bottom), but here's a status update.
> 
> To recap, the basic idea is to use a small ring of buffers for large 
> scans like VACUUM, COPY and seq-scans. Changes to the original patch:
> 
> - different sized rings are used for VACUUM, seq scans, and COPY. 
> VACUUM and seq scans use a ring of 32 buffers, and COPY uses a ring of 4096 
> buffers in the default configuration (a quick check of those numbers follows 
> this list). See the README changes in the patch for the rationale.
> 
> - for queries with large seqscans, the buffer ring is only used for 
> reads issued by the seq scan, not for any other reads in the query. 
> A typical scenario where this matters is a large seq scan with a 
> nested loop join to a smaller table. You don't want to use the buffer 
> ring for index lookups inside the nested loop.
> 
> - for seqscans, drop buffers from the ring that would need a WAL flush 
> to reuse. That makes bulk updates behave roughly like they do without 
> the patch, instead of having to do a WAL flush every 32 pages.
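> 
> (To sanity-check those ring sizes, assuming the default 8 kB BLCKSZ and 
> 16 MB WAL segments: 256 kB / 8 kB = 32 buffers for VACUUM and seq scans, 
> and 2 * 16 MB / 8 kB = 4096 buffers for COPY, which is what the ring size 
> macros in freelist.c below work out to.)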
> 
> I've spent a lot of time thinking of solutions to the last point. The 
> obvious solution would be to not use the buffer ring for updating scans. 
> The difficulty with that is that we don't know if a scan is read-only in 
> heapam.c, where the hint to use the buffer ring is set.
> 
> I've completed a set of performance tests on a test server. The server 
> has 4 GB of RAM, of which 1 GB is used for shared_buffers.
> 
> Results for a 10 GB table:
> 
>  head-copy-bigtable               | 00:10:09.07016
>  head-copy-bigtable               | 00:10:20.507357
>  head-copy-bigtable               | 00:10:21.857677
>  head-copy_nowal-bigtable         | 00:05:18.232956
>  head-copy_nowal-bigtable         | 00:03:24.109047
>  head-copy_nowal-bigtable         | 00:05:31.019643
>  head-select-bigtable             | 00:03:47.102731
>  head-select-bigtable             | 00:01:08.314719
>  head-select-bigtable             | 00:01:08.238509
>  head-select-bigtable             | 00:01:08.208563
>  head-select-bigtable             | 00:01:08.28347
>  head-select-bigtable             | 00:01:08.308671
>  head-vacuum_clean-bigtable       | 00:01:04.227832
>  head-vacuum_clean-bigtable       | 00:01:04.232258
>  head-vacuum_clean-bigtable       | 00:01:04.294621
>  head-vacuum_clean-bigtable       | 00:01:04.280677
>  head-vacuum_hintbits-bigtable    | 00:04:01.123924
>  head-vacuum_hintbits-bigtable    | 00:03:58.253175
>  head-vacuum_hintbits-bigtable    | 00:04:26.318159
>  head-vacuum_hintbits-bigtable    | 00:04:37.512965
>  patched-copy-bigtable            | 00:09:52.776754
>  patched-copy-bigtable            | 00:10:18.185826
>  patched-copy-bigtable            | 00:10:16.975482
>  patched-copy_nowal-bigtable      | 00:03:14.882366
>  patched-copy_nowal-bigtable      | 00:04:01.04648
>  patched-copy_nowal-bigtable      | 00:03:56.062272
>  patched-select-bigtable          | 00:03:47.704154
>  patched-select-bigtable          | 00:01:08.460326
>  patched-select-bigtable          | 00:01:10.441544
>  patched-select-bigtable          | 00:01:11.916221
>  patched-select-bigtable          | 00:01:13.848038
>  patched-select-bigtable          | 00:01:10.956133
>  patched-vacuum_clean-bigtable    | 00:01:10.315439
>  patched-vacuum_clean-bigtable    | 00:01:12.210537
>  patched-vacuum_clean-bigtable    | 00:01:15.202114
>  patched-vacuum_clean-bigtable    | 00:01:10.712235
>  patched-vacuum_hintbits-bigtable | 00:03:42.279201
>  patched-vacuum_hintbits-bigtable | 00:04:02.057778
>  patched-vacuum_hintbits-bigtable | 00:04:26.805822
>  patched-vacuum_hintbits-bigtable | 00:04:28.911184
> 
> In other words, the patch has no significant effect, as expected. The 
> select times did go up by a couple of seconds, though, which I didn't 
> expect. One theory is that unused shared_buffers are swapped out during 
> the tests, and bgwriter pulls them back in. I'll set swappiness to 0 and 
> try again at some point.
> 
> Results for a 2 GB table:
> 
>  copy-medsize-unpatched            | 00:02:18.23246
>  copy-medsize-unpatched            | 00:02:22.347194
>  copy-medsize-unpatched            | 00:02:23.875874
>  copy_nowal-medsize-unpatched      | 00:01:27.606334
>  copy_nowal-medsize-unpatched      | 00:01:17.491243
>  copy_nowal-medsize-unpatched      | 00:01:31.902719
>  select-medsize-unpatched          | 00:00:03.786031
>  select-medsize-unpatched          | 00:00:02.678069
>  select-medsize-unpatched          | 00:00:02.666103
>  select-medsize-unpatched          | 00:00:02.673494
>  select-medsize-unpatched          | 00:00:02.669645
>  select-medsize-unpatched          | 00:00:02.666278
>  vacuum_clean-medsize-unpatched    | 00:00:01.091356
>  vacuum_clean-medsize-unpatched    | 00:00:01.923138
>  vacuum_clean-medsize-unpatched    | 00:00:01.917213
>  vacuum_clean-medsize-unpatched    | 00:00:01.917333
>  vacuum_hintbits-medsize-unpatched | 00:00:01.683718
>  vacuum_hintbits-medsize-unpatched | 00:00:01.864003
>  vacuum_hintbits-medsize-unpatched | 00:00:03.186596
>  vacuum_hintbits-medsize-unpatched | 00:00:02.16494
>  copy-medsize-patched              | 00:02:35.113501
>  copy-medsize-patched              | 00:02:25.269866
>  copy-medsize-patched              | 00:02:31.881089
>  copy_nowal-medsize-patched        | 00:01:00.254633
>  copy_nowal-medsize-patched        | 00:01:04.630687
>  copy_nowal-medsize-patched        | 00:01:03.729128
>  select-medsize-patched            | 00:00:03.201837
>  select-medsize-patched            | 00:00:01.332975
>  select-medsize-patched            | 00:00:01.33014
>  select-medsize-patched            | 00:00:01.332392
>  select-medsize-patched            | 00:00:01.333498
>  select-medsize-patched            | 00:00:01.332692
>  vacuum_clean-medsize-patched      | 00:00:01.140189
>  vacuum_clean-medsize-patched      | 00:00:01.062762
>  vacuum_clean-medsize-patched      | 00:00:01.062402
>  vacuum_clean-medsize-patched      | 00:00:01.07113
>  vacuum_hintbits-medsize-patched   | 00:00:17.865446
>  vacuum_hintbits-medsize-patched   | 00:00:15.162064
>  vacuum_hintbits-medsize-patched   | 00:00:01.704651
>  vacuum_hintbits-medsize-patched   | 00:00:02.671651
> 
> This looks good to me, except for a glitch in the last 
> vacuum_hintbits tests. Selects and vacuums benefit significantly, as 
> does non-WAL-logged copy.
> 
> Not shown here, but I ran tests earlier with vacuum on a table that 
> actually had dead tuples to remove. In that test the patched 
> version really shone, reducing the runtime to ~1/6th. That was the 
> original motivation of this patch: not having to do a WAL flush on every 
> page in the 2nd phase of vacuum.
> 
> Test script attached. To use it:
> 
> 1. Edit testscript.sh. Change BIGTABLESIZE.
> 2. Start postmaster
> 3. Run the script, giving a test label as an argument. For example: 
> "./testscript.sh bigtable-patched"
> 
> Attached is also the patch I used for the tests.
> 
> I would appreciate it if people would download the patch and the script 
> and repeat the tests on different hardware. I'm particularly interested 
> in testing on a box with good I/O hardware where selects on unpatched 
> PostgreSQL are bottlenecked by CPU.
> 
> Barring any surprises I'm going to fix the remaining issue and submit a 
> final patch, probably over the weekend.
> 
> (*) The issue with this patch is that if the buffer cache is completely 
> filled with dirty buffers that need a WAL flush to evict, the buffer 
> ring code will get into an infinite loop trying to find one that doesn't 
> need a WAL flush. Should be simple to fix.
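> 
> To illustrate the shape of one possible fix (a toy, self-contained sketch 
> with made-up names, not the patch code): remember whether the victim 
> buffer really came from the ring and let StrategyRejectBuffer reject only 
> those buffers. A victim that was just handed out by the shared clock sweep 
> is then always accepted, even if it needs a WAL flush, so the loop always 
> terminates:
> 
> #include <stdbool.h>
> #include <stdio.h>
> 
> #define NBUFFERS        16
> #define RING_SIZE       4
> #define BUF_ID_NOT_SET  (-1)
> 
> static int  ring[RING_SIZE];
> static int  ring_cur = 0;
> static int  sweep_hand = RING_SIZE;   /* clock hand; buffers 0..RING_SIZE-1
>                                        * are assumed to sit in the ring */
> static bool came_from_ring = false;   /* the extra state this sketch adds */
> 
> /* Pretend the whole cache is dirty and needs a WAL flush to evict. */
> static bool
> needs_wal_flush(int buf)
> {
>     (void) buf;
>     return true;
> }
> 
> /* Simplified StrategyGetBuffer: try the ring first, else fall back to the
>  * shared clock sweep and remember the new buffer in the empty ring slot. */
> static int
> get_victim(void)
> {
>     int     buf;
> 
>     if (++ring_cur >= RING_SIZE)
>         ring_cur = 0;
>     if (ring[ring_cur] != BUF_ID_NOT_SET)
>     {
>         came_from_ring = true;
>         return ring[ring_cur];
>     }
>     came_from_ring = false;
>     buf = sweep_hand++ % NBUFFERS;
>     ring[ring_cur] = buf;
>     return buf;
> }
> 
> /* Simplified StrategyRejectBuffer: only a buffer that came from the ring
>  * may be rejected, so the victim loop cannot spin forever when every buffer
>  * in the cache needs a WAL flush. */
> static bool
> reject_victim(void)
> {
>     if (!came_from_ring)
>         return false;
>     ring[ring_cur] = BUF_ID_NOT_SET;  /* drop it from the ring */
>     return true;
> }
> 
> int
> main(void)
> {
>     int     i;
>     int     buf;
> 
>     for (i = 0; i < RING_SIZE; i++)
>         ring[i] = i;            /* the scan has already filled the ring */
> 
>     /* Victim-selection loop, as in BufferAlloc() */
>     for (;;)
>     {
>         buf = get_victim();
>         if (needs_wal_flush(buf) && reject_victim())
>         {
>             printf("rejected ring buffer %d\n", buf);
>             continue;           /* pick another victim */
>         }
>         break;                  /* would flush WAL here and reuse the buffer */
>     }
>     printf("settled on buffer %d from the clock sweep\n", buf);
>     return 0;
> }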
> 
> 
> ------------------------------------------------------------------------
> 
> Index: src/backend/access/heap/heapam.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/heapam.c,v
> retrieving revision 1.232
> diff -c -r1.232 heapam.c
> *** src/backend/access/heap/heapam.c	8 Apr 2007 01:26:27 -0000	1.232
> --- src/backend/access/heap/heapam.c	16 May 2007 11:35:14 -0000
> ***************
> *** 83,88 ****
> --- 83,96 ----
>   	 */
>   	scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
>   
> + 	/* A scan on a table smaller than shared_buffers is treated like random
> + 	 * access, but bigger scans should use the bulk read replacement policy.
> + 	 */
> + 	if (scan->rs_nblocks > NBuffers)
> + 		scan->rs_accesspattern = AP_BULKREAD;
> + 	else
> + 		scan->rs_accesspattern = AP_NORMAL;
> + 
>   	scan->rs_inited = false;
>   	scan->rs_ctup.t_data = NULL;
>   	ItemPointerSetInvalid(&scan->rs_ctup.t_self);
> ***************
> *** 123,133 ****
> --- 131,146 ----
>   
>   	Assert(page < scan->rs_nblocks);
>   
> + 	/* Read the page with the right strategy */
> + 	SetAccessPattern(scan->rs_accesspattern);
> + 
>   	scan->rs_cbuf = ReleaseAndReadBuffer(scan->rs_cbuf,
>   										 scan->rs_rd,
>   										 page);
>   	scan->rs_cblock = page;
>   
> + 	SetAccessPattern(AP_NORMAL);
> + 
>   	if (!scan->rs_pageatatime)
>   		return;
>   
> Index: src/backend/access/transam/xlog.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlog.c,v
> retrieving revision 1.268
> diff -c -r1.268 xlog.c
> *** src/backend/access/transam/xlog.c	30 Apr 2007 21:01:52 -0000	1.268
> --- src/backend/access/transam/xlog.c	15 May 2007 16:23:30 -0000
> ***************
> *** 1668,1673 ****
> --- 1668,1700 ----
>   }
>   
>   /*
> +  * Returns true if 'record' hasn't been flushed to disk yet.
> +  */
> + bool
> + XLogNeedsFlush(XLogRecPtr record)
> + {
> + 	/* Quick exit if already known flushed */
> + 	if (XLByteLE(record, LogwrtResult.Flush))
> + 		return false;
> + 
> + 	/* read LogwrtResult and update local state */
> + 	{
> + 		/* use volatile pointer to prevent code rearrangement */
> + 		volatile XLogCtlData *xlogctl = XLogCtl;
> + 
> + 		SpinLockAcquire(&xlogctl->info_lck);
> + 		LogwrtResult = xlogctl->LogwrtResult;
> + 		SpinLockRelease(&xlogctl->info_lck);
> + 	}
> + 
> + 	/* check again */
> + 	if (XLByteLE(record, LogwrtResult.Flush))
> + 		return false;
> + 
> + 	return true;
> + }
> + 
> + /*
>    * Ensure that all XLOG data through the given position is flushed to disk.
>    *
>    * NOTE: this differs from XLogWrite mainly in that the WALWriteLock is not
> Index: src/backend/commands/copy.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/copy.c,v
> retrieving revision 1.283
> diff -c -r1.283 copy.c
> *** src/backend/commands/copy.c	27 Apr 2007 22:05:46 -0000	1.283
> --- src/backend/commands/copy.c	15 May 2007 17:05:29 -0000
> ***************
> *** 1876,1881 ****
> --- 1876,1888 ----
>   	nfields = file_has_oids ? (attr_count + 1) : attr_count;
>   	field_strings = (char **) palloc(nfields * sizeof(char *));
>   
> + 	/* Use the special COPY buffer replacement strategy if WAL-logging
> + 	 * is enabled. If it's not, the pages we're writing are dirty but
> + 	 * don't need a WAL flush to write out, so the BULKREAD strategy
> + 	 * is more suitable.
> + 	 */
> + 	SetAccessPattern(use_wal ? AP_COPY : AP_BULKREAD);
> + 
>   	/* Initialize state variables */
>   	cstate->fe_eof = false;
>   	cstate->eol_type = EOL_UNKNOWN;
> ***************
> *** 2161,2166 ****
> --- 2168,2176 ----
>   							cstate->filename)));
>   	}
>   
> + 	/* Reset buffer replacement strategy */
> + 	SetAccessPattern(AP_NORMAL);
> + 
>   	/* 
>   	 * If we skipped writing WAL, then we need to sync the heap (but not
>   	 * indexes since those use WAL anyway)
> Index: src/backend/commands/vacuum.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/vacuum.c,v
> retrieving revision 1.350
> diff -c -r1.350 vacuum.c
> *** src/backend/commands/vacuum.c	16 Apr 2007 18:29:50 -0000	1.350
> --- src/backend/commands/vacuum.c	15 May 2007 17:06:18 -0000
> ***************
> *** 421,431 ****
>   				 * Tell the buffer replacement strategy that vacuum is causing
>   				 * the IO
>   				 */
> ! 				StrategyHintVacuum(true);
>   
>   				analyze_rel(relid, vacstmt);
>   
> ! 				StrategyHintVacuum(false);
>   
>   				if (use_own_xacts)
>   					CommitTransactionCommand();
> --- 421,431 ----
>   				 * Tell the buffer replacement strategy that vacuum is causing
>   				 * the IO
>   				 */
> ! 				SetAccessPattern(AP_VACUUM);
>   
>   				analyze_rel(relid, vacstmt);
>   
> ! 				SetAccessPattern(AP_NORMAL);
>   
>   				if (use_own_xacts)
>   					CommitTransactionCommand();
> ***************
> *** 442,448 ****
>   		/* Make sure cost accounting is turned off after error */
>   		VacuumCostActive = false;
>   		/* And reset buffer replacement strategy, too */
> ! 		StrategyHintVacuum(false);
>   		PG_RE_THROW();
>   	}
>   	PG_END_TRY();
> --- 442,448 ----
>   		/* Make sure cost accounting is turned off after error */
>   		VacuumCostActive = false;
>   		/* And reset buffer replacement strategy, too */
> ! 		SetAccessPattern(AP_NORMAL);
>   		PG_RE_THROW();
>   	}
>   	PG_END_TRY();
> ***************
> *** 1088,1094 ****
>   	 * Tell the cache replacement strategy that vacuum is causing all
>   	 * following IO
>   	 */
> ! 	StrategyHintVacuum(true);
>   
>   	/*
>   	 * Do the actual work --- either FULL or "lazy" vacuum
> --- 1088,1094 ----
>   	 * Tell the cache replacement strategy that vacuum is causing all
>   	 * following IO
>   	 */
> ! 	SetAccessPattern(AP_VACUUM);
>   
>   	/*
>   	 * Do the actual work --- either FULL or "lazy" vacuum
> ***************
> *** 1098,1104 ****
>   	else
>   		lazy_vacuum_rel(onerel, vacstmt);
>   
> ! 	StrategyHintVacuum(false);
>   
>   	/* all done with this class, but hold lock until commit */
>   	relation_close(onerel, NoLock);
> --- 1098,1104 ----
>   	else
>   		lazy_vacuum_rel(onerel, vacstmt);
>   
> ! 	SetAccessPattern(AP_NORMAL);
>   
>   	/* all done with this class, but hold lock until commit */
>   	relation_close(onerel, NoLock);
> Index: src/backend/storage/buffer/README
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/README,v
> retrieving revision 1.11
> diff -c -r1.11 README
> *** src/backend/storage/buffer/README	23 Jul 2006 03:07:58 -0000	1.11
> --- src/backend/storage/buffer/README	16 May 2007 11:43:11 -0000
> ***************
> *** 152,159 ****
>   a field to show which backend is doing its I/O).
>   
>   
> ! Buffer replacement strategy
> ! ---------------------------
>   
>   There is a "free list" of buffers that are prime candidates for replacement.
>   In particular, buffers that are completely free (contain no valid page) are
> --- 152,159 ----
>   a field to show which backend is doing its I/O).
>   
>   
> ! Normal buffer replacement strategy
> ! ----------------------------------
>   
>   There is a "free list" of buffers that are prime candidates for replacement.
>   In particular, buffers that are completely free (contain no valid page) are
> ***************
> *** 199,221 ****
>   have to give up and try another buffer.  This however is not a concern
>   of the basic select-a-victim-buffer algorithm.)
>   
> - A special provision is that while running VACUUM, a backend does not
> - increment the usage count on buffers it accesses.  In fact, if ReleaseBuffer
> - sees that it is dropping the pin count to zero and the usage count is zero,
> - then it appends the buffer to the tail of the free list.  (This implies that
> - VACUUM, but only VACUUM, must take the BufFreelistLock during ReleaseBuffer;
> - this shouldn't create much of a contention problem.)  This provision
> - encourages VACUUM to work in a relatively small number of buffers rather
> - than blowing out the entire buffer cache.  It is reasonable since a page
> - that has been touched only by VACUUM is unlikely to be needed again soon.
> - 
> - Since VACUUM usually requests many pages very fast, the effect of this is that
> - it will get back the very buffers it filled and possibly modified on the next
> - call and will therefore do its work in a few shared memory buffers, while
> - being able to use whatever it finds in the cache already.  This also implies
> - that most of the write traffic caused by a VACUUM will be done by the VACUUM
> - itself and not pushed off onto other processes.
>   
>   
>   Background writer's processing
>   ------------------------------
> --- 199,243 ----
>   have to give up and try another buffer.  This however is not a concern
>   of the basic select-a-victim-buffer algorithm.)
>   
>   
> + Buffer ring replacement strategy
> + ---------------------------------
> + 
> + When running a query that needs to access a large number of pages, like VACUUM,
> + COPY, or a large sequential scan, a different strategy is used.  A page that
> + has been touched only by such a scan is unlikely to be needed again soon, so
> + instead of running the normal clock sweep algorithm and blowing out the entire
> + buffer cache, a small ring of buffers is allocated using the normal clock sweep
> + algorithm and those buffers are reused for the whole scan.  This also implies
> + that most of the write traffic caused by such a statement will be done by the
> + backend itself and not pushed off onto other processes.
> + 
> + The size of the ring used depends on the kind of scan:
> + 
> + For sequential scans, a small 256 KB ring is used. That's small enough to fit
> + in L2 cache, which makes transferring pages from OS cache to shared buffer
> + cache efficient. Even less would often be enough, but the ring must be big
> + enough to accommodate all pages in the scan that are pinned concurrently. 
> + 256 KB should also be enough to leave a small cache trail for other backends to
> + join in a synchronized seq scan. If a buffer is dirtied and LSN set, the buffer
> + is removed from the ring and a replacement buffer is chosen using the normal
> + replacement strategy. In a scan that modifies every page in the scan, like a
> + bulk UPDATE or DELETE, the buffers in the ring will always be dirtied and the
> + ring strategy effectively degrades to the normal strategy.
> + 
> + VACUUM uses a 256 KB ring like sequential scans, but dirty pages are not
> + removed from the ring. WAL is flushed instead to allow reuse of the buffers.
> + Before introducing the buffer ring strategy in 8.3, buffers were put to the
> + freelist, which was effectively a buffer ring of 1 buffer.
> + 
> + COPY behaves like VACUUM, but a much larger ring is used. The ring size is
> + chosen to be twice the WAL segment size. This avoids polluting the buffer cache
> + like the clock sweep would do, and using a ring larger than WAL segment size
> + avoids having to do any extra WAL flushes, since a WAL segment will always be
> + filled, forcing a WAL flush, before looping through the buffer ring and bumping
> + into a buffer that would force a WAL flush. However, for non-WAL-logged COPY
> + operations the smaller 256 KB ring is used because WAL flushes are not needed
> + to write the buffers.
>   
>   Background writer's processing
>   ------------------------------
> Index: src/backend/storage/buffer/bufmgr.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
> retrieving revision 1.218
> diff -c -r1.218 bufmgr.c
> *** src/backend/storage/buffer/bufmgr.c	2 May 2007 23:34:48 -0000	1.218
> --- src/backend/storage/buffer/bufmgr.c	16 May 2007 12:34:10 -0000
> ***************
> *** 419,431 ****
>   	/* Loop here in case we have to try another victim buffer */
>   	for (;;)
>   	{
>   		/*
>   		 * Select a victim buffer.	The buffer is returned with its header
>   		 * spinlock still held!  Also the BufFreelistLock is still held, since
>   		 * it would be bad to hold the spinlock while possibly waking up other
>   		 * processes.
>   		 */
> ! 		buf = StrategyGetBuffer();
>   
>   		Assert(buf->refcount == 0);
>   
> --- 419,433 ----
>   	/* Loop here in case we have to try another victim buffer */
>   	for (;;)
>   	{
> + 		bool lock_held;
> + 
>   		/*
>   		 * Select a victim buffer.	The buffer is returned with its header
>   		 * spinlock still held!  Also the BufFreelistLock is still held, since
>   		 * it would be bad to hold the spinlock while possibly waking up other
>   		 * processes.
>   		 */
> ! 		buf = StrategyGetBuffer(&lock_held);
>   
>   		Assert(buf->refcount == 0);
>   
> ***************
> *** 436,442 ****
>   		PinBuffer_Locked(buf);
>   
>   		/* Now it's safe to release the freelist lock */
> ! 		LWLockRelease(BufFreelistLock);
>   
>   		/*
>   		 * If the buffer was dirty, try to write it out.  There is a race
> --- 438,445 ----
>   		PinBuffer_Locked(buf);
>   
>   		/* Now it's safe to release the freelist lock */
> ! 		if (lock_held)
> ! 			LWLockRelease(BufFreelistLock);
>   
>   		/*
>   		 * If the buffer was dirty, try to write it out.  There is a race
> ***************
> *** 464,469 ****
> --- 467,489 ----
>   			 */
>   			if (LWLockConditionalAcquire(buf->content_lock, LW_SHARED))
>   			{
> + 				/* In BULKREAD-mode, check if a WAL flush would be needed to
> + 				 * evict this buffer. If so, ask the replacement strategy if
> + 				 * we should go ahead and do it or choose another victim.
> + 				 */
> + 				if (active_access_pattern == AP_BULKREAD)
> + 				{
> + 					if (XLogNeedsFlush(BufferGetLSN(buf)))
> + 					{
> + 						if (StrategyRejectBuffer(buf))
> + 						{
> + 							LWLockRelease(buf->content_lock);
> + 							UnpinBuffer(buf, true, false);
> + 							continue;
> + 						}
> + 					}
> + 				}
> + 
>   				FlushBuffer(buf, NULL);
>   				LWLockRelease(buf->content_lock);
>   			}
> ***************
> *** 925,932 ****
>   	PrivateRefCount[b]--;
>   	if (PrivateRefCount[b] == 0)
>   	{
> - 		bool		immed_free_buffer = false;
> - 
>   		/* I'd better not still hold any locks on the buffer */
>   		Assert(!LWLockHeldByMe(buf->content_lock));
>   		Assert(!LWLockHeldByMe(buf->io_in_progress_lock));
> --- 945,950 ----
> ***************
> *** 940,956 ****
>   		/* Update buffer usage info, unless this is an internal access */
>   		if (normalAccess)
>   		{
> ! 			if (!strategy_hint_vacuum)
>   			{
> ! 				if (buf->usage_count < BM_MAX_USAGE_COUNT)
> ! 					buf->usage_count++;
>   			}
>   			else
> ! 			{
> ! 				/* VACUUM accesses don't bump usage count, instead... */
> ! 				if (buf->refcount == 0 && buf->usage_count == 0)
> ! 					immed_free_buffer = true;
> ! 			}
>   		}
>   
>   		if ((buf->flags & BM_PIN_COUNT_WAITER) &&
> --- 958,975 ----
>   		/* Update buffer usage info, unless this is an internal access */
>   		if (normalAccess)
>   		{
> ! 			if (active_access_pattern != AP_NORMAL)
>   			{
> ! 				/* We don't want large one-off scans like vacuum to inflate 
> ! 				 * the usage_count. We do want to set it to 1, though, to keep
> ! 				 * other backends from hijacking it from the buffer ring.
> ! 				 */
> ! 				if (buf->usage_count == 0)
> ! 					buf->usage_count = 1;
>   			}
>   			else
> ! 			if (buf->usage_count < BM_MAX_USAGE_COUNT)
> ! 				buf->usage_count++;
>   		}
>   
>   		if ((buf->flags & BM_PIN_COUNT_WAITER) &&
> ***************
> *** 965,978 ****
>   		}
>   		else
>   			UnlockBufHdr(buf);
> - 
> - 		/*
> - 		 * If VACUUM is releasing an otherwise-unused buffer, send it to the
> - 		 * freelist for near-term reuse.  We put it at the tail so that it
> - 		 * won't be used before any invalid buffers that may exist.
> - 		 */
> - 		if (immed_free_buffer)
> - 			StrategyFreeBuffer(buf, false);
>   	}
>   }
>   
> --- 984,989 ----
> Index: src/backend/storage/buffer/freelist.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/freelist.c,v
> retrieving revision 1.58
> diff -c -r1.58 freelist.c
> *** src/backend/storage/buffer/freelist.c	5 Jan 2007 22:19:37 -0000	1.58
> --- src/backend/storage/buffer/freelist.c	17 May 2007 16:12:56 -0000
> ***************
> *** 18,23 ****
> --- 18,25 ----
>   #include "storage/buf_internals.h"
>   #include "storage/bufmgr.h"
>   
> + #include "utils/memutils.h"
> + 
>   
>   /*
>    * The shared freelist control information.
> ***************
> *** 39,47 ****
>   /* Pointers to shared state */
>   static BufferStrategyControl *StrategyControl = NULL;
>   
> ! /* Backend-local state about whether currently vacuuming */
> ! bool		strategy_hint_vacuum = false;
>   
>   
>   /*
>    * StrategyGetBuffer
> --- 41,53 ----
>   /* Pointers to shared state */
>   static BufferStrategyControl *StrategyControl = NULL;
>   
> ! /* Currently active access pattern hint. */
> ! AccessPattern active_access_pattern = AP_NORMAL;
>   
> + /* prototypes for internal functions */
> + static volatile BufferDesc *GetBufferFromRing(void);
> + static void PutBufferToRing(volatile BufferDesc *buf);
> + static void InitRing(void);
>   
>   /*
>    * StrategyGetBuffer
> ***************
> *** 51,67 ****
>    *	the selected buffer must not currently be pinned by anyone.
>    *
>    *	To ensure that no one else can pin the buffer before we do, we must
> !  *	return the buffer with the buffer header spinlock still held.  That
> !  *	means that we return with the BufFreelistLock still held, as well;
> !  *	the caller must release that lock once the spinlock is dropped.
>    */
>   volatile BufferDesc *
> ! StrategyGetBuffer(void)
>   {
>   	volatile BufferDesc *buf;
>   	int			trycounter;
>   
>   	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
>   
>   	/*
>   	 * Try to get a buffer from the freelist.  Note that the freeNext fields
> --- 57,89 ----
>    *	the selected buffer must not currently be pinned by anyone.
>    *
>    *	To ensure that no one else can pin the buffer before we do, we must
> !  *	return the buffer with the buffer header spinlock still held.  If
> !  *	*lock_held is set at return, we return with the BufFreelistLock still
> !  *	held, as well;	the caller must release that lock once the spinlock is
> !  *	dropped.
>    */
>   volatile BufferDesc *
> ! StrategyGetBuffer(bool *lock_held)
>   {
>   	volatile BufferDesc *buf;
>   	int			trycounter;
>   
> + 	/* Get a buffer from the ring if we're doing a bulk scan */
> + 	if (active_access_pattern != AP_NORMAL)
> + 	{
> + 		buf = GetBufferFromRing();
> + 		if (buf != NULL)
> + 		{
> + 			*lock_held = false;
> + 			return buf;
> + 		}
> + 	}
> + 
> + 	/*
> + 	 * If our selected buffer wasn't available, pick another...
> + 	 */
>   	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
> + 	*lock_held = true;
>   
>   	/*
>   	 * Try to get a buffer from the freelist.  Note that the freeNext fields
> ***************
> *** 86,96 ****
>   		 */
>   		LockBufHdr(buf);
>   		if (buf->refcount == 0 && buf->usage_count == 0)
>   			return buf;
>   		UnlockBufHdr(buf);
>   	}
>   
> ! 	/* Nothing on the freelist, so run the "clock sweep" algorithm */
>   	trycounter = NBuffers;
>   	for (;;)
>   	{
> --- 108,122 ----
>   		 */
>   		LockBufHdr(buf);
>   		if (buf->refcount == 0 && buf->usage_count == 0)
> + 		{
> + 			if (active_access_pattern != AP_NORMAL)
> + 				PutBufferToRing(buf);
>   			return buf;
> + 		}
>   		UnlockBufHdr(buf);
>   	}
>   
> ! 	/* Nothing on the freelist, so run the shared "clock sweep" algorithm */
>   	trycounter = NBuffers;
>   	for (;;)
>   	{
> ***************
> *** 105,111 ****
> --- 131,141 ----
>   		 */
>   		LockBufHdr(buf);
>   		if (buf->refcount == 0 && buf->usage_count == 0)
> + 		{
> + 			if (active_access_pattern != AP_NORMAL)
> + 				PutBufferToRing(buf);
>   			return buf;
> + 		}
>   		if (buf->usage_count > 0)
>   		{
>   			buf->usage_count--;
> ***************
> *** 191,204 ****
>   }
>   
>   /*
> !  * StrategyHintVacuum -- tell us whether VACUUM is active
>    */
>   void
> ! StrategyHintVacuum(bool vacuum_active)
>   {
> ! 	strategy_hint_vacuum = vacuum_active;
> ! }
>   
>   
>   /*
>    * StrategyShmemSize
> --- 221,245 ----
>   }
>   
>   /*
> !  * SetAccessPattern -- Sets the active access pattern hint
> !  *
> !  * Caller is responsible for resetting the hint to AP_NORMAL after the bulk
> !  * operation is done. It's ok to switch repeatedly between AP_NORMAL and one of
> !  * the other strategies, for example in a query with one large sequential scan
> !  * nested loop joined to an index scan. Index tuples should be fetched with the
> !  * normal strategy and the pages from the seq scan should be read in with the
> !  * AP_BULKREAD strategy. The ring won't be affected by such switching, however
> !  * switching to an access pattern with different ring size will invalidate the
> !  * old ring.
>    */
>   void
> ! SetAccessPattern(AccessPattern new_pattern)
>   {
> ! 	active_access_pattern = new_pattern;
>   
> + 	if (active_access_pattern != AP_NORMAL)
> + 		InitRing();
> + }
>   
>   /*
>    * StrategyShmemSize
> ***************
> *** 274,276 ****
> --- 315,498 ----
>   	else
>   		Assert(!init);
>   }
> + 
> + /* ----------------------------------------------------------------
> +  *				Backend-private buffer ring management
> +  * ----------------------------------------------------------------
> +  */
> + 
> + /*
> +  * Ring sizes for different access patterns. See README for the rationale
> +  * of these.
> +  */
> + #define BULKREAD_RING_SIZE	(256 * 1024 / BLCKSZ)
> + #define VACUUM_RING_SIZE	(256 * 1024 / BLCKSZ)
> + #define COPY_RING_SIZE		Min(NBuffers / 8, (XLOG_SEG_SIZE / BLCKSZ) * 2)
> + 
> + /*
> +  * BufferRing is an array of buffer ids, and RingSize its size in number of
> +  * elements. It's allocated in TopMemoryContext the first time it's needed.
> +  */
> + static int *BufferRing = NULL;
> + static int RingSize = 0;
> + 
> + /* Index of the "current" slot in the ring. It's advanced every time a buffer
> +  * is handed out from the ring with GetBufferFromRing and it points to the 
> +  * last buffer returned from the ring. RingCurSlot + 1 is the next victim
> +  * GetBufferFromRing will hand out.
> +  */
> + static int RingCurSlot = 0;
> + 
> + /* magic value to mark empty slots in the ring */
> + #define BUF_ID_NOT_SET -1
> + 
> + 
> + /*
> +  * GetBufferFromRing -- returns a buffer from the ring, or NULL if the
> +  *		ring is empty.
> +  *
> +  * The bufhdr spin lock is held on the returned buffer.
> +  */
> + static volatile BufferDesc *
> + GetBufferFromRing(void)
> + {
> + 	volatile BufferDesc *buf;
> + 
> + 	/* ring should be initialized by now */
> + 	Assert(RingSize > 0 && BufferRing != NULL);
> + 
> + 	/* Run private "clock cycle" */
> + 	if (++RingCurSlot >= RingSize)
> + 		RingCurSlot = 0;
> + 
> + 	/*
> + 	 * If that slot hasn't been filled yet, tell the caller to allocate
> + 	 * a new buffer with the normal allocation strategy. He will then
> + 	 * fill this slot by calling PutBufferToRing with the new buffer.
> + 	 */
> + 	if (BufferRing[RingCurSlot] == BUF_ID_NOT_SET)
> + 		return NULL;
> + 
> + 	buf = &BufferDescriptors[BufferRing[RingCurSlot]];
> + 
> + 	/*
> + 	 * If the buffer is pinned we cannot use it under any circumstances.
> + 	 * If usage_count == 0 then the buffer is fair game. 
> + 	 *
> + 	 * We also choose this buffer if usage_count == 1. Strictly, this
> + 	 * might sometimes be the wrong thing to do, but we rely on the high
> + 	 * probability that it was this process that last touched the buffer.
> + 	 * If it wasn't, we'll choose a suboptimal victim, but  it shouldn't
> + 	 * make any difference in the big scheme of things.
> + 	 *
> + 	 */
> + 	LockBufHdr(buf);
> + 	if (buf->refcount == 0 && buf->usage_count <= 1)
> + 		return buf;
> + 	UnlockBufHdr(buf);
> + 
> + 	return NULL;
> + }
> + 
> + /*
> +  * PutBufferToRing -- adds a buffer to the buffer ring
> +  *
> +  * Caller must hold the buffer header spinlock on the buffer.
> +  */
> + static void
> + PutBufferToRing(volatile BufferDesc *buf)
> + {
> + 	/* ring should be initialized by now */
> + 	Assert(RingSize > 0 && BufferRing != NULL);
> + 
> + 	if (BufferRing[RingCurSlot] == BUF_ID_NOT_SET)
> + 		BufferRing[RingCurSlot] = buf->buf_id;
> + }
> + 
> + /*
> +  * Initializes a ring buffer with correct size for the currently
> +  * active strategy. Does nothing if the ring already has the right size.
> +  */
> + static void
> + InitRing(void)
> + {
> + 	int new_size;
> + 	int old_size = RingSize;
> + 	int i;
> + 	MemoryContext oldcxt;
> + 
> + 	/* Determine new size */
> + 
> + 	switch(active_access_pattern)
> + 	{
> + 		case AP_BULKREAD:
> + 			new_size = BULKREAD_RING_SIZE;
> + 			break;
> + 		case AP_COPY:
> + 			new_size = COPY_RING_SIZE;
> + 			break;
> + 		case AP_VACUUM:
> + 			new_size = VACUUM_RING_SIZE;
> + 			break;
> + 		default:
> + 			elog(ERROR, "unexpected buffer cache strategy %d", 
> + 				 active_access_pattern);
> + 			return; /* keep compiler happy */
> + 	}
> + 
> + 	/*
> + 	 * Seq scans set and reset the strategy on every page, so we better exit
> + 	 * quickly if no change in size is needed.
> + 	 */
> + 	if (new_size == old_size)
> + 		return;
> + 
> + 	/* Allocate array */
> + 
> + 	oldcxt = MemoryContextSwitchTo(TopMemoryContext);
> + 
> + 	if (old_size == 0)
> + 	{
> + 		Assert(BufferRing == NULL);
> + 		BufferRing = palloc(new_size * sizeof(int));
> + 	}
> + 	else
> + 		BufferRing = repalloc(BufferRing, new_size * sizeof(int));
> + 
> + 	MemoryContextSwitchTo(oldcxt);
> + 
> + 	for(i = 0; i < new_size; i++)
> + 		BufferRing[i] = BUF_ID_NOT_SET;
> + 
> + 	RingCurSlot = 0;
> + 	RingSize = new_size;
> + }
> + 
> + /*
> +  * Buffer manager calls this function in AP_BULKREAD mode when the
> +  * buffer handed to it turns out to need a WAL flush to write out. This
> +  * gives the strategy a second chance to choose another victim.
> +  *
> +  * Returns true if buffer manager should ask for a new victim, and false
> +  * if WAL should be flushed and this buffer used.
> +  */
> + bool
> + StrategyRejectBuffer(volatile BufferDesc *buf)
> + {
> + 	Assert(RingSize > 0);
> + 
> + 	if (BufferRing[RingCurSlot] == buf->buf_id)
> + 	{
> + 		BufferRing[RingCurSlot] = BUF_ID_NOT_SET;
> + 		return true;
> + 	}
> + 	else
> + 	{
> + 		/* Apparently the buffer didn't come from the ring. We don't want to
> + 		 * mess with how the clock sweep works; in the worst case there are no
> + 		 * buffers in the buffer cache that can be reused without a WAL flush,
> + 		 * and we'd get into an endless loop trying.
> + 		 */
> + 		return false;
> + 	}
> + }
> Index: src/include/access/relscan.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/relscan.h,v
> retrieving revision 1.52
> diff -c -r1.52 relscan.h
> *** src/include/access/relscan.h	20 Jan 2007 18:43:35 -0000	1.52
> --- src/include/access/relscan.h	15 May 2007 17:01:31 -0000
> ***************
> *** 28,33 ****
> --- 28,34 ----
>   	ScanKey		rs_key;			/* array of scan key descriptors */
>   	BlockNumber rs_nblocks;		/* number of blocks to scan */
>   	bool		rs_pageatatime; /* verify visibility page-at-a-time? */
> + 	AccessPattern rs_accesspattern; /* access pattern to use for reads */
>   
>   	/* scan current state */
>   	bool		rs_inited;		/* false = scan not init'd yet */
> Index: src/include/access/xlog.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/xlog.h,v
> retrieving revision 1.76
> diff -c -r1.76 xlog.h
> *** src/include/access/xlog.h	5 Jan 2007 22:19:51 -0000	1.76
> --- src/include/access/xlog.h	14 May 2007 21:22:40 -0000
> ***************
> *** 151,156 ****
> --- 151,157 ----
>   
>   extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
>   extern void XLogFlush(XLogRecPtr RecPtr);
> + extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
>   
>   extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record);
>   extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec);
> Index: src/include/storage/buf_internals.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/buf_internals.h,v
> retrieving revision 1.89
> diff -c -r1.89 buf_internals.h
> *** src/include/storage/buf_internals.h	5 Jan 2007 22:19:57 -0000	1.89
> --- src/include/storage/buf_internals.h	15 May 2007 17:07:59 -0000
> ***************
> *** 16,21 ****
> --- 16,22 ----
>   #define BUFMGR_INTERNALS_H
>   
>   #include "storage/buf.h"
> + #include "storage/bufmgr.h"
>   #include "storage/lwlock.h"
>   #include "storage/shmem.h"
>   #include "storage/spin.h"
> ***************
> *** 168,174 ****
>   extern BufferDesc *LocalBufferDescriptors;
>   
>   /* in freelist.c */
> ! extern bool strategy_hint_vacuum;
>   
>   /* event counters in buf_init.c */
>   extern long int ReadBufferCount;
> --- 169,175 ----
>   extern BufferDesc *LocalBufferDescriptors;
>   
>   /* in freelist.c */
> ! extern AccessPattern active_access_pattern;
>   
>   /* event counters in buf_init.c */
>   extern long int ReadBufferCount;
> ***************
> *** 184,195 ****
>    */
>   
>   /* freelist.c */
> ! extern volatile BufferDesc *StrategyGetBuffer(void);
>   extern void StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head);
>   extern int	StrategySyncStart(void);
>   extern Size StrategyShmemSize(void);
>   extern void StrategyInitialize(bool init);
>   
>   /* buf_table.c */
>   extern Size BufTableShmemSize(int size);
>   extern void InitBufTable(int size);
> --- 185,198 ----
>    */
>   
>   /* freelist.c */
> ! extern volatile BufferDesc *StrategyGetBuffer(bool *lock_held);
>   extern void StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head);
>   extern int	StrategySyncStart(void);
>   extern Size StrategyShmemSize(void);
>   extern void StrategyInitialize(bool init);
>   
> + extern bool StrategyRejectBuffer(volatile BufferDesc *buf);
> + 
>   /* buf_table.c */
>   extern Size BufTableShmemSize(int size);
>   extern void InitBufTable(int size);
> Index: src/include/storage/bufmgr.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/bufmgr.h,v
> retrieving revision 1.103
> diff -c -r1.103 bufmgr.h
> *** src/include/storage/bufmgr.h	2 May 2007 23:18:03 -0000	1.103
> --- src/include/storage/bufmgr.h	15 May 2007 17:07:02 -0000
> ***************
> *** 48,53 ****
> --- 48,61 ----
>   #define BUFFER_LOCK_SHARE		1
>   #define BUFFER_LOCK_EXCLUSIVE	2
>   
> + typedef enum AccessPattern
> + {
> + 	AP_NORMAL,		/* Normal random access */
> +     AP_BULKREAD,	/* Large read-only scan (hint bit updates are ok) */
> +     AP_COPY,		/* Large updating scan, like COPY with WAL enabled */
> +     AP_VACUUM,		/* VACUUM */
> + } AccessPattern;
> + 
>   /*
>    * These routines are beaten on quite heavily, hence the macroization.
>    */
> ***************
> *** 157,162 ****
>   extern void AtProcExit_LocalBuffers(void);
>   
>   /* in freelist.c */
> ! extern void StrategyHintVacuum(bool vacuum_active);
>   
>   #endif
> --- 165,170 ----
>   extern void AtProcExit_LocalBuffers(void);
>   
>   /* in freelist.c */
> ! extern void SetAccessPattern(AccessPattern new_pattern);
>   
>   #endif


-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment: gendata.c
Description: text/x-csrc (406 bytes)
