Re: Seq scans status update

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Patches <pgsql-patches(at)postgresql(dot)org>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Seq scans status update
Date: 2007-05-21 07:07:26
Message-ID: 4651452E.4060805@enterprisedb.com
Lists: pgsql-patches
I forgot to attach the program used to generate test data. Here it is.

Heikki Linnakangas wrote:
> Attached is a new version of Simon's "scan-resistant buffer manager" 
> patch. It's not ready for committing yet because of a small issue I 
> found this morning (* see bottom), but here's a status update.
> 
> To recap, the basic idea is to use a small ring of buffers for large 
> scans like VACUUM, COPY and seq-scans. Changes to the original patch:
> 
> - different sized rings are used for VACUUM, seq scans, and COPY. 
> VACUUM and seq scans use a ring of 32 buffers, and COPY uses a ring of 4096 
> buffers in the default configuration (a quick check of those numbers follows 
> this list). See the README changes in the patch for the rationale.
> 
> - for queries with large seqscans, the buffer ring is only used for 
> reads issued by the seq scan, not for any other reads in the query. 
> A typical scenario where this matters is a large seq scan with a 
> nested loop join to a smaller table. You don't want to use the buffer 
> ring for index lookups inside the nested loop.
> 
> - for seqscans, drop buffers from the ring that would need a WAL flush 
> to reuse. That makes bulk updates behave roughly like they do without 
> the patch, instead of having to do a WAL flush every 32 pages.
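> 
> (To sanity-check those ring sizes, assuming the default 8 kB BLCKSZ and 
> 16 MB WAL segments: 256 kB / 8 kB = 32 buffers for VACUUM and seq scans, 
> and 2 * 16 MB / 8 kB = 4096 buffers for COPY, which is what the ring size 
> macros in freelist.c below work out to.)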
> 
> I've spent a lot of time thinking of solutions to the last point. The 
> obvious solution would be to not use the buffer ring for updating scans. 
> The difficulty with that is that we don't know if a scan is read-only in 
> heapam.c, where the hint to use the buffer ring is set.
> 
> I've completed a set of performance tests on a test server. The server 
> has 4 GB of RAM, of which 1 GB is used for shared_buffers.
> 
> Results for a 10 GB table:
> 
>  head-copy-bigtable               | 00:10:09.07016
>  head-copy-bigtable               | 00:10:20.507357
>  head-copy-bigtable               | 00:10:21.857677
>  head-copy_nowal-bigtable         | 00:05:18.232956
>  head-copy_nowal-bigtable         | 00:03:24.109047
>  head-copy_nowal-bigtable         | 00:05:31.019643
>  head-select-bigtable             | 00:03:47.102731
>  head-select-bigtable             | 00:01:08.314719
>  head-select-bigtable             | 00:01:08.238509
>  head-select-bigtable             | 00:01:08.208563
>  head-select-bigtable             | 00:01:08.28347
>  head-select-bigtable             | 00:01:08.308671
>  head-vacuum_clean-bigtable       | 00:01:04.227832
>  head-vacuum_clean-bigtable       | 00:01:04.232258
>  head-vacuum_clean-bigtable       | 00:01:04.294621
>  head-vacuum_clean-bigtable       | 00:01:04.280677
>  head-vacuum_hintbits-bigtable    | 00:04:01.123924
>  head-vacuum_hintbits-bigtable    | 00:03:58.253175
>  head-vacuum_hintbits-bigtable    | 00:04:26.318159
>  head-vacuum_hintbits-bigtable    | 00:04:37.512965
>  patched-copy-bigtable            | 00:09:52.776754
>  patched-copy-bigtable            | 00:10:18.185826
>  patched-copy-bigtable            | 00:10:16.975482
>  patched-copy_nowal-bigtable      | 00:03:14.882366
>  patched-copy_nowal-bigtable      | 00:04:01.04648
>  patched-copy_nowal-bigtable      | 00:03:56.062272
>  patched-select-bigtable          | 00:03:47.704154
>  patched-select-bigtable          | 00:01:08.460326
>  patched-select-bigtable          | 00:01:10.441544
>  patched-select-bigtable          | 00:01:11.916221
>  patched-select-bigtable          | 00:01:13.848038
>  patched-select-bigtable          | 00:01:10.956133
>  patched-vacuum_clean-bigtable    | 00:01:10.315439
>  patched-vacuum_clean-bigtable    | 00:01:12.210537
>  patched-vacuum_clean-bigtable    | 00:01:15.202114
>  patched-vacuum_clean-bigtable    | 00:01:10.712235
>  patched-vacuum_hintbits-bigtable | 00:03:42.279201
>  patched-vacuum_hintbits-bigtable | 00:04:02.057778
>  patched-vacuum_hintbits-bigtable | 00:04:26.805822
>  patched-vacuum_hintbits-bigtable | 00:04:28.911184
> 
> In other words, the patch has no significant effect, as expected. The 
> select times did go up by a couple of seconds, though, which I didn't 
> expect. One theory is that unused shared_buffers are swapped out during 
> the tests, and bgwriter pulls them back in. I'll set swappiness to 0 and 
> try again at some point.
> 
> Results for a 2 GB table:
> 
>  copy-medsize-unpatched            | 00:02:18.23246
>  copy-medsize-unpatched            | 00:02:22.347194
>  copy-medsize-unpatched            | 00:02:23.875874
>  copy_nowal-medsize-unpatched      | 00:01:27.606334
>  copy_nowal-medsize-unpatched      | 00:01:17.491243
>  copy_nowal-medsize-unpatched      | 00:01:31.902719
>  select-medsize-unpatched          | 00:00:03.786031
>  select-medsize-unpatched          | 00:00:02.678069
>  select-medsize-unpatched          | 00:00:02.666103
>  select-medsize-unpatched          | 00:00:02.673494
>  select-medsize-unpatched          | 00:00:02.669645
>  select-medsize-unpatched          | 00:00:02.666278
>  vacuum_clean-medsize-unpatched    | 00:00:01.091356
>  vacuum_clean-medsize-unpatched    | 00:00:01.923138
>  vacuum_clean-medsize-unpatched    | 00:00:01.917213
>  vacuum_clean-medsize-unpatched    | 00:00:01.917333
>  vacuum_hintbits-medsize-unpatched | 00:00:01.683718
>  vacuum_hintbits-medsize-unpatched | 00:00:01.864003
>  vacuum_hintbits-medsize-unpatched | 00:00:03.186596
>  vacuum_hintbits-medsize-unpatched | 00:00:02.16494
>  copy-medsize-patched              | 00:02:35.113501
>  copy-medsize-patched              | 00:02:25.269866
>  copy-medsize-patched              | 00:02:31.881089
>  copy_nowal-medsize-patched        | 00:01:00.254633
>  copy_nowal-medsize-patched        | 00:01:04.630687
>  copy_nowal-medsize-patched        | 00:01:03.729128
>  select-medsize-patched            | 00:00:03.201837
>  select-medsize-patched            | 00:00:01.332975
>  select-medsize-patched            | 00:00:01.33014
>  select-medsize-patched            | 00:00:01.332392
>  select-medsize-patched            | 00:00:01.333498
>  select-medsize-patched            | 00:00:01.332692
>  vacuum_clean-medsize-patched      | 00:00:01.140189
>  vacuum_clean-medsize-patched      | 00:00:01.062762
>  vacuum_clean-medsize-patched      | 00:00:01.062402
>  vacuum_clean-medsize-patched      | 00:00:01.07113
>  vacuum_hintbits-medsize-patched   | 00:00:17.865446
>  vacuum_hintbits-medsize-patched   | 00:00:15.162064
>  vacuum_hintbits-medsize-patched   | 00:00:01.704651
>  vacuum_hintbits-medsize-patched   | 00:00:02.671651
> 
> This looks good to me, except for a glitch in the last 
> vacuum_hintbits tests. Selects and vacuums benefit significantly, as 
> does non-WAL-logged copy.
> 
> Not shown here, but I ran tests earlier with vacuum on a table that 
> actually had dead tuples to remove. In that test the patched 
> version really shone, reducing the runtime to ~1/6th. That was the 
> original motivation of this patch: not having to do a WAL flush on every 
> page in the 2nd phase of vacuum.
> 
> Test script attached. To use it:
> 
> 1. Edit testscript.sh. Change BIGTABLESIZE.
> 2. Start postmaster
> 3. Run the script, giving a test label as an argument. For example: 
> "./testscript.sh bigtable-patched"
> 
> Attached is also the patch I used for the tests.
> 
> I would appreciate it if people would download the patch and the script 
> and repeat the tests on different hardware. I'm particularly interested 
> in testing on a box with good I/O hardware where selects on unpatched 
> PostgreSQL are bottlenecked by CPU.
> 
> Barring any surprises I'm going to fix the remaining issue and submit a 
> final patch, probably over the weekend.
> 
> (*) The issue with this patch is that if the buffer cache is completely 
> filled with dirty buffers that need a WAL flush to evict, the buffer 
> ring code will get into an infinite loop trying to find one that doesn't 
> need a WAL flush. Should be simple to fix.
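> 
> To illustrate the shape of one possible fix (a toy, self-contained sketch 
> with made-up names, not the patch code): remember whether the victim 
> buffer really came from the ring and let StrategyRejectBuffer reject only 
> those buffers. A victim that was just handed out by the shared clock sweep 
> is then always accepted, even if it needs a WAL flush, so the loop always 
> terminates:
> 
> #include <stdbool.h>
> #include <stdio.h>
> 
> #define NBUFFERS        16
> #define RING_SIZE       4
> #define BUF_ID_NOT_SET  (-1)
> 
> static int  ring[RING_SIZE];
> static int  ring_cur = 0;
> static int  sweep_hand = RING_SIZE;   /* clock hand; buffers 0..RING_SIZE-1
>                                        * are assumed to sit in the ring */
> static bool came_from_ring = false;   /* the extra state this sketch adds */
> 
> /* Pretend the whole cache is dirty and needs a WAL flush to evict. */
> static bool
> needs_wal_flush(int buf)
> {
>     (void) buf;
>     return true;
> }
> 
> /* Simplified StrategyGetBuffer: try the ring first, else fall back to the
>  * shared clock sweep and remember the new buffer in the empty ring slot. */
> static int
> get_victim(void)
> {
>     int     buf;
> 
>     if (++ring_cur >= RING_SIZE)
>         ring_cur = 0;
>     if (ring[ring_cur] != BUF_ID_NOT_SET)
>     {
>         came_from_ring = true;
>         return ring[ring_cur];
>     }
>     came_from_ring = false;
>     buf = sweep_hand++ % NBUFFERS;
>     ring[ring_cur] = buf;
>     return buf;
> }
> 
> /* Simplified StrategyRejectBuffer: only a buffer that came from the ring
>  * may be rejected, so the victim loop cannot spin forever when every buffer
>  * in the cache needs a WAL flush. */
> static bool
> reject_victim(void)
> {
>     if (!came_from_ring)
>         return false;
>     ring[ring_cur] = BUF_ID_NOT_SET;  /* drop it from the ring */
>     return true;
> }
> 
> int
> main(void)
> {
>     int     i;
>     int     buf;
> 
>     for (i = 0; i < RING_SIZE; i++)
>         ring[i] = i;            /* the scan has already filled the ring */
> 
>     /* Victim-selection loop, as in BufferAlloc() */
>     for (;;)
>     {
>         buf = get_victim();
>         if (needs_wal_flush(buf) && reject_victim())
>         {
>             printf("rejected ring buffer %d\n", buf);
>             continue;           /* pick another victim */
>         }
>         break;                  /* would flush WAL here and reuse the buffer */
>     }
>     printf("settled on buffer %d from the clock sweep\n", buf);
>     return 0;
> }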
> 
> 
> ------------------------------------------------------------------------
> 
> Index: src/backend/access/heap/heapam.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/heapam.c,v
> retrieving revision 1.232
> diff -c -r1.232 heapam.c
> *** src/backend/access/heap/heapam.c	8 Apr 2007 01:26:27 -0000	1.232
> --- src/backend/access/heap/heapam.c	16 May 2007 11:35:14 -0000
> ***************
> *** 83,88 ****
> --- 83,96 ----
>   	 */
>   	scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
>   
> + 	/* A scan on a table smaller than shared_buffers is treated like random
> + 	 * access, but bigger scans should use the bulk read replacement policy.
> + 	 */
> + 	if (scan->rs_nblocks > NBuffers)
> + 		scan->rs_accesspattern = AP_BULKREAD;
> + 	else
> + 		scan->rs_accesspattern = AP_NORMAL;
> + 
>   	scan->rs_inited = false;
>   	scan->rs_ctup.t_data = NULL;
>   	ItemPointerSetInvalid(&scan->rs_ctup.t_self);
> ***************
> *** 123,133 ****
> --- 131,146 ----
>   
>   	Assert(page < scan->rs_nblocks);
>   
> + 	/* Read the page with the right strategy */
> + 	SetAccessPattern(scan->rs_accesspattern);
> + 
>   	scan->rs_cbuf = ReleaseAndReadBuffer(scan->rs_cbuf,
>   										 scan->rs_rd,
>   										 page);
>   	scan->rs_cblock = page;
>   
> + 	SetAccessPattern(AP_NORMAL);
> + 
>   	if (!scan->rs_pageatatime)
>   		return;
>   
> Index: src/backend/access/transam/xlog.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlog.c,v
> retrieving revision 1.268
> diff -c -r1.268 xlog.c
> *** src/backend/access/transam/xlog.c	30 Apr 2007 21:01:52 -0000	1.268
> --- src/backend/access/transam/xlog.c	15 May 2007 16:23:30 -0000
> ***************
> *** 1668,1673 ****
> --- 1668,1700 ----
>   }
>   
>   /*
> +  * Returns true if 'record' hasn't been flushed to disk yet.
> +  */
> + bool
> + XLogNeedsFlush(XLogRecPtr record)
> + {
> + 	/* Quick exit if already known flushed */
> + 	if (XLByteLE(record, LogwrtResult.Flush))
> + 		return false;
> + 
> + 	/* read LogwrtResult and update local state */
> + 	{
> + 		/* use volatile pointer to prevent code rearrangement */
> + 		volatile XLogCtlData *xlogctl = XLogCtl;
> + 
> + 		SpinLockAcquire(&xlogctl->info_lck);
> + 		LogwrtResult = xlogctl->LogwrtResult;
> + 		SpinLockRelease(&xlogctl->info_lck);
> + 	}
> + 
> + 	/* check again */
> + 	if (XLByteLE(record, LogwrtResult.Flush))
> + 		return false;
> + 
> + 	return true;
> + }
> + 
> + /*
>    * Ensure that all XLOG data through the given position is flushed to disk.
>    *
>    * NOTE: this differs from XLogWrite mainly in that the WALWriteLock is not
> Index: src/backend/commands/copy.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/copy.c,v
> retrieving revision 1.283
> diff -c -r1.283 copy.c
> *** src/backend/commands/copy.c	27 Apr 2007 22:05:46 -0000	1.283
> --- src/backend/commands/copy.c	15 May 2007 17:05:29 -0000
> ***************
> *** 1876,1881 ****
> --- 1876,1888 ----
>   	nfields = file_has_oids ? (attr_count + 1) : attr_count;
>   	field_strings = (char **) palloc(nfields * sizeof(char *));
>   
> + 	/* Use the special COPY buffer replacement strategy if WAL-logging
> + 	 * is enabled. If it's not, the pages we're writing are dirty but
> + 	 * don't need a WAL flush to write out, so the BULKREAD strategy
> + 	 * is more suitable.
> + 	 */
> + 	SetAccessPattern(use_wal ? AP_COPY : AP_BULKREAD);
> + 
>   	/* Initialize state variables */
>   	cstate->fe_eof = false;
>   	cstate->eol_type = EOL_UNKNOWN;
> ***************
> *** 2161,2166 ****
> --- 2168,2176 ----
>   							cstate->filename)));
>   	}
>   
> + 	/* Reset buffer replacement strategy */
> + 	SetAccessPattern(AP_NORMAL);
> + 
>   	/* 
>   	 * If we skipped writing WAL, then we need to sync the heap (but not
>   	 * indexes since those use WAL anyway)
> Index: src/backend/commands/vacuum.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/vacuum.c,v
> retrieving revision 1.350
> diff -c -r1.350 vacuum.c
> *** src/backend/commands/vacuum.c	16 Apr 2007 18:29:50 -0000	1.350
> --- src/backend/commands/vacuum.c	15 May 2007 17:06:18 -0000
> ***************
> *** 421,431 ****
>   				 * Tell the buffer replacement strategy that vacuum is causing
>   				 * the IO
>   				 */
> ! 				StrategyHintVacuum(true);
>   
>   				analyze_rel(relid, vacstmt);
>   
> ! 				StrategyHintVacuum(false);
>   
>   				if (use_own_xacts)
>   					CommitTransactionCommand();
> --- 421,431 ----
>   				 * Tell the buffer replacement strategy that vacuum is causing
>   				 * the IO
>   				 */
> ! 				SetAccessPattern(AP_VACUUM);
>   
>   				analyze_rel(relid, vacstmt);
>   
> ! 				SetAccessPattern(AP_NORMAL);
>   
>   				if (use_own_xacts)
>   					CommitTransactionCommand();
> ***************
> *** 442,448 ****
>   		/* Make sure cost accounting is turned off after error */
>   		VacuumCostActive = false;
>   		/* And reset buffer replacement strategy, too */
> ! 		StrategyHintVacuum(false);
>   		PG_RE_THROW();
>   	}
>   	PG_END_TRY();
> --- 442,448 ----
>   		/* Make sure cost accounting is turned off after error */
>   		VacuumCostActive = false;
>   		/* And reset buffer replacement strategy, too */
> ! 		SetAccessPattern(AP_NORMAL);
>   		PG_RE_THROW();
>   	}
>   	PG_END_TRY();
> ***************
> *** 1088,1094 ****
>   	 * Tell the cache replacement strategy that vacuum is causing all
>   	 * following IO
>   	 */
> ! 	StrategyHintVacuum(true);
>   
>   	/*
>   	 * Do the actual work --- either FULL or "lazy" vacuum
> --- 1088,1094 ----
>   	 * Tell the cache replacement strategy that vacuum is causing all
>   	 * following IO
>   	 */
> ! 	SetAccessPattern(AP_VACUUM);
>   
>   	/*
>   	 * Do the actual work --- either FULL or "lazy" vacuum
> ***************
> *** 1098,1104 ****
>   	else
>   		lazy_vacuum_rel(onerel, vacstmt);
>   
> ! 	StrategyHintVacuum(false);
>   
>   	/* all done with this class, but hold lock until commit */
>   	relation_close(onerel, NoLock);
> --- 1098,1104 ----
>   	else
>   		lazy_vacuum_rel(onerel, vacstmt);
>   
> ! 	SetAccessPattern(AP_NORMAL);
>   
>   	/* all done with this class, but hold lock until commit */
>   	relation_close(onerel, NoLock);
> Index: src/backend/storage/buffer/README
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/README,v
> retrieving revision 1.11
> diff -c -r1.11 README
> *** src/backend/storage/buffer/README	23 Jul 2006 03:07:58 -0000	1.11
> --- src/backend/storage/buffer/README	16 May 2007 11:43:11 -0000
> ***************
> *** 152,159 ****
>   a field to show which backend is doing its I/O).
>   
>   
> ! Buffer replacement strategy
> ! ---------------------------
>   
>   There is a "free list" of buffers that are prime candidates for replacement.
>   In particular, buffers that are completely free (contain no valid page) are
> --- 152,159 ----
>   a field to show which backend is doing its I/O).
>   
>   
> ! Normal buffer replacement strategy
> ! ----------------------------------
>   
>   There is a "free list" of buffers that are prime candidates for replacement.
>   In particular, buffers that are completely free (contain no valid page) are
> ***************
> *** 199,221 ****
>   have to give up and try another buffer.  This however is not a concern
>   of the basic select-a-victim-buffer algorithm.)
>   
> - A special provision is that while running VACUUM, a backend does not
> - increment the usage count on buffers it accesses.  In fact, if ReleaseBuffer
> - sees that it is dropping the pin count to zero and the usage count is zero,
> - then it appends the buffer to the tail of the free list.  (This implies that
> - VACUUM, but only VACUUM, must take the BufFreelistLock during ReleaseBuffer;
> - this shouldn't create much of a contention problem.)  This provision
> - encourages VACUUM to work in a relatively small number of buffers rather
> - than blowing out the entire buffer cache.  It is reasonable since a page
> - that has been touched only by VACUUM is unlikely to be needed again soon.
> - 
> - Since VACUUM usually requests many pages very fast, the effect of this is that
> - it will get back the very buffers it filled and possibly modified on the next
> - call and will therefore do its work in a few shared memory buffers, while
> - being able to use whatever it finds in the cache already.  This also implies
> - that most of the write traffic caused by a VACUUM will be done by the VACUUM
> - itself and not pushed off onto other processes.
>   
>   
>   Background writer's processing
>   ------------------------------
> --- 199,243 ----
>   have to give up and try another buffer.  This however is not a concern
>   of the basic select-a-victim-buffer algorithm.)
>   
>   
> + Buffer ring replacement strategy
> + ---------------------------------
> + 
> + When running a query that needs to access a large number of pages, like VACUUM,
> + COPY, or a large sequential scan, a different strategy is used.  A page that
> + has been touched only by such a scan is unlikely to be needed again soon, so
> + instead of running the normal clock sweep algorithm and blowing out the entire
> + buffer cache, a small ring of buffers is allocated using the normal clock sweep
> + algorithm and those buffers are reused for the whole scan.  This also implies
> + that most of the write traffic caused by such a statement will be done by the
> + backend itself and not pushed off onto other processes.
> + 
> + The size of the ring used depends on the kind of scan:
> + 
> + For sequential scans, a small 256 KB ring is used. That's small enough to fit
> + in L2 cache, which makes transferring pages from OS cache to shared buffer
> + cache efficient. Even less would often be enough, but the ring must be big
> + enough to accommodate all pages in the scan that are pinned concurrently. 
> + 256 KB should also be enough to leave a small cache trail for other backends to
> + join in a synchronized seq scan. If a buffer is dirtied and LSN set, the buffer
> + is removed from the ring and a replacement buffer is chosen using the normal
> + replacement strategy. In a scan that modifies every page in the scan, like a
> + bulk UPDATE or DELETE, the buffers in the ring will always be dirtied and the
> + ring strategy effectively degrades to the normal strategy.
> + 
> + VACUUM uses a 256 KB ring like sequential scans, but dirty pages are not
> + removed from the ring. WAL is flushed instead to allow reuse of the buffers.
> + Before introducing the buffer ring strategy in 8.3, buffers were put to the
> + freelist, which was effectively a buffer ring of 1 buffer.
> + 
> + COPY behaves like VACUUM, but a much larger ring is used. The ring size is
> + chosen to be twice the WAL segment size. This avoids polluting the buffer cache
> + like the clock sweep would do, and using a ring larger than WAL segment size
> + avoids having to do any extra WAL flushes, since a WAL segment will always be
> + filled, forcing a WAL flush, before looping through the buffer ring and bumping
> + into a buffer that would force a WAL flush. However, for non-WAL-logged COPY
> + operations the smaller 256 KB ring is used because WAL flushes are not needed
> + to write the buffers.
>   
>   Background writer's processing
>   ------------------------------
> Index: src/backend/storage/buffer/bufmgr.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
> retrieving revision 1.218
> diff -c -r1.218 bufmgr.c
> *** src/backend/storage/buffer/bufmgr.c	2 May 2007 23:34:48 -0000	1.218
> --- src/backend/storage/buffer/bufmgr.c	16 May 2007 12:34:10 -0000
> ***************
> *** 419,431 ****
>   	/* Loop here in case we have to try another victim buffer */
>   	for (;;)
>   	{
>   		/*
>   		 * Select a victim buffer.	The buffer is returned with its header
>   		 * spinlock still held!  Also the BufFreelistLock is still held, since
>   		 * it would be bad to hold the spinlock while possibly waking up other
>   		 * processes.
>   		 */
> ! 		buf = StrategyGetBuffer();
>   
>   		Assert(buf->refcount == 0);
>   
> --- 419,433 ----
>   	/* Loop here in case we have to try another victim buffer */
>   	for (;;)
>   	{
> + 		bool lock_held;
> + 
>   		/*
>   		 * Select a victim buffer.	The buffer is returned with its header
>   		 * spinlock still held!  Also the BufFreelistLock is still held, since
>   		 * it would be bad to hold the spinlock while possibly waking up other
>   		 * processes.
>   		 */
> ! 		buf = StrategyGetBuffer(&lock_held);
>   
>   		Assert(buf->refcount == 0);
>   
> ***************
> *** 436,442 ****
>   		PinBuffer_Locked(buf);
>   
>   		/* Now it's safe to release the freelist lock */
> ! 		LWLockRelease(BufFreelistLock);
>   
>   		/*
>   		 * If the buffer was dirty, try to write it out.  There is a race
> --- 438,445 ----
>   		PinBuffer_Locked(buf);
>   
>   		/* Now it's safe to release the freelist lock */
> ! 		if (lock_held)
> ! 			LWLockRelease(BufFreelistLock);
>   
>   		/*
>   		 * If the buffer was dirty, try to write it out.  There is a race
> ***************
> *** 464,469 ****
> --- 467,489 ----
>   			 */
>   			if (LWLockConditionalAcquire(buf->content_lock, LW_SHARED))
>   			{
> + 				/* In BULKREAD-mode, check if a WAL flush would be needed to
> + 				 * evict this buffer. If so, ask the replacement strategy if
> + 				 * we should go ahead and do it or choose another victim.
> + 				 */
> + 				if (active_access_pattern == AP_BULKREAD)
> + 				{
> + 					if (XLogNeedsFlush(BufferGetLSN(buf)))
> + 					{
> + 						if (StrategyRejectBuffer(buf))
> + 						{
> + 							LWLockRelease(buf->content_lock);
> + 							UnpinBuffer(buf, true, false);
> + 							continue;
> + 						}
> + 					}
> + 				}
> + 
>   				FlushBuffer(buf, NULL);
>   				LWLockRelease(buf->content_lock);
>   			}
> ***************
> *** 925,932 ****
>   	PrivateRefCount[b]--;
>   	if (PrivateRefCount[b] == 0)
>   	{
> - 		bool		immed_free_buffer = false;
> - 
>   		/* I'd better not still hold any locks on the buffer */
>   		Assert(!LWLockHeldByMe(buf->content_lock));
>   		Assert(!LWLockHeldByMe(buf->io_in_progress_lock));
> --- 945,950 ----
> ***************
> *** 940,956 ****
>   		/* Update buffer usage info, unless this is an internal access */
>   		if (normalAccess)
>   		{
> ! 			if (!strategy_hint_vacuum)
>   			{
> ! 				if (buf->usage_count < BM_MAX_USAGE_COUNT)
> ! 					buf->usage_count++;
>   			}
>   			else
> ! 			{
> ! 				/* VACUUM accesses don't bump usage count, instead... */
> ! 				if (buf->refcount == 0 && buf->usage_count == 0)
> ! 					immed_free_buffer = true;
> ! 			}
>   		}
>   
>   		if ((buf->flags & BM_PIN_COUNT_WAITER) &&
> --- 958,975 ----
>   		/* Update buffer usage info, unless this is an internal access */
>   		if (normalAccess)
>   		{
> ! 			if (active_access_pattern != AP_NORMAL)
>   			{
> ! 				/* We don't want large one-off scans like vacuum to inflate 
> ! 				 * the usage_count. We do want to set it to 1, though, to keep
> ! 				 * other backends from hijacking it from the buffer ring.
> ! 				 */
> ! 				if (buf->usage_count == 0)
> ! 					buf->usage_count = 1;
>   			}
>   			else
> ! 			if (buf->usage_count < BM_MAX_USAGE_COUNT)
> ! 				buf->usage_count++;
>   		}
>   
>   		if ((buf->flags & BM_PIN_COUNT_WAITER) &&
> ***************
> *** 965,978 ****
>   		}
>   		else
>   			UnlockBufHdr(buf);
> - 
> - 		/*
> - 		 * If VACUUM is releasing an otherwise-unused buffer, send it to the
> - 		 * freelist for near-term reuse.  We put it at the tail so that it
> - 		 * won't be used before any invalid buffers that may exist.
> - 		 */
> - 		if (immed_free_buffer)
> - 			StrategyFreeBuffer(buf, false);
>   	}
>   }
>   
> --- 984,989 ----
> Index: src/backend/storage/buffer/freelist.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/freelist.c,v
> retrieving revision 1.58
> diff -c -r1.58 freelist.c
> *** src/backend/storage/buffer/freelist.c	5 Jan 2007 22:19:37 -0000	1.58
> --- src/backend/storage/buffer/freelist.c	17 May 2007 16:12:56 -0000
> ***************
> *** 18,23 ****
> --- 18,25 ----
>   #include "storage/buf_internals.h"
>   #include "storage/bufmgr.h"
>   
> + #include "utils/memutils.h"
> + 
>   
>   /*
>    * The shared freelist control information.
> ***************
> *** 39,47 ****
>   /* Pointers to shared state */
>   static BufferStrategyControl *StrategyControl = NULL;
>   
> ! /* Backend-local state about whether currently vacuuming */
> ! bool		strategy_hint_vacuum = false;
>   
>   
>   /*
>    * StrategyGetBuffer
> --- 41,53 ----
>   /* Pointers to shared state */
>   static BufferStrategyControl *StrategyControl = NULL;
>   
> ! /* Currently active access pattern hint. */
> ! AccessPattern active_access_pattern = AP_NORMAL;
>   
> + /* prototypes for internal functions */
> + static volatile BufferDesc *GetBufferFromRing(void);
> + static void PutBufferToRing(volatile BufferDesc *buf);
> + static void InitRing(void);
>   
>   /*
>    * StrategyGetBuffer
> ***************
> *** 51,67 ****
>    *	the selected buffer must not currently be pinned by anyone.
>    *
>    *	To ensure that no one else can pin the buffer before we do, we must
> !  *	return the buffer with the buffer header spinlock still held.  That
> !  *	means that we return with the BufFreelistLock still held, as well;
> !  *	the caller must release that lock once the spinlock is dropped.
>    */
>   volatile BufferDesc *
> ! StrategyGetBuffer(void)
>   {
>   	volatile BufferDesc *buf;
>   	int			trycounter;
>   
>   	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
>   
>   	/*
>   	 * Try to get a buffer from the freelist.  Note that the freeNext fields
> --- 57,89 ----
>    *	the selected buffer must not currently be pinned by anyone.
>    *
>    *	To ensure that no one else can pin the buffer before we do, we must
> !  *	return the buffer with the buffer header spinlock still held.  If
> !  *	*lock_held is set at return, we return with the BufFreelistLock still
> !  *	held, as well;	the caller must release that lock once the spinlock is
> !  *	dropped.
>    */
>   volatile BufferDesc *
> ! StrategyGetBuffer(bool *lock_held)
>   {
>   	volatile BufferDesc *buf;
>   	int			trycounter;
>   
> + 	/* Get a buffer from the ring if we're doing a bulk scan */
> + 	if (active_access_pattern != AP_NORMAL)
> + 	{
> + 		buf = GetBufferFromRing();
> + 		if (buf != NULL)
> + 		{
> + 			*lock_held = false;
> + 			return buf;
> + 		}
> + 	}
> + 
> + 	/*
> + 	 * If our selected buffer wasn't available, pick another...
> + 	 */
>   	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
> + 	*lock_held = true;
>   
>   	/*
>   	 * Try to get a buffer from the freelist.  Note that the freeNext fields
> ***************
> *** 86,96 ****
>   		 */
>   		LockBufHdr(buf);
>   		if (buf->refcount == 0 && buf->usage_count == 0)
>   			return buf;
>   		UnlockBufHdr(buf);
>   	}
>   
> ! 	/* Nothing on the freelist, so run the "clock sweep" algorithm */
>   	trycounter = NBuffers;
>   	for (;;)
>   	{
> --- 108,122 ----
>   		 */
>   		LockBufHdr(buf);
>   		if (buf->refcount == 0 && buf->usage_count == 0)
> + 		{
> + 			if (active_access_pattern != AP_NORMAL)
> + 				PutBufferToRing(buf);
>   			return buf;
> + 		}
>   		UnlockBufHdr(buf);
>   	}
>   
> ! 	/* Nothing on the freelist, so run the shared "clock sweep" algorithm */
>   	trycounter = NBuffers;
>   	for (;;)
>   	{
> ***************
> *** 105,111 ****
> --- 131,141 ----
>   		 */
>   		LockBufHdr(buf);
>   		if (buf->refcount == 0 && buf->usage_count == 0)
> + 		{
> + 			if (active_access_pattern != AP_NORMAL)
> + 				PutBufferToRing(buf);
>   			return buf;
> + 		}
>   		if (buf->usage_count > 0)
>   		{
>   			buf->usage_count--;
> ***************
> *** 191,204 ****
>   }
>   
>   /*
> !  * StrategyHintVacuum -- tell us whether VACUUM is active
>    */
>   void
> ! StrategyHintVacuum(bool vacuum_active)
>   {
> ! 	strategy_hint_vacuum = vacuum_active;
> ! }
>   
>   
>   /*
>    * StrategyShmemSize
> --- 221,245 ----
>   }
>   
>   /*
> !  * SetAccessPattern -- Sets the active access pattern hint
> !  *
> !  * Caller is responsible for resetting the hint to AP_NORMAL after the bulk
> !  * operation is done. It's ok to switch repeatedly between AP_NORMAL and one of
> !  * the other strategies, for example in a query with one large sequential scan
> !  * nested loop joined to an index scan. Index tuples should be fetched with the
> !  * normal strategy and the pages from the seq scan should be read in with the
> !  * AP_BULKREAD strategy. The ring won't be affected by such switching, however
> !  * switching to an access pattern with different ring size will invalidate the
> !  * old ring.
>    */
>   void
> ! SetAccessPattern(AccessPattern new_pattern)
>   {
> ! 	active_access_pattern = new_pattern;
>   
> + 	if (active_access_pattern != AP_NORMAL)
> + 		InitRing();
> + }
>   
>   /*
>    * StrategyShmemSize
> ***************
> *** 274,276 ****
> --- 315,498 ----
>   	else
>   		Assert(!init);
>   }
> + 
> + /* ----------------------------------------------------------------
> +  *				Backend-private buffer ring management
> +  * ----------------------------------------------------------------
> +  */
> + 
> + /*
> +  * Ring sizes for different access patterns. See README for the rationale
> +  * of these.
> +  */
> + #define BULKREAD_RING_SIZE	(256 * 1024 / BLCKSZ)
> + #define VACUUM_RING_SIZE	(256 * 1024 / BLCKSZ)
> + #define COPY_RING_SIZE		Min(NBuffers / 8, (XLOG_SEG_SIZE / BLCKSZ) * 2)
> + 
> + /*
> +  * BufferRing is an array of buffer ids, and RingSize its size in number of
> +  * elements. It's allocated in TopMemoryContext the first time it's needed.
> +  */
> + static int *BufferRing = NULL;
> + static int RingSize = 0;
> + 
> + /* Index of the "current" slot in the ring. It's advanced every time a buffer
> +  * is handed out from the ring with GetBufferFromRing and it points to the 
> +  * last buffer returned from the ring. RingCurSlot + 1 is the next victim
> +  * GetBufferFromRing will hand out.
> +  */
> + static int RingCurSlot = 0;
> + 
> + /* magic value to mark empty slots in the ring */
> + #define BUF_ID_NOT_SET -1
> + 
> + 
> + /*
> +  * GetBufferFromRing -- returns a buffer from the ring, or NULL if the
> +  *		ring is empty.
> +  *
> +  * The bufhdr spin lock is held on the returned buffer.
> +  */
> + static volatile BufferDesc *
> + GetBufferFromRing(void)
> + {
> + 	volatile BufferDesc *buf;
> + 
> + 	/* ring should be initialized by now */
> + 	Assert(RingSize > 0 && BufferRing != NULL);
> + 
> + 	/* Run private "clock cycle" */
> + 	if (++RingCurSlot >= RingSize)
> + 		RingCurSlot = 0;
> + 
> + 	/*
> + 	 * If that slot hasn't been filled yet, tell the caller to allocate
> + 	 * a new buffer with the normal allocation strategy. He will then
> + 	 * fill this slot by calling PutBufferToRing with the new buffer.
> + 	 */
> + 	if (BufferRing[RingCurSlot] == BUF_ID_NOT_SET)
> + 		return NULL;
> + 
> + 	buf = &BufferDescriptors[BufferRing[RingCurSlot]];
> + 
> + 	/*
> + 	 * If the buffer is pinned we cannot use it under any circumstances.
> + 	 * If usage_count == 0 then the buffer is fair game. 
> + 	 *
> + 	 * We also choose this buffer if usage_count == 1. Strictly, this
> + 	 * might sometimes be the wrong thing to do, but we rely on the high
> + 	 * probability that it was this process that last touched the buffer.
> + 	 * If it wasn't, we'll choose a suboptimal victim, but  it shouldn't
> + 	 * make any difference in the big scheme of things.
> + 	 *
> + 	 */
> + 	LockBufHdr(buf);
> + 	if (buf->refcount == 0 && buf->usage_count <= 1)
> + 		return buf;
> + 	UnlockBufHdr(buf);
> + 
> + 	return NULL;
> + }
> + 
> + /*
> +  * PutBufferToRing -- adds a buffer to the buffer ring
> +  *
> +  * Caller must hold the buffer header spinlock on the buffer.
> +  */
> + static void
> + PutBufferToRing(volatile BufferDesc *buf)
> + {
> + 	/* ring should be initialized by now */
> + 	Assert(RingSize > 0 && BufferRing != NULL);
> + 
> + 	if (BufferRing[RingCurSlot] == BUF_ID_NOT_SET)
> + 		BufferRing[RingCurSlot] = buf->buf_id;
> + }
> + 
> + /*
> +  * Initializes a ring buffer with correct size for the currently
> +  * active strategy. Does nothing if the ring already has the right size.
> +  */
> + static void
> + InitRing(void)
> + {
> + 	int new_size;
> + 	int old_size = RingSize;
> + 	int i;
> + 	MemoryContext oldcxt;
> + 
> + 	/* Determine new size */
> + 
> + 	switch(active_access_pattern)
> + 	{
> + 		case AP_BULKREAD:
> + 			new_size = BULKREAD_RING_SIZE;
> + 			break;
> + 		case AP_COPY:
> + 			new_size = COPY_RING_SIZE;
> + 			break;
> + 		case AP_VACUUM:
> + 			new_size = VACUUM_RING_SIZE;
> + 			break;
> + 		default:
> + 			elog(ERROR, "unexpected buffer cache strategy %d", 
> + 				 active_access_pattern);
> + 			return; /* keep compiler happy */
> + 	}
> + 
> + 	/*
> + 	 * Seq scans set and reset the strategy on every page, so we better exit
> + 	 * quickly if no change in size is needed.
> + 	 */
> + 	if (new_size == old_size)
> + 		return;
> + 
> + 	/* Allocate array */
> + 
> + 	oldcxt = MemoryContextSwitchTo(TopMemoryContext);
> + 
> + 	if (old_size == 0)
> + 	{
> + 		Assert(BufferRing == NULL);
> + 		BufferRing = palloc(new_size * sizeof(int));
> + 	}
> + 	else
> + 		BufferRing = repalloc(BufferRing, new_size * sizeof(int));
> + 
> + 	MemoryContextSwitchTo(oldcxt);
> + 
> + 	for(i = 0; i < new_size; i++)
> + 		BufferRing[i] = BUF_ID_NOT_SET;
> + 
> + 	RingCurSlot = 0;
> + 	RingSize = new_size;
> + }
> + 
> + /*
> +  * Buffer manager calls this function in AP_BULKREAD mode when the
> +  * buffer handed to it turns out to need a WAL flush to write out. This
> +  * gives the strategy a second chance to choose another victim.
> +  *
> +  * Returns true if buffer manager should ask for a new victim, and false
> +  * if WAL should be flushed and this buffer used.
> +  */
> + bool
> + StrategyRejectBuffer(volatile BufferDesc *buf)
> + {
> + 	Assert(RingSize > 0);
> + 
> + 	if (BufferRing[RingCurSlot] == buf->buf_id)
> + 	{
> + 		BufferRing[RingCurSlot] = BUF_ID_NOT_SET;
> + 		return true;
> + 	}
> + 	else
> + 	{
> + 		/* Apparently the buffer didn't come from the ring. We don't want to
> + 		 * mess with how the clock sweep works; in the worst case there are no
> + 		 * buffers in the buffer cache that can be reused without a WAL flush,
> + 		 * and we'd get into an endless loop trying.
> + 		 */
> + 		return false;
> + 	}
> + }
> Index: src/include/access/relscan.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/relscan.h,v
> retrieving revision 1.52
> diff -c -r1.52 relscan.h
> *** src/include/access/relscan.h	20 Jan 2007 18:43:35 -0000	1.52
> --- src/include/access/relscan.h	15 May 2007 17:01:31 -0000
> ***************
> *** 28,33 ****
> --- 28,34 ----
>   	ScanKey		rs_key;			/* array of scan key descriptors */
>   	BlockNumber rs_nblocks;		/* number of blocks to scan */
>   	bool		rs_pageatatime; /* verify visibility page-at-a-time? */
> + 	AccessPattern rs_accesspattern; /* access pattern to use for reads */
>   
>   	/* scan current state */
>   	bool		rs_inited;		/* false = scan not init'd yet */
> Index: src/include/access/xlog.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/xlog.h,v
> retrieving revision 1.76
> diff -c -r1.76 xlog.h
> *** src/include/access/xlog.h	5 Jan 2007 22:19:51 -0000	1.76
> --- src/include/access/xlog.h	14 May 2007 21:22:40 -0000
> ***************
> *** 151,156 ****
> --- 151,157 ----
>   
>   extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
>   extern void XLogFlush(XLogRecPtr RecPtr);
> + extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
>   
>   extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record);
>   extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec);
> Index: src/include/storage/buf_internals.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/buf_internals.h,v
> retrieving revision 1.89
> diff -c -r1.89 buf_internals.h
> *** src/include/storage/buf_internals.h	5 Jan 2007 22:19:57 -0000	1.89
> --- src/include/storage/buf_internals.h	15 May 2007 17:07:59 -0000
> ***************
> *** 16,21 ****
> --- 16,22 ----
>   #define BUFMGR_INTERNALS_H
>   
>   #include "storage/buf.h"
> + #include "storage/bufmgr.h"
>   #include "storage/lwlock.h"
>   #include "storage/shmem.h"
>   #include "storage/spin.h"
> ***************
> *** 168,174 ****
>   extern BufferDesc *LocalBufferDescriptors;
>   
>   /* in freelist.c */
> ! extern bool strategy_hint_vacuum;
>   
>   /* event counters in buf_init.c */
>   extern long int ReadBufferCount;
> --- 169,175 ----
>   extern BufferDesc *LocalBufferDescriptors;
>   
>   /* in freelist.c */
> ! extern AccessPattern active_access_pattern;
>   
>   /* event counters in buf_init.c */
>   extern long int ReadBufferCount;
> ***************
> *** 184,195 ****
>    */
>   
>   /* freelist.c */
> ! extern volatile BufferDesc *StrategyGetBuffer(void);
>   extern void StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head);
>   extern int	StrategySyncStart(void);
>   extern Size StrategyShmemSize(void);
>   extern void StrategyInitialize(bool init);
>   
>   /* buf_table.c */
>   extern Size BufTableShmemSize(int size);
>   extern void InitBufTable(int size);
> --- 185,198 ----
>    */
>   
>   /* freelist.c */
> ! extern volatile BufferDesc *StrategyGetBuffer(bool *lock_held);
>   extern void StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head);
>   extern int	StrategySyncStart(void);
>   extern Size StrategyShmemSize(void);
>   extern void StrategyInitialize(bool init);
>   
> + extern bool StrategyRejectBuffer(volatile BufferDesc *buf);
> + 
>   /* buf_table.c */
>   extern Size BufTableShmemSize(int size);
>   extern void InitBufTable(int size);
> Index: src/include/storage/bufmgr.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/bufmgr.h,v
> retrieving revision 1.103
> diff -c -r1.103 bufmgr.h
> *** src/include/storage/bufmgr.h	2 May 2007 23:18:03 -0000	1.103
> --- src/include/storage/bufmgr.h	15 May 2007 17:07:02 -0000
> ***************
> *** 48,53 ****
> --- 48,61 ----
>   #define BUFFER_LOCK_SHARE		1
>   #define BUFFER_LOCK_EXCLUSIVE	2
>   
> + typedef enum AccessPattern
> + {
> + 	AP_NORMAL,		/* Normal random access */
> +     AP_BULKREAD,	/* Large read-only scan (hint bit updates are ok) */
> +     AP_COPY,		/* Large updating scan, like COPY with WAL enabled */
> +     AP_VACUUM,		/* VACUUM */
> + } AccessPattern;
> + 
>   /*
>    * These routines are beaten on quite heavily, hence the macroization.
>    */
> ***************
> *** 157,162 ****
>   extern void AtProcExit_LocalBuffers(void);
>   
>   /* in freelist.c */
> ! extern void StrategyHintVacuum(bool vacuum_active);
>   
>   #endif
> --- 165,170 ----
>   extern void AtProcExit_LocalBuffers(void);
>   
>   /* in freelist.c */
> ! extern void SetAccessPattern(AccessPattern new_pattern);
>   
>   #endif


-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment: gendata.c
Description: text/x-csrc (406 bytes)
