Skip site navigation (1) Skip section navigation (2)

Implementation Proposal For Add Free Behind Capability For Large Sequential Scan

From: Amit Kumar Khare <skamit2000(at)yahoo(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: pgman(at)candle(dot)pha(dot)pa(dot)us, param_inder(at)yahoo(dot)com, kharebm(at)yahoo(dot)com
Subject: Implementation Proposal For Add Free Behind Capability For Large Sequential Scan
Date: 2002-02-23 13:30:36
Message-ID: (view raw, whole thread or download thread mbox)
Lists: pgsql-hackers
MP2   amitk  Normal  amitk  2  355  2002-02-23T13:22:00Z  2002-02-23T13:22:00Z  4  1910  10890  90  21  13373  9.2720    1 


Hereis at your disposal my implementation Idea of a TODO item titled AddFree-Behind Capability For Large Sequential Scans. I cordially invite yourcomments.


Part I:


Problem: The present default page replacement policy of PostgreSQLis Least Recently Used (LRU). This policy may be good for any other databaseaccess patterns but it is not efficient for Large Sequential Scan, since itmakes cache useless. For example:

Let shared buffer size be 1MB and the size of relation is 2MB. Lets assume sequential scan is applied onthis relation over and over again. The cache becomes useless because by the time the second read of the table happensthe first 1mb has already been forced out of the cache. 


The problemappeared in the TODO list of PostgreSQL under section CACHE titled:

AddFree-Behind Capability For Large Sequential Scans 


Solution: In hispaper Operating System Support For Database Management, ProfessorMichael Stonebraker identifies various database access patterns in INGRES(grand father of PostgreSQL) suggest that for sequential access to blocks,which will be cyclically rereferenced, the buffer management strategy should beto Toss Immediately, unless all n pages of relation can be kept in the cache.He further suggest that for buffer management to be of some use to applicationlike DBMS there should be some provision that it accepts advice from anapplication concerning replacement strategy.


Considering thesesuggestions as our baseline we have decided to implement Most Recently Used(MRU) cache replacement policy. Our task will be to identify a large sequentialscan and then advice the buffer manager to apply MRU for cache replacement.


Detailed Plan ofImplementation:


(I)               Identifyingthe query execution flow control


Goal of identifying the query execution flow: (i) This is necessary as we have to figure out a spotwhere we will identify that the block access is actually a large sequentialscan and from this point onward we think to send advice (in form of parameter )to buffer manager. 

(ii) Through this query execution trace our aim is toreach at one of the files listed below named freelist.c. This is the filewhere all the funcitions related to replacement policy are implemented. Henceour trace ends after reaching freelist.c.



(1)   Files Involved (Important files listed):


(1)    bin/psql/mainloop.c

(2)    bin/psql/common.c

(3)    src/interfaces/libpq/fe-exec.c

(4)    src/backend/tcop/postgres.c

(5)    src/backend/tcop/pquery.c

(6)    src/backend/executor/execMain.c

(7)    src/backend/executor/execProcnode.c

(8)    src/backend/access/heap/heapam.c

(9)    src/backend/utils/cache/relcache.c














(2)   Function Call Trace:



Notation Convention: (i) Arrow  will represent call

 (ii)Functions are listed as

<FileName>.<FunctionName> for example common. SendQuery i.e. SendQuery function is defined in file name common


(i)                Querysubmitted to psql


mainloop.Mainloop common.SendQuery

common.SendQuery  fe-exec.PQexec


-         SendQuery send the querystring to the backen and is a front door way to send a query.

-         PQexec sends the queryto the backend and package up the results.


(ii)              QueryExecution from postgres backend process


postgres. pg_exec_query_string  pquery.ProcessQuery

pquery.ProcessQuery  execMain.ExecutorStart 


(Thisroutine is called at the beginning of any execution of any


pquery.ProcessQuery  execMain.ExecutorRun 

(This is the main routine of the executor module. It accepts

 the query descriptor from the traffic copand executes the

 query plan.)

pquery.ProcessQuery  execMain.ExecutorEnd 

(Thisroutine must be called at the end of execution of any

query plan)



we will follow the executiontrace from execMain.ExecutorRun


execMain.ExecutorRun  execMain.ExecutePlan execProcnode.ExecProcNode  nodeSeqscan.ExecSeqScan  execScan.ExecScan 


ExecSeqScan, scans therelation sequentially and returns the next qualifying

tuple. It passes SeqNext as the accessmethod to ExecScan as one of the parameters.


SeqNext is actually afunction defined in the file nodeSeqscan this function is called through afunction pointer (slot=(*accessMtd) (node)) in an infinite forloop untill the slot returned by it contains NULL.


execScan.ExecScan  nodeSeqscan.SeqNext heapam.heap_getnext  heapam.heapgettup


heapgettup returnsimmediately if the relation is empty after making a call tobufmgr.ReleaseBuffer. Otherwise bufmgr.ReleaseAndReadBuffer is called.


bufmgr.ReleaseAndReadBuffer bufmgr.ReadBufferInternal


ReadBufferInternal callssmgr.smgrnblocks to get the accurate number of POSTGRES blocks in the suppliedrelation if caller asks for P_NEW.


bufmgr.ReadBufferInternal bufmgr.BufferAlloc


BufferAlloc creates a newbuffer tag and search (buf_table.BufTableLookup) it in the bufferpool whether it is already present in it. If found in the buffer pool then itpins the buffer and starts the buffer IO (bufmgr.StartBufferIO)


If it is not found in the buffer pool. It initializes a new buffer. Grabs one from thefree list by calling freelist.GetFreeBuffer which returns NULL if all thebuffers are already pinned.


BufferAlloc at times callsfreelist.UnpinBuffer for example if somebody pins the buffer while the currenttransaction was doing I/O. it clears its I/O flags, remove its pin and startall over again. (see comments and code in bufmgr.c from line 456 onward).


bufmgr.BufferAlloc  freelist.UnpinBuffer freelist.AddBufferToFreelist.



( 3) Analysis of Execution trace:


(i)                 freelist.AddBufferToFreelist is the right place to implement any bufferreplacement policy. Look at the comments of this function it clearly states 


In theory, this is the only routine that needs tobe changed if the buffer replacement strategy changes. Just change

themanner in which buffers are added to the freelist queue. Currently, they areadded on an LRU basis.


(ii)We feel bufmgr.ReadBufferInternalis the right place where we can guess whether the buffer access is LargeSequential Scan or not. We say this because 


(a)   We get a pointer (reln)to the Relation as one of the parameter through which we canget the size of the relation (size = reln -> rd_nblocks) 


(b)   Secondly if the blockNum (one of theparameters passesed to it) is equal to P_NEW (this contains InvalidBlockNumberaconstant defined in block.h, means grow the file to get a new page), ReadBufferInternalmakes a call to the function smgr.smgrnblocks to get accuratefile length.


Henceat this point we have accurate knowledge of size of the relation being scaned.


(iii) An external varable NBuffersdeclared in bufmgr.h and defined in globals.c,contains the size of shared buffer. 


(iv) Hence in this functionwe can make a guess whether the scan is large sequential or not by comparingrelation size and shared buffer size. If relations size is greater than shared buffers sizeclearly it is a large sequential scan.


(4) First ProbableImplementation Details:


(i)                 We are thinking ofdeclaring constant in bufmgr.h as following

#define LRU 1

#define MRU 2

(ii)               These constant can bepasses further upto freelist.AddBufferToFreelist. In thisfunction after making a check appropriate policy can be applied.

(iii)            Functionsaffected by the implementation:

Nearlyall the functions in the freelist.c will be changed to implement MRU inaddition two functions bufmgr.ReadBufferInternal and bufmgr.BufferAllocare most likely to be changed.


Second ProbableImplementation Details:


We assume that LRUs SharedFreeListcircular queue keeps all the least recently used buffers at the head of thequeue and most recently used buffers at the tail. If this is the case thenthere is absolutely no need to change the AddBufferToFreelist function infreelist.c, let it add the buffers to shared free list as it is presentlydoing. We just change the freelist.GetFreeBuffer so that insteadof removing (LRU) buffers from head of the queue for MRU scheme we will removethe (MRU) buffers from its tail. ( we are afraid whether GetFreeBuffer isthe only function that gives out the buffer from free list. Just in case ifthere is some other function too, then, yet we have to trace it out ) 




Testing Objective: 


(i)               First, We have to test performance of new alogrithm MRUthat we intend to implement. Then we have to look whether we need specificsequential scan code or not.

(ii)             Second, We would like to compare the performance of MRUwith that of LRU-K algorithm and more specifically LRU-2.


LRU-K a brief introduction:


LRU-K is a new approach to database disk buffering proposed byElizabeth J. ONeil, Patrick E. ONeil and Gerhard Weikum. The basic idea ofLRU-K is to keep track of the times of the last K references to populardatabase pages, using this information to statistically estimate theinterarrival times of references on a page to page basis. The algorithm isself-tuning and adaptive to evolving access patterns. 


A patch for LRU-2 was provided to us by Mr. Bruce Momjian.The algorithm was implemented and tested by Mr Tom Lane. Both are fromPostgreSQL development group. Mr. Tom Lane had tested the algorithm but histest were not that good and the tests did not contain any sequential scan. Soit would be an interesting exercise.


Testing Plan: 


(i)                 We plan to do a big sequential scan and at the same time do asmall sequential scan What shouldhappen is that the small sequntial scan should keep its pages in the bufferwhile the large scan is happening if itdoesn't, we have a problem.

(ii)               We have populated two tables namely simple and simple1 withonly one integer field. We have populated these tables with 8192 values. Wehave to make one of the table even more bigger so that one will be used forlarge sequential scan and another will be used for small sequential scan.

(iii)              Then we plan to put some simple queries like given below in afile say samplequeries.txt


Select *from simple where x = 98765432;

Select *from simple1 where y = 567890432;


Where lets assume simple is largetable and simple1 is smaller table. The large file is should have to be largerthan shared buffer size and smaller file should have to be say 10% smaller thanshared buffer size. 


(iv)             Then we have to run psql with f flag giving thefile as parameter. While doing that we will keep track of the time.


(v)               The other way is that we have set the various statistics flagstrue like parser statistics, planner statistics, executor statistics etc. Allthe statistics gets collected into a logfile. 

Following is one of the samplestatistics generated during our sample query test run on default LRU algorithm:



! systemusage stats:

! 0.050108 elapsed 0.000000 user 0.000000system sec

! [0.050000 user 0.130000 sys total]

! 4/0 [50/0] filesystem blocks in/out

! 4/0 [50/0] page faults/reclaims, 0 [0]swaps

! 0 [0] signals rcvd, 0/0 [2/5] messagesrcvd/sent

! 4/0 [149/17] voluntary/involuntarycontext switches

!postgres usage stats:

! Shared blocks: 3 read, 0 written, buffer hit rate = 25.00%

! Localblocks: 0 read, 0 written, buffer hit rate = 0.00%

! Direct blocks: 0 read, 0 written


(vi)             Another way is to run the query with EXPLAIN command inVERBOSE and ANALYZE mode. This prints the total execution time ofthe query along with other matrics like cost of sequential scan startup andtotal cost etc.



Yardstick of Measurement:


(i)                 We will run the test queries on MRU and LRU-2 implementation.If the execution time for MRU is less than that of LRU-2 then We have to thinkof having a specific code for sequential scan.


(ii)               Otherwise if the execution time of LRU-2 is less than or closeto the MRU execution time than we would suggest replacing the default LRU codeof PostgreSQL with LRU-2 algorithm. Since than there will be only one alogrithmfor all the access patterns and no need of advicing the buffer manager forreplacement policy and no need of having different algorithms for largesequential access.





 Toconclude, fetching data from disk requires at least a factor of 1000 more timethan fetching data from a RAM buffer. For this reason, good use of the buffercan significantly improve the throughput and response time fo anydata-intensive system. Equally important is choosing an efficient buffer replacementalgorithm. Thus we aim at improving the performance of PostgreSQL by studyingwhich algorithm among MRU and LRU-2 will prove to be the best bet.


We thank Professor Kevin Chang and TAs for introducing us tosuch a life time opportunity and for encouraging us to enhance PostgreSQL.


We thank Mr. Bruce Momjian for all his timely help andadvices that he had rendered to uswhole heartedly. It is his encouragement that our efforts are directed twordsreally contributing to PostgreSQL development.

 Amit Kumar Khare 




Do You Yahoo!?
Yahoo! Sports - Coverage of the 2002 Olympic Games


pgsql-hackers by date

Next:From: Dave CramerDate: 2002-02-23 15:00:06
Subject: Open magazine article on open source rdbms
Previous:From: Bradley BaetzDate: 2002-02-23 10:21:07

Privacy Policy | About PostgreSQL
Copyright © 1996-2017 The PostgreSQL Global Development Group