Re: Analysis of ganged WAL writes

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Curtis Faith <curtis(at)galtair(dot)com>, Hannu Krosing <hannu(at)tm(dot)ee>, Pgsql-Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Analysis of ganged WAL writes
Date: 2002-10-06 23:07:30
Message-ID: 14533.1033945650@sss.pgh.pa.us
Lists: pgsql-hackers

I said:
> There is a simple error
> in the current code that is easily corrected: in XLogFlush(), the
> wait to acquire WALWriteLock should occur before, not after, we try
> to acquire WALInsertLock and advance our local copy of the write
> request pointer. (To be exact, xlog.c lines 1255-1269 in CVS tip
> ought to be moved down to before line 1275, inside the "if" that
> tests whether we are going to call XLogWrite.)

That patch was not quite right, as it didn't actually flush the
later-arriving data: the flush request still stopped at the caller's own
commit record rather than at the advanced write point, so later committers
would have issued their own fsyncs anyway. The correct patch is

*** src/backend/access/transam/xlog.c.orig	Thu Sep 26 18:58:33 2002
--- src/backend/access/transam/xlog.c	Sun Oct  6 18:45:57 2002
***************
*** 1252,1279 ****
      /* done already? */
      if (!XLByteLE(record, LogwrtResult.Flush))
      {
-         /* if something was added to log cache then try to flush this too */
-         if (LWLockConditionalAcquire(WALInsertLock, LW_EXCLUSIVE))
-         {
-             XLogCtlInsert *Insert = &XLogCtl->Insert;
-             uint32      freespace = INSERT_FREESPACE(Insert);
- 
-             if (freespace < SizeOfXLogRecord)   /* buffer is full */
-                 WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
-             else
-             {
-                 WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
-                 WriteRqstPtr.xrecoff -= freespace;
-             }
-             LWLockRelease(WALInsertLock);
-         }
          /* now wait for the write lock */
          LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
          LogwrtResult = XLogCtl->Write.LogwrtResult;
          if (!XLByteLE(record, LogwrtResult.Flush))
          {
!             WriteRqst.Write = WriteRqstPtr;
!             WriteRqst.Flush = record;
              XLogWrite(WriteRqst);
          }
          LWLockRelease(WALWriteLock);
--- 1252,1284 ----
      /* done already? */
      if (!XLByteLE(record, LogwrtResult.Flush))
      {
          /* now wait for the write lock */
          LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
          LogwrtResult = XLogCtl->Write.LogwrtResult;
          if (!XLByteLE(record, LogwrtResult.Flush))
          {
!             /* try to write/flush later additions to XLOG as well */
!             if (LWLockConditionalAcquire(WALInsertLock, LW_EXCLUSIVE))
!             {
!                 XLogCtlInsert *Insert = &XLogCtl->Insert;
!                 uint32      freespace = INSERT_FREESPACE(Insert);
! 
!                 if (freespace < SizeOfXLogRecord)   /* buffer is full */
!                     WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
!                 else
!                 {
!                     WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
!                     WriteRqstPtr.xrecoff -= freespace;
!                 }
!                 LWLockRelease(WALInsertLock);
!                 WriteRqst.Write = WriteRqstPtr;
!                 WriteRqst.Flush = WriteRqstPtr;
!             }
!             else
!             {
!                 WriteRqst.Write = WriteRqstPtr;
!                 WriteRqst.Flush = record;
!             }
              XLogWrite(WriteRqst);
          }
          LWLockRelease(WALWriteLock);
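
The point of the reordering is that while a backend sleeps waiting for
WALWriteLock, other backends keep inserting their commit records; once the
sleeper finally gets the lock, it opportunistically absorbs those later
insertions (hence the LWLockConditionalAcquire, so we never block on
WALInsertLock while holding WALWriteLock) and commits the whole gang with a
single write/fsync. For anyone who wants to play with the idea in
isolation, here is a toy standalone program sketching the same two-lock
pattern (NOT PostgreSQL code; file name, sizes, and thread counts are
arbitrary assumptions):

/*
 * Toy illustration of ganged flushes: N threads append fixed-size
 * records to a shared log buffer under an "insert" lock, then queue up
 * for a "write" lock; whoever gets the write lock re-checks how far
 * insertions have advanced and commits everything so far with one
 * write()+fsync(), so later arrivals often need no fsync at all.
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS 8
#define NRECS    100
#define RECSIZE  64

static pthread_mutex_t insert_lock = PTHREAD_MUTEX_INITIALIZER; /* ~WALInsertLock */
static pthread_mutex_t write_lock = PTHREAD_MUTEX_INITIALIZER;  /* ~WALWriteLock */
static char logbuf[NTHREADS * NRECS * RECSIZE];
static size_t insert_pos = 0;   /* end of inserted data (insert_lock) */
static size_t flush_pos = 0;    /* end of fsync'd data (write_lock) */
static int logfd;

static void *
writer(void *arg)
{
    (void) arg;
    for (int i = 0; i < NRECS; i++)
    {
        /* "insert our commit record"; remember where it ends */
        pthread_mutex_lock(&insert_lock);
        memset(logbuf + insert_pos, 'x', RECSIZE);
        insert_pos += RECSIZE;
        size_t my_end = insert_pos;
        pthread_mutex_unlock(&insert_lock);

        /* wait for the write lock FIRST ... */
        pthread_mutex_lock(&write_lock);
        if (flush_pos < my_end) /* ... then re-check: maybe done for us */
        {
            size_t req = my_end;

            /* opportunistically absorb records inserted while we slept */
            if (pthread_mutex_trylock(&insert_lock) == 0)
            {
                req = insert_pos;
                pthread_mutex_unlock(&insert_lock);
            }
            (void) write(logfd, logbuf + flush_pos, req - flush_pos);
            fsync(logfd);       /* one revolution commits the whole gang */
            flush_pos = req;
        }
        pthread_mutex_unlock(&write_lock);
    }
    return NULL;
}

int
main(void)
{
    pthread_t tid[NTHREADS];

    logfd = open("ganged.log", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, writer, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    printf("flushed %zu bytes\n", flush_pos);
    close(logfd);
    return 0;
}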

To test this, I made a modified version of pgbench in which each
transaction consists of a simple
insert into table_NNN values(0);
where each client thread has a separate insertion target table.
This is about the simplest transaction I could think of that would
generate a WAL record each time.
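
In case it helps to visualize the client side, each client's loop amounts
to roughly the following as a standalone libpq program (this is not the
actual modified pgbench; the connection string and the table_NNN naming
scheme are just illustrative):

/*
 * Minimal sketch of one benchmark client: a stream of single-row
 * INSERTs, each client targeting its own table.
 */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(int argc, char **argv)
{
    int         client = (argc > 1) ? atoi(argv[1]) : 1;
    int         ntrans = (argc > 2) ? atoi(argv[2]) : 1000;
    char        query[64];
    PGconn     *conn = PQconnectdb("dbname=bench1");

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }
    snprintf(query, sizeof(query), "insert into table_%03d values(0);", client);
    for (int i = 0; i < ntrans; i++)
    {
        /* autocommit: each INSERT is its own transaction, so each one
         * generates a WAL commit record and an XLogFlush at commit */
        PGresult   *res = PQexec(conn, query);

        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "insert failed: %s", PQerrorMessage(conn));
        PQclear(res);
    }
    PQfinish(conn);
    return 0;
}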

Running this modified pgbench with postmaster parameters
postmaster -i -N 120 -B 1000 --wal_buffers=250
and all other configuration settings at default, CVS tip code gives me
a pretty consistent 115-118 transactions per second for anywhere from
1 to 100 pgbench client threads. This is exactly what I expected,
since the database (including WAL file) is on a 7200 RPM SCSI drive.
The theoretical maximum rate of sync'd writes to the WAL file is
therefore 120 per second (one per disk revolution), but we lose a little
because once in a while the disk has to seek to a data file.

Inserting the above patch, and keeping all else the same, I get:

$ mybench -c 1 -t 10000 bench1
number of clients: 1
number of transactions per client: 10000
number of transactions actually processed: 10000/10000
tps = 116.694205 (including connections establishing)
tps = 116.722648 (excluding connections establishing)

$ mybench -c 5 -t 2000 -S -n bench1
number of clients: 5
number of transactions per client: 2000
number of transactions actually processed: 10000/10000
tps = 282.808341 (including connections establishing)
tps = 283.656898 (excluding connections establishing)

$ mybench -c 10 -t 1000 bench1
number of clients: 10
number of transactions per client: 1000
number of transactions actually processed: 10000/10000
tps = 443.131083 (including connections establishing)
tps = 447.406534 (excluding connections establishing)

$ mybench -c 50 -t 200 bench1
number of clients: 50
number of transactions per client: 200
number of transactions actually processed: 10000/10000
tps = 416.154173 (including connections establishing)
tps = 436.748642 (excluding connections establishing)

$ mybench -c 100 -t 100 bench1
number of clients: 100
number of transactions per client: 100
number of transactions actually processed: 10000/10000
tps = 336.449110 (including connections establishing)
tps = 405.174237 (excluding connections establishing)

CPU loading goes from 80% idle at 1 client to 50% idle at 5 clients
to <10% idle at 10 or more.
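
(Sanity check, assuming perfect ganging: each disk revolution could commit
one transaction per waiting client, i.e. upper bounds of roughly 120, 600,
and 1200 tps at 1, 5, and 10 clients. The measured 117, 284, and 447 tps
say we gang a good fraction of the commits before CPU saturation becomes
the limit.)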

So this does seem to be a nice win, and unless I hear objections
I will apply it ...

regards, tom lane
