Re: Seq scans roadmap

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: Simon Riggs <simon(at)enterprisedb(dot)com>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, CK Tan <cktan(at)greenplum(dot)com>, Luke Lonergan <LLonergan(at)greenplum(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-15 09:32:20
Message-ID: 46497E24.6060500@enterprisedb.com
Lists: pgsql-hackers

Just to keep you guys informed, I've been busy testing and pondering
over different buffer ring strategies for vacuum, seqscans and copy.
Here's what I'm going to do:

Use a fixed-size ring. Fixed as in the size doesn't change after the ring
is initialized; however, different kinds of scans use differently sized rings.
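To make that concrete, here's a rough sketch in C of the shape I have in
mind. The struct and function names are made up for illustration; this is
not code from the patch:

    #include <stdlib.h>

    typedef int Buffer;            /* stand-in for the real buffer identifier */

    typedef struct BufferRing
    {
        int     size;              /* fixed when the ring is initialized */
        int     current;           /* slot of the most recently returned buffer */
        Buffer *buffers;
    } BufferRing;

    static BufferRing *
    ring_create(int size)
    {
        BufferRing *ring = malloc(sizeof(BufferRing));

        ring->size = size;         /* e.g. 32 for vacuum and seq scans */
        ring->current = -1;
        ring->buffers = calloc(size, sizeof(Buffer));
        return ring;
    }

    /* Advance to the next slot; whatever buffer sits there is the
     * candidate for recycling. */
    static int
    ring_next_slot(BufferRing *ring)
    {
        ring->current = (ring->current + 1) % ring->size;
        return ring->current;
    }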

I said earlier that it'd be an invasive change to check whether a buffer
needs a WAL flush and choose another victim if that's the case. I looked at
it again and found a pretty clean way of doing it, so I took that approach
for seq scans.
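Continuing the sketch above, the victim-selection idea looks roughly like
this. buffer_is_dirty, buffer_lsn, wal_flushed_upto and clock_sweep_victim
are hypothetical stand-ins for the real buffer manager calls:

    typedef unsigned long long XLogRecPtr;  /* stand-in for the real LSN type */

    extern int        buffer_is_dirty(Buffer buf);
    extern XLogRecPtr buffer_lsn(Buffer buf);   /* LSN of last change to page */
    extern XLogRecPtr wal_flushed_upto(void);   /* how far WAL has been flushed */
    extern Buffer     clock_sweep_victim(void); /* normal buffer replacement */

    static Buffer
    ring_get_victim(BufferRing *ring)
    {
        int    slot = ring_next_slot(ring);
        Buffer buf = ring->buffers[slot];

        if (buffer_is_dirty(buf) && buffer_lsn(buf) > wal_flushed_upto())
        {
            /*
             * Recycling this buffer would force a WAL flush: kick it out
             * of the ring and take a victim from shared buffers instead.
             * The dirty buffer stays in the buffer cache to be written
             * out by the usual means.
             */
            buf = clock_sweep_victim();
            ring->buffers[slot] = buf;
        }
        return buf;
    }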

1. For VACUUM, use a ring of 32 buffers. 32 buffers is small enough to
give the L2 cache benefits and keep cache pollution low, but at the same
time large enough to keep the need for WAL flushes reasonable (1/32 of
what we do now).

2. For sequential scans, also use a ring of 32 buffers, but whenever a
buffer in the ring would need a WAL flush to be recycled, throw it out of
the ring instead. On read-only scans (and scans that only update hint
bits) this gives the L2 cache benefits and doesn't pollute the buffer
cache. On bulk updates, it's effectively the current behavior. On scans
that do some updates, it's something in between. In all cases it should be
no worse than what we have now. 32 buffers should also be large enough to
leave a "cache trail" for Jeff's synchronized scans to work.

3. For COPY that doesn't write WAL, use the same strategy as for
sequential scans. This keeps the cache pollution low and gives the L2
cache benefits.

4. For COPY that writes WAL, use a large ring of 2048-4096 buffers. We
want a ring that can accommodate 1 WAL segment worth of data, to avoid
having to do any extra WAL flushes, and the WAL segment size is 2048 pages
in the default configuration. (The ring sizes for all four cases are
sketched in code after this list.)
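Putting the four cases together, the sizing boils down to something like
this. Again just an illustrative sketch, not code from the patch:

    typedef enum
    {
        ACCESS_VACUUM,          /* case 1 above */
        ACCESS_SEQSCAN,         /* case 2 */
        ACCESS_COPY_NO_WAL,     /* case 3 */
        ACCESS_COPY_WAL         /* case 4 */
    } AccessPattern;

    static int
    ring_size_for(AccessPattern pattern)
    {
        switch (pattern)
        {
            case ACCESS_VACUUM:
            case ACCESS_SEQSCAN:
            case ACCESS_COPY_NO_WAL:
                return 32;      /* L2-friendly, and WAL flushes drop to
                                 * 1/32 of what we do now */
            case ACCESS_COPY_WAL:
                return 2048;    /* ~1 WAL segment: 16 MB / 8 kB pages =
                                 * 2048 pages in the default configuration */
        }
        return 32;              /* not reached */
    }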

Some alternatives I considered but rejected:

* Instead of throwing away dirtied buffers in seq scans, accumulate them
in another fixed-size list. When that list gets full, do one WAL flush and
put the buffers on the shared freelist or a backend-private freelist. That
would eliminate the cache pollution of bulk DELETEs and bulk UPDATEs, and
it could be used for vacuum as well. I think this would be the optimal
algorithm, but I don't feel like inventing something that complicated at
this stage. Maybe for 8.4. (A rough sketch follows after this list.)

* Using a differently sized ring for the 1st and 2nd vacuum phases.
Decided that it's not worth the trouble; the above is already an order of
magnitude better than the current behavior.
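To flesh out that first alternative, it would look roughly like this,
continuing with the made-up types and helpers from the sketches above
(flush_wal_upto, write_buffer and freelist_push are likewise hypothetical):

    #define DIRTY_LIST_SIZE 32

    typedef struct DirtyList
    {
        int    count;
        Buffer buffers[DIRTY_LIST_SIZE];
    } DirtyList;

    extern void flush_wal_upto(XLogRecPtr lsn);
    extern void write_buffer(Buffer buf);
    extern void freelist_push(Buffer buf);

    static void
    dirty_list_add(DirtyList *list, Buffer buf)
    {
        list->buffers[list->count++] = buf;

        if (list->count == DIRTY_LIST_SIZE)
        {
            XLogRecPtr max_lsn = 0;
            int        i;

            /* One WAL flush covers the whole batch... */
            for (i = 0; i < DIRTY_LIST_SIZE; i++)
            {
                if (buffer_lsn(list->buffers[i]) > max_lsn)
                    max_lsn = buffer_lsn(list->buffers[i]);
            }
            flush_wal_upto(max_lsn);

            /* ...then the buffers can be cleaned and reused. */
            for (i = 0; i < DIRTY_LIST_SIZE; i++)
            {
                write_buffer(list->buffers[i]);
                freelist_push(list->buffers[i]);
            }
            list->count = 0;
        }
    }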

I'm going to rerun the performance tests I ran earlier with the new patch,
tidy it up a bit, and submit it in the next few days. This turned out to
be an even more laborious patch to review than I thought. While the patch
is short and in the end turned out to be very close to Simon's original
patch, there are many different usage scenarios that need to be catered
for and tested.

I still need to check the interaction with Jeff's patch. This is close
enough to Simon's original patch that I believe the results of the tests
Jeff ran earlier are still valid.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
