Re: old synchronized scan patch

From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Jeff Davis" <pgsql(at)j-davis(dot)com>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>, "Hannu Krosing" <hannu(at)skype(dot)net>, "Luke Lonergan" <llonergan(at)greenplum(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, "Eng" <eng(at)intranet(dot)greenplum(dot)com>
Subject: Re: old synchronized scan patch
Date: 2006-12-07 00:31:54
Message-ID: 1165451514.3839.458.camel@silverbirch.site
Lists: pgsql-hackers

On Wed, 2006-12-06 at 15:12 -0500, Tom Lane wrote:

> I think all we need as far as buffer management goes is what I suggested
> before, namely have seqscans on large tables tell bufmgr not to
> increment the usage counter for their accesses. If it stays zero then
> the buffers will be recycled as soon as the sweep gets back to them,
> which is exactly what we want.

This is good, yet it addresses only the non-cache-spoiling behaviour.

BTW, we would need something special to spot an Append node with
multiple SeqScans running in sequence over a partitioned table.
Individual scans may not be that large, but overall the set could be
huge. There's not much point implementing behaviour such as "table must
be bigger than 10x shared_buffers", because that works directly against
the role of partitioning.

> The window for additional backends to
> pick up on the sync scan is then however much of shared_buffers aren't
> occupied by blocks being accessed normally.

Non-cache-spoiling behaviour shrinks that window, so joiners can't rely
on catching the scan by chance.

> If we have syncscan members that are spread out over any significant
> range of the table's blocks, then the problem of setting the hint
> properly becomes a lot more pressing. We'd like incoming joiners to
> start at a fairly low block number, ie not become one of the "pack
> leaders" but one of the "trailers" --- since the higher they start,
> the more blocks they'll need to re-read at the end of their cycles,
> yet those are blocks they could have had "for free" if they'd started
> as a trailer. I don't see any cheap way to bias the behavior in that
> direction, though.

Well, that's the problem. Agreed about the pack leaders/trailers.

What's the solution?

You can synchronise every block, or every N blocks, but otherwise how
will you know the optimal point to start a scan? Being optimal requires
*some* synchronisation.

Most scans don't naturally proceed at the same rate: different plans do
different amounts of work between page requests. Allowing each scan to
go its own way would be very wasteful of I/O, so making some backends
wait for others is an essential part of efficiency, just as it is in
the OS.

Synchronisation costs very little in comparison with the I/O it saves.

Perhaps there are ways of doing this without central control, so that
backend error conditions don't need to be considered. A simple timeout
seems sufficient to exclude backends from the Conga, which would handle
both error cases and special cases.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
