From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, kavoos <kavoos(at)issn(dot)org>, pgsql-general(at)postgresql(dot)org
Subject: Re: Estimating costs (was Functional Indices)
Date: 2001-05-23 15:55:00
Message-ID: 20010524015500.A26920@kleptog.nth
Lists: pgsql-general
On Wed, May 23, 2001 at 10:48:35AM -0400, Tom Lane wrote:
> Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> > But I know something that postgres doesn't. The data is clustered somewhat
> > around the id we're comparing on. There is a "runlength" involved. Thus,
> > when doing an index scan, once it has read in the first tuple of a run there
> > will be more just sitting in the cache at basically zero cost.
>
> Hm. There is code in development sources that will notice data
> clustering --- actually what it looks at is the correlation between
> physical tuple order and logical order of the values. I'm not sure
> whether it would do the right thing if you have runs of identical keys
> but no overall ordering of the keys, however.
That wouldn't help much in this case, since outside of the actual grouping
the keys are not sorted relative to each other.
> > Currently I work around this by fiddling enable_seqscan in strategic
> > places, but that's a blunt instrument.
>
> Yup, it sure is. Can you propose a statistical measurement that would
> cope with this situation?
I was thinking "average runlength". If this were 10, for example, then when
it came to calculating the cost of the index scan, it would divide the
per-tuple cost by 10.
You could go with the simple calculation method: count the number of times
the value in a column differs from the previous value, then divide that into
the total number of tuples. That's not difficult to implement. I was
actually thinking of editing the ANALYZE code to calculate this value, but
after looking at the code I decided I would need to study the codebase a bit
first.
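As a quick sketch (in Python rather than the backend's C, with made-up
sample data), the simple calculation above looks something like this:

```python
def average_runlength(values):
    """Average run length: total tuples divided by the number of runs,
    where a new run starts whenever the value differs from the previous one."""
    if not values:
        return 0.0
    runs = 1  # the first value opens the first run
    for prev, cur in zip(values, values[1:]):
        if cur != prev:
            runs += 1
    return len(values) / runs

# 10 tuples in 3 runs, so the per-tuple index cost would be divided by ~3.3
print(average_runlength([1, 1, 1, 2, 2, 3, 3, 3, 3, 3]))
```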
There is another way which would let you calculate a value that you can
tune. It works something like this (for a single column):
counter = 1
lastval = tuple[1].value
foreach remaining tuple
    if tuple.value != lastval
        histogram[counter]++
        lastval = tuple.value
        counter = 1
    else
        counter++
histogram[counter]++    # don't forget to record the final run
You then have a histogram of runlengths. Then, from the highest runlength
down to the lowest, do a cumulative sum of the tuple counts compared against
the total. You can then choose where to take the value: at the 50% mark, the
80% mark, or even 95%.
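The histogram-and-cutoff idea can be sketched as follows (again in Python
with illustrative names of my own; the sample data is invented):

```python
from collections import Counter

def runlength_histogram(values):
    """Build a histogram mapping run length -> number of runs of that length,
    following the pseudocode above."""
    hist = Counter()
    counter = 1
    lastval = values[0]
    for v in values[1:]:
        if v != lastval:
            hist[counter] += 1
            lastval = v
            counter = 1
        else:
            counter += 1
    hist[counter] += 1  # record the final run
    return hist

def runlength_at(hist, fraction):
    """Walk from the longest runlength down, accumulating tuple counts,
    and return the runlength at which the chosen fraction of all tuples
    (0.5, 0.8, 0.95, ...) is covered."""
    total = sum(length * count for length, count in hist.items())
    covered = 0
    for length in sorted(hist, reverse=True):
        covered += length * hist[length]
        if covered >= fraction * total:
            return length
    return 1

hist = runlength_histogram([1, 1, 1, 2, 2, 3, 3, 3, 3, 3])
print(runlength_at(hist, 0.5))  # half the tuples sit in runs at least this long
```

Taking the value at a higher mark (say 95%) is the conservative choice: it
only credits clustering that covers nearly all of the table.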
Note that I have absolutely no theory to base this on, only that it seems
silly not to take advantage of any clustering present in the data.
For a Perl script that calculates the histogram, go to:
http://svana.org/kleptog/getcoherency.pl
Use it like: ./getcoherency dbname tablename fieldnames... > output
For example output, go to:
http://svana.org/kleptog/output.txt
Hope this helps,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
- Every night you must rediscover the secret to sleep,
- only to forget it moments later...