From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, kavoos <kavoos(at)issn(dot)org>, pgsql-general(at)postgresql(dot)org
Subject: Re: Estimating costs (was Functional Indices)
Date: 2001-05-23 15:55:00
Message-ID: 20010524015500.A26920@kleptog.nth
Lists: pgsql-general
On Wed, May 23, 2001 at 10:48:35AM -0400, Tom Lane wrote:
> Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> > But I know something that postgres doesn't. The data is clustered somewhat
> > around the id we're comparing on. There is a "runlength" involved. Thus,
> > when doing an index scan, once it has read in the first tuple of a run there
> > will be more just sitting in the cache at basically zero cost.
>
> Hm. There is code in development sources that will notice data
> clustering --- actually what it looks at is the correlation between
> physical tuple order and logical order of the values. I'm not sure
> whether it would do the right thing if you have runs of identical keys
> but no overall ordering of the keys, however.
That wouldn't help much in this case, since outside of the actual grouping
the keys are not sorted relative to each other.
> > Currently I work around this by fiddling enable_seqscan in strategic
> > places, but that's a blunt instrument.
>
> Yup, it sure is. Can you propose a statistical measurement that would
> cope with this situation?
I was thinking "average runlength". If this were 10, for example, then when
it came to calculating the cost of the index scan, it would divide the
per-tuple cost by 10.
You could go with the simple calculation method: count the number of times
the value in a column differs from the previous value, then divide that into
the total number of tuples. That's not difficult to implement. I was
actually thinking of editing the ANALYZE code to calculate this value, but
after looking at the code I decided I would need to study the codebase a bit
first.
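As a quick sketch (in Python rather than the backend's C, with made-up
sample data), the simple calculation above looks something like this:

```python
def average_runlength(values):
    """Average run length: total tuples divided by the number of runs,
    where a new run starts whenever the value differs from the previous one."""
    if not values:
        return 0.0
    runs = 1  # the first value opens the first run
    for prev, cur in zip(values, values[1:]):
        if cur != prev:
            runs += 1
    return len(values) / runs

# 10 tuples in 3 runs, so the per-tuple index cost would be divided by ~3.3
print(average_runlength([1, 1, 1, 2, 2, 3, 3, 3, 3, 3]))
```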
There is another way which would let you calculate a value that you can
tune. It works something like this (for a single column):
counter = 1
lastval = tuple[1].value
foreach remaining tuple
    if tuple.value != lastval
        histogram[counter]++
        lastval = tuple.value
        counter = 1
    else
        counter++
histogram[counter]++    # don't forget to record the final run
You then have a histogram of runlengths. Then, from the highest runlength
down to the lowest, do a cumulative sum of the tuple counts compared against
the total. You can then choose where to take the value: at the 50% mark, the
80% mark, or even 95%.
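The histogram-and-cutoff idea can be sketched as follows (again in Python
with illustrative names of my own; the sample data is invented):

```python
from collections import Counter

def runlength_histogram(values):
    """Build a histogram mapping run length -> number of runs of that length,
    following the pseudocode above."""
    hist = Counter()
    counter = 1
    lastval = values[0]
    for v in values[1:]:
        if v != lastval:
            hist[counter] += 1
            lastval = v
            counter = 1
        else:
            counter += 1
    hist[counter] += 1  # record the final run
    return hist

def runlength_at(hist, fraction):
    """Walk from the longest runlength down, accumulating tuple counts,
    and return the runlength at which the chosen fraction of all tuples
    (0.5, 0.8, 0.95, ...) is covered."""
    total = sum(length * count for length, count in hist.items())
    covered = 0
    for length in sorted(hist, reverse=True):
        covered += length * hist[length]
        if covered >= fraction * total:
            return length
    return 1

hist = runlength_histogram([1, 1, 1, 2, 2, 3, 3, 3, 3, 3])
print(runlength_at(hist, 0.5))  # half the tuples sit in runs at least this long
```

Taking the value at a higher mark (say 95%) is the conservative choice: it
only credits clustering that covers nearly all of the table.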
Note that I have absolutely no theory to base this on, only that it seems
silly not to take advantage of any clustering present in the data.
For a Perl script that calculates the histogram, go to:
http://svana.org/kleptog/getcoherency.pl
Use it like: ./getcoherency dbname tablename fieldnames... > output
For example output, go to:
http://svana.org/kleptog/output.txt
Hope this helps,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
- Every night you must rediscover the secret to sleep,
- only to forget it moments later...