Re: new correlation metric

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org, npboley(at)gmail(dot)com
Subject: Re: new correlation metric
Date: 2008-10-26 15:57:47
Message-ID: 4904937B.9080009@enterprisedb.com
Lists: pgsql-hackers

Martijn van Oosterhout wrote:
> On Sun, Oct 26, 2008 at 01:38:02AM -0700, Jeff Davis wrote:
>> I worked with Nathan Boley to come up with what we think is a better
>> metric for measuring this cost. It is based on the number of times in
>> the ordered sample that you have to physically backtrack (i.e. the data
>> value increases, but the physical position is earlier).
>>
>> For example, if the table's physical order is
>>
>> 6 7 8 9 10 1 2 3 4 5
>
> How does it deal with a case like the following:
>
> 1 6 2 7 3 8 4 9 5 10 (interleaving)
>
> ISTM that your code will overestimate the cost whereas the old code
> wouldn't have done too badly.

Another similar case is where the table is almost in order, but not
quite. For example, take an ordered table and swap every adjacent pair
of values:

2 1 4 3 6 5 8 7 10 9

Distributions like that would arise naturally if you have a logging
application that receives timestamped events from elsewhere. The events
would come in roughly timestamp order, but not exactly, because of
varying delays, for example. For all practical purposes, that is just as
good as a perfectly sorted table.

However, the patch operates on the *sample*, not the real physical
order. So I think it would actually work well for my example, because
if you take just a few values from that sequence at random, the sample
is likely to be in good order. I'm not sure about the interleaved case.
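
For my own understanding, here's a minimal sketch of the backtrack count
as I read Jeff's description (sort by value, count the steps where the
physical position goes backwards), applied to the orderings mentioned in
this thread. The names are mine, not the patch's, and the real patch
presumably weights or normalises things differently, so take the numbers
as illustration only:

/*
 * Sketch only: count steps, in increasing value order, where the
 * physical position moves backwards.  Not the patch's actual code.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    int     value;
    int     pos;        /* physical position */
} Item;

static int
cmp_by_value(const void *a, const void *b)
{
    return ((const Item *) a)->value - ((const Item *) b)->value;
}

static int
backtracks(const int *vals, int n)
{
    Item   *items = malloc(n * sizeof(Item));
    int     count = 0;

    for (int i = 0; i < n; i++)
    {
        items[i].value = vals[i];
        items[i].pos = i;
    }
    qsort(items, n, sizeof(Item), cmp_by_value);

    for (int i = 1; i < n; i++)
        if (items[i].pos < items[i - 1].pos)
            count++;

    free(items);
    return count;
}

int
main(void)
{
    int shifted[]     = {6, 7, 8, 9, 10, 1, 2, 3, 4, 5};
    int interleaved[] = {1, 6, 2, 7, 3, 8, 4, 9, 5, 10};
    int swapped[]     = {2, 1, 4, 3, 6, 5, 8, 7, 10, 9};

    printf("shifted:     %d backtracks\n", backtracks(shifted, 10));
    printf("interleaved: %d backtracks\n", backtracks(interleaved, 10));
    printf("swapped:     %d backtracks\n", backtracks(swapped, 10));
    return 0;
}

On the pairwise-swapped ordering this raw count flags a backtrack at
roughly every other step, even though the table is practically as good
as sorted, which is what worries me about the full-table case.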

I wonder how the proposed measurement behaves wrt. the sample size vs.
table size ratio. Intuitively, it feels like it becomes less sensitive
to disorder as the table size grows and (sample size)/(table size)
becomes smaller. That would not be good, I believe.
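
To get a feel for that, here's a rough, self-contained simulation under
my own assumptions: a pairwise-swapped "nearly sorted" table, a uniform
random sample of k rows, and the same simple backtrack count as in the
sketch above. It just prints the backtrack fraction for a few sample
sizes, so one can eyeball how the sensitivity changes as the sample
fraction shrinks:

/*
 * Sketch only: backtrack fraction of a random sample from a
 * pairwise-swapped table, for several sample sizes.  The modulo bias
 * of rand() doesn't matter for this purpose.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    int     value;
    int     pos;        /* physical position in the table */
} Item;

static int
cmp_by_value(const void *a, const void *b)
{
    return ((const Item *) a)->value - ((const Item *) b)->value;
}

int
main(void)
{
    const int   tablesize = 30000;
    int         samplesizes[] = {100, 1000, 10000, 30000};
    int        *table = malloc(tablesize * sizeof(int));
    int        *positions = malloc(tablesize * sizeof(int));

    /* nearly sorted table: every adjacent pair of values swapped */
    for (int i = 0; i < tablesize; i += 2)
    {
        table[i] = i + 2;
        table[i + 1] = i + 1;
    }

    srand(1);

    for (int s = 0; s < 4; s++)
    {
        int         k = samplesizes[s];
        Item       *sample = malloc(k * sizeof(Item));
        int         count = 0;

        /* draw k distinct positions with a partial Fisher-Yates shuffle */
        for (int i = 0; i < tablesize; i++)
            positions[i] = i;
        for (int i = 0; i < k; i++)
        {
            int         j = i + rand() % (tablesize - i);
            int         tmp = positions[i];

            positions[i] = positions[j];
            positions[j] = tmp;
        }

        for (int i = 0; i < k; i++)
        {
            sample[i].pos = positions[i];
            sample[i].value = table[positions[i]];
        }

        /* in value order, count steps where the physical position decreases */
        qsort(sample, k, sizeof(Item), cmp_by_value);
        for (int i = 1; i < k; i++)
            if (sample[i].pos < sample[i - 1].pos)
                count++;

        printf("sample %5d of %d rows: %.4f backtracks per step\n",
               k, tablesize, (double) count / (k - 1));
        free(sample);
    }

    free(table);
    free(positions);
    return 0;
}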

> I think the code is in the right direction, but I think want you want
> is some kind of estimate of "given I've looked for tuple X, how many
> tuples in the next k pages are near this one". Unfortunatly I don't see
> a way of calculating it other than a full simulation.

Yeah, it would be nice to figure out something, as the current method
sure isn't perfect.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
