Skip site navigation (1) Skip section navigation (2)

Re: Q on views and performance

From: Matthew <matthew(at)flymine(dot)org>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Q on views and performance
Date: 2008-02-25 14:08:06
Message-ID: Pine.LNX.4.64.0802251345480.20402@aragorn.flymine.org (view raw or flat)
Thread:
Lists: pgsql-performance
So, this email is directed much more towards Postgres Powers That Be. I 
came across this problem a while ago, and I haven't checked whether it has 
been improved.

On Mon, 25 Feb 2008, I wrote:
>> Hi.  I'm trying to optimize...
>> 
>> (Q1)   SELECT a1.word, a2.word
>>         FROM T a1 JOIN T a2 USING ( zipk )
>>        WHERE a1.type = <int1>
>>          AND a2.type = <int2>;
>
> Create an index on T(type, zipk), and then CLUSTER on that index. That will 
> effectively group all the data for one type together and sort it by zipk, 
> making a merge join very quick indeed. I'm not sure whether Postgres will 
> notice that, but it's worth a try.

Statistics are generated on fields in a table, and the one I'm interested 
in is the correlation coefficient which tells Postgres how costly an index 
scan sorted on that field would be. This entry is ONLY useful when the 
result needs to be sorted by that exact field only. For example:

CREATE TABLE test (a int, b int);
// insert a bazillion entries
CREATE INDEX testIndex ON test(a, b);
CLUSTER test ON testIndex;
ANALYSE;

So now we have a table sorted by (a, b), but the statistics only record 
the fact that it is sorted by a, and completely unsorted by b. If we run:

SELECT * FROM test ORDER BY a;

then the query will run quickly, doing an index scan. However, if we run:

SELECT * FROM test ORDER BY a, b;

then Postgres will not be able to use the index, because it cannot tell 
how sequential the fetches from the index will be. Especially if we run:

SELECT * FROM test WHERE a = <something> ORDER BY b;

then this is the case.

So, these observations were made a long time ago, and I don't know if they 
have been improved. A while back I suggested a "partial sort" algorithm 
that could take a stream sorted by a and turn it into a stream sorted by 
(a, b) at small cost. That would fix some instances of the problem. 
However, now I suggest that the statistics are in the wrong place.

At the moment, the correlation coefficient, which is an entry purely 
designed to indicate how good an index is at index scans, is a statistic 
on the first field of the index. Why not create a correlation coefficient 
statistic for the index as a whole instead, and store it elsewhere in the 
statistics data? That way, instead of having to infer from the first field 
how correlated an index is, and getting it wrong beyond the first field, 
you can just look up the correlation for the index.

Opinions?

Matthew

-- 
If you let your happiness depend upon how somebody else feels about you,
now you have to control how somebody else feels about you. -- Abraham Hicks

In response to

pgsql-performance by date

Next:From: Nikolas EverettDate: 2008-02-25 14:11:29
Subject: Re: response time when querying via JDBC and via psql differs
Previous:From: MatthewDate: 2008-02-25 13:45:34
Subject: Re: Q on views and performance

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group