Skip site navigation (1) Skip section navigation (2)

Re: SQL - Indexing for performance on uniquness check...

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: "Loftis, Charles E" <charles(dot)loftis(at)eds(dot)com>,pgsql-novice(at)postgresql(dot)org
Subject: Re: SQL - Indexing for performance on uniquness check...
Date: 2004-07-18 20:09:23
Message-ID: 200407181309.23862.josh@agliodbs.com (view raw or flat)
Thread:
Lists: pgsql-novice
Charles,

> Assume the table has more attributes than those attributes being GROUPed
> on. Also, assume all attributes are of type varchar.
>
> Sample query to return non-uniqueness
>    SELECT A1, A2, A3, ..., An
>    FROM Table
>    GROUP BY A1, A2, A3, ..., An
>    HAVING Count(*)>1

In order for it to be even possible to use an index (a hashaggregate 
operation, actually) on this table, you'd have to include *all* of the GROUP 
BY columns in a single, multi-column index.

However, it would be unlikely for PG to use any kind of an index in the 
operation above, because of the number of columns, the unlikelyness of 
grouping (i.e. there will only be a minority of rows with count(*) > 1) and 
the fact that you're running this against the whole table.  So any kind of an 
index is liable to be useless.

If the table is so large that you *have* to use an index or it takes 
absolutely forever to run, then you may wish to try different operations to 
detect duplicate rows.   For example, if it could be assumed that there was a 
column Ax which was not included as a unique identifier, and could be counted 
on to be both (a) not null and (b) different for duplicate rows (a timestamp 
for example), then you could do:

SELECT A1, A2, A3, ... An
FROM table
WHERE EXISTS (SELECT 1
	FROM table t2 
	WHERE t2.A1 = table.A1
		AND t2.A2 = table.A2
		...
		AND t2.An = table.An
		AND t2.Ax <> table.Ax
	);

This can be a much better structure for indexed searches because an index on 
some-but-not-all of the columns A1 ... An can be used for the EXISTS join, 
such as an index on A1, A2, A3.   

Of course, if this is an import table, where some of the rows are *exact* 
duplicates, the above doesn't help.   On the other hand, if the rows *are* 
exact duplicates, why do you care?   Just do SELECT DISTINCT on transfer and 
eliminate them.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco

In response to

Responses

pgsql-novice by date

Next:From: Oliver ElphickDate: 2004-07-18 22:13:48
Subject: Re: Function compile error
Previous:From: Loftis, Charles EDate: 2004-07-18 19:20:05
Subject: SQL - Indexing for performance on uniquness check...

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group