Re: SQL - Indexing for performance on uniquness check...

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: "Loftis, Charles E" <charles(dot)loftis(at)eds(dot)com>, pgsql-novice(at)postgresql(dot)org
Subject: Re: SQL - Indexing for performance on uniquness check...
Date: 2004-07-18 20:09:23
Message-ID: 200407181309.23862.josh@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-novice

Charles,

> Assume the table has more attributes than those attributes being GROUPed
> on. Also, assume all attributes are of type varchar.
>
> Sample query to return non-uniqueness
> SELECT A1, A2, A3, ..., An
> FROM Table
> GROUP BY A1, A2, A3, ..., An
> HAVING Count(*)>1

In order for it to be even possible to use an index (a hashaggregate
operation, actually) on this table, you'd have to include *all* of the GROUP
BY columns in a single, multi-column index.

However, it would be unlikely for PG to use any kind of an index in the
operation above, because of the number of columns, the unlikelyness of
grouping (i.e. there will only be a minority of rows with count(*) > 1) and
the fact that you're running this against the whole table. So any kind of an
index is liable to be useless.

If the table is so large that you *have* to use an index or it takes
absolutely forever to run, then you may wish to try different operations to
detect duplicate rows. For example, if it could be assumed that there was a
column Ax which was not included as a unique identifier, and could be counted
on to be both (a) not null and (b) different for duplicate rows (a timestamp
for example), then you could do:

SELECT A1, A2, A3, ... An
FROM table
WHERE EXISTS (SELECT 1
FROM table t2
WHERE t2.A1 = table.A1
AND t2.A2 = table.A2
...
AND t2.An = table.An
AND t2.Ax <> table.Ax
);

This can be a much better structure for indexed searches because an index on
some-but-not-all of the columns A1 ... An can be used for the EXISTS join,
such as an index on A1, A2, A3.

Of course, if this is an import table, where some of the rows are *exact*
duplicates, the above doesn't help. On the other hand, if the rows *are*
exact duplicates, why do you care? Just do SELECT DISTINCT on transfer and
eliminate them.

--
Josh Berkus
Aglio Database Solutions
San Francisco

In response to

Responses

Browse pgsql-novice by date

  From Date Subject
Next Message Oliver Elphick 2004-07-18 22:13:48 Re: Function compile error
Previous Message Loftis, Charles E 2004-07-18 19:20:05 SQL - Indexing for performance on uniquness check...