Re: big distinct clause vs. group by

From: Uwe Bartels <uwe(dot)bartels(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: big distinct clause vs. group by
Date: 2011-04-18 17:13:28
Message-ID: BANLkTimPgtYKLyEE8K0NKoE=DAvgm7p8+A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi Robert,

thanks for your answer.
the aggregate function I was talking about is the function I need to use for
the non-group by columns like min() in my example.
There are of course several function to choose from, and I wanted to know
which causes as less as possible resources.

best regards,
Uwe

On 18 April 2011 18:19, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Wed, Mar 16, 2011 at 4:45 AM, Uwe Bartels <uwe(dot)bartels(at)gmail(dot)com>
> wrote:
> > I'm having trouble with some sql statements which use an expression with
> > many columns and distinct in the column list of the select.
> > select distinct col1,col2,.....col20,col21
> > from table1 left join table2 on <join condition>,...
> > where
> > <other expressions>;
> >
> > The negative result is a big sort with teporary files.
> > -> Sort (cost=5813649.93..5853067.63 rows=15767078
> width=80)
> > (actual time=79027.079..81556.059 rows=12076838 loops=1)
> > Sort Method: external sort Disk: 1086096kB
> > By the way - for this query I have a work_mem of 1 GB - so raising this
> > further is not generally possible - also not for one special command, due
> to
> > parallelism.
> >
> > How do I get around this?
>
> Hmm. It seems to me that there's no way to work out the distinct
> values without either sorting or hashing the output, which will
> necessarily be slow if you have a lot of data.
>
> > I have one idea and like to know if there any other approaches or an even
> > known better solution to that problem. By using group by I don't need the
> > big sort for the distinct - I reduce it (theoreticly) to the key columns.
> >
> > select <list of key columns>,<non key column>
> > from tables1left join table2 on <join condition>,...
> > where
> > <other conditions>
> > group by <list of key columns>
>
> You might try SELECT DISTINCT ON (key columns) <key columns> <non-key
> columns> FROM ...
>
> > Another question would be what's the aggregate function which needs as
> less
> > as possible resources (time).
>
> Not sure I follow this part.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message John Rouillard 2011-04-18 18:12:41 Assessing performance of fetches
Previous Message Robert Haas 2011-04-18 16:51:59 Re: Adding additional index causes 20,000x slowdown for certain select queries - postgres 9.0.3