Re: DISTINCT with btree skip scan

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Thomas Munro <munro(at)ip9(dot)org>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: DISTINCT with btree skip scan
Date: 2014-10-27 07:24:24
Message-ID: CAApHDvow8-DokZExZwHB_rJ9euT9NGgxX7+Zuy0AKHpcueF1Qg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Jul 5, 2014 at 12:17 PM, Thomas Munro <munro(at)ip9(dot)org> wrote:

> postgres=# set enable_hashagg = false;
> SET
> Time: 0.302 ms
> postgres=# explain select distinct a from foo;
>
> ┌─────────────────────────────────────────────────────────────────────────────────────────────────────┐
> │ QUERY PLAN
> │
>
> ├─────────────────────────────────────────────────────────────────────────────────────────────────────┤
> │ Index Only Scan for distinct prefix 1 using foo_pkey on foo
> (cost=0.43..263354.20 rows=10 width=4) │
> │ Planning time: 0.063 ms
> │
>
> └─────────────────────────────────────────────────────────────────────────────────────────────────────┘
> (2 rows)
>
> Time: 0.443 ms
> postgres=# select distinct a from foo;
> ┌───┐
> │ a │
> ├───┤
> │ 0 │
> │ 1 │
> │ 2 │
> │ 3 │
> │ 4 │
> │ 5 │
> │ 6 │
> │ 7 │
> │ 8 │
> │ 9 │
> └───┘
> (10 rows)
>
> Time: 0.565 ms
>
>
>
Hi Thomas,

I've had a quick look at this and it seems like a great win! I'm quite
surprised that we've not got this already. I think this technology could
also really help performance of queries such as SELECT * from bigtable bt
WHERE EXISTS(SELECT 1 FROM otherbigtable obt WHERE bt.somecol =
obt.someIndexedButNonUniqueColumn); I know you're not proposing to improve
that first off, but it could be done later once the benefits of this are
more commonly realised.

I think our shortfalls in this area have not gone unnoticed. I was reading
this post
https://www.periscope.io/blog/count-distinct-in-mysql-postgres-sql-server-and-oracle.html
about comparing performance of COUNT(DISTINCT col). I think this would give
a big win for query 3 in that post. I'm trying to find some time make some
changes to transform queries to allow the group by to happen before the
joins when possible, so between that and your patch we'd be 2 steps closer
to making query 1 in the link above perform a little better on PostgreSQL.

Do you think you'll manage to get time to look at this a bit more? I'd be
keen to look into the costing side of it if you think that'll help you any?

Regards

David Rowley

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2014-10-27 07:52:27 Re: On partitioning
Previous Message Ali Akbar 2014-10-27 07:12:31 Re: Function array_agg(array)