Re: Removing useless DISTINCT clauses

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Removing useless DISTINCT clauses
Date: 2018-04-07 20:55:36
Message-ID: 4779.1523134536@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
> My gripe about
> this as it stands is mainly that it seems like it's going to do
> a lot of catalog-lookup work twice over, at least in cases that
> have both DISTINCT and GROUP BY --- or is that too small a set to
> worry about?

I convinced myself that that was a silly objection, and started looking
through the patch with intent to commit. However, re-reading the thread
re-awakened my concern about whether deleting items from the
distinctClause is actually safe.

You're right that remove_unused_subquery_outputs isn't really in trouble
with this; it might fail to remove some things it could remove, but that
seems OK, and at best deserving of a comment.

However, that doesn't mean we're out of the woods. See also
preprocess_groupclause, particularly this comment:
* Note: we need no comparable processing of the distinctClause because
* the parser already enforced that that matches ORDER BY.

as well as the interesting assumptions in create_distinct_paths about
what it means to compare the lengths of the pathkey lists for these
clauses. Once I noticed that, I became convinced that this patch actually
breaks planning of sort/unique-based DISTINCT: if we have a pkey a,b,
but the ORDER BY list is a,c,b, then we will sort by a,c,b but the
Unique node will be told it only needs to unique-ify on the first two
sort columns.

That led me to try this test case, and it blew up even better than
I expected:

regression=# create table t1 (a int,b int, c int, d int, primary key(a,b));
CREATE TABLE
regression=# explain verbose select distinct * from t1 order by a,c,b;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

The reason is we're hitting this assert, a bit further down in
create_distinct_paths:

/* Assert checks that parser didn't mess up... */
Assert(pathkeys_contained_in(root->distinct_pathkeys,
needed_pathkeys));

I don't think that that makes this patch unsalvageable. The way forward
probably involves remembering that we removed some distinctClause items
(so that we know the correspondence to the sortClause no longer holds)
and then working harder in create_distinct_paths when that's the case.

However, that's not something that's going to happen on the last day
of the commitfest. So I'm going to mark this Returned With Feedback
and encourage you to return to the matter in v12.

In the meantime, attached is the version of the patch that I was about to
commit before getting cold feet. It has some cosmetic improvements
over yours, notably comment additions in allpaths.c.

regards, tom lane

Attachment Content-Type Size
remove_useless_distinct_clauses_v6.patch text/x-diff 17.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2018-04-07 20:56:44 Re: WIP: Covering + unique indexes.
Previous Message Teodor Sigaev 2018-04-07 20:52:42 Re: WIP: Covering + unique indexes.