Re: Using indices for UNION.

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: andres(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Using indices for UNION.
Date: 2014-04-11 08:43:53
Message-ID: 20140411.174353.74658270.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello. Thank you for the attentive comments.

> I wrote:
> > I still think this stuff mostly needs to be thrown away and rewritten
> > in Path-creation style, but that's a long-term project. In the meantime
> > this seems like a relatively small increment of complexity, so maybe it's
> > worth applying. I'm concerned about the method for building a new
> > DISTINCT clause though; the best that can be said for that is that it's
> > a crude hack, and I'm less than convinced that there are no cases where
> > it'll dump core.
>
> OK, after still more time thinking about it, here's the issues with that
> way of generating a DISTINCT clause corresponding to the UNION:
>
> 1. Calling transformDistinctClause with a NULL pstate seems pretty unsafe.
> It might accidentally fail to fail, today, but it's surely fragile.
> It's violating the API contract not only of transformDistinctClause
> itself (which is not documented as accepting a NULL pstate) but of
> addTargetToGroupList (q.v.).

Hmm I'm ashamed to have missed addTargetToGroupList. You are
right about that. I saw only parser_errposition to field the
NULL.. It should be fixed anyway.

> A minimum requirement before I'd accept a
> patch that did that is that it extended the header comments of those
> functions to specify under what circumstances a NULL pstate can be passed.
> However, that's not the direction to go, because of #2.
>
> 2. This approach potentially changes the semantics of the UNION. (This is
> only important for a datatype that has more than one btree equality
> operator, but let's posit that.) transformDistinctClause, as per its
> header comment, allows the outer ORDER BY to influence which equality
> operator it picks for DISTINCT. However, in the case of
> "(SELECT ... UNION SELECT ...) ORDER BY ...", the UNION semantics were
> chosen without reference to the ORDER BY,

If my understanding is correct, the query is equivalent to it
without the parens. The ORDER BY belongs to the topmost UNION on
both cases. Actually planner compailns about "(SELECT...UNION
SELECT ... ORDER BY) ORDER BY" that "multiple ORDER BY clauses
not allowed" with/without this patch.

> so it's possible that the
> equality operators cited in the SetOperationStmt's groupClauses list are
> not what you'd get from applying transformDistinctClause as the patch does.

This patch flattenes and the SetOperationStmt was put out along
with its groupClause. But the groupClauses is originally
generated from targetlist in transformSetOperationTree. And the
new DistinctClause is generated also from the same targetlist
consulting to sortClause. They are not in the same order but it
doesn't matter. Is it wrong? If it would make some trouble, I
could add an check it out or count it on making the
DistinctClauses..

Finally,

explain (costs off) select a, b from c11 union select a, b from c12 order by b limit 10000;
| QUERY PLAN
| -----------------------------------------
| Limit
| -> Sort
| Sort Key: c11.b
| -> HashAggregate
| Group Key: c11.b, c11.a
| -> Append
| -> Seq Scan on c11
| -> Seq Scan on c12

This HashAggregate does DISTINCT which was performed by using
Sort-Unique without this patch, and yields the same result, I
believe.

> It is not okay for the planner to override the parser's choice of
> semantics like that.

As described above, as far as I saw, itself seems to be a problem.

> Now I'm fairly sure that planner.c is expecting that the query's
> distinctClause be a superset of the sortClause if any (cf comments for
> SortGroupClause in parsenodes.h), so it wouldn't be okay to just blindly
> build a distinctClause from topop->groupClauses.

Yes, it could be but counting the sortClause into distinctClause
is crucial factor for getting rid of unnecessary sorting.

> I think what you need
> to do is check that topop->groupClauses is compatible with the sortClause
> if any (skipping the flattening transformation if not) and then build a
> distinctClause by extending the sort clause list with any missing items
> from topop->groupClauses.

Ah, yes I agree as I described above.

> So this is sort of like what
> transformDistinctClause does, but different in detail, and the failure
> case is to not do the transformation, rather than ereport'ing. (See also
> generate_setop_grouplist, which you could almost use except that it's not
> expecting to have to merge with a sortClause list.)

Thank you. I'll follow this advise.

> So this is all doable enough, but you're probably going to end up with
> a new distinctClause-generating function that's at least twice the size
> of the patch's former net code addition. So the question is whether it's
> worth the trouble, considering that all this code has (I hope) a life
> expectancy of no more than one or two more release cycles.

Hmm. Since this is a kind of local optimization, some future
drastic reconstructing might make this useless. But it doesn't
seem the reason not to do this. (Ignoring the size..)

> I'm going to set the patch back to Waiting on Author, but unless you
> are prepared to rewrite the distinctClause creation code in the next
> couple of days, you should change it to Returned with Feedback.

Ok, I agree to the deal. This needs more refinement.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Dean Rasheed 2014-04-11 08:56:58 Re: [PATCH] Negative Transition Aggregate Functions (WIP)
Previous Message Dean Rasheed 2014-04-11 08:39:57 Re: WIP patch (v2) for updatable security barrier views