Re: Index Skip Scan

From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-06-16 15:03:36
Message-ID: CA+q6zcXoYjCtyU-jFU_=p2DL_K3BHbBEEbJHaJTxdvRX3cU-Sg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Fri, Jun 14, 2019 at 9:20 AM David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> wrote:
>
> The code in create_distinct_paths() I think should work a different
> way. I think it would be much better to add a new field to Path and
> allow a path to know what keys it is distinct for. This sort of goes
> back to an idea I thought about when developing unique joins
> (9c7f5229ad) about an easier way to detect fields that a relation is
> unique for. I've been calling these "UniqueKeys" in a few emails [1].
> The idea was to tag these onto RelOptInfo to mention which columns or
> exprs a relation is unique by so that we didn't continuously need to
> look at unique indexes in all the places that call
> relation_has_unique_index_for(). The idea there was that unique joins
> would know when a join was unable to duplicate rows. If the outer side
> of a join didn't duplicate the inner side, then the join RelOptInfo
> could keep the UniqueKeys from the inner rel, and vice-versa. If both
> didn't duplicate then the join rel would obtain the UniqueKeys from
> both sides of the join. The idea here is that this would be a better
> way to detect unique joins, and also when it came to the grouping
> planner we'd also know if the distinct or group by should be a no-op.
> DISTINCT could be skipped, and GROUP BY could do a group aggregate
> without any sort.
>
> I think these UniqueKeys ties into this work, perhaps not adding
> UniqueKeys to RelOptInfo, but just to Path so that we create paths
> that have UniqueKeys during create_index_paths() based on some
> uniquekeys that are stored in PlannerInfo, similar to how we create
> index paths in build_index_paths() by checking if the index
> has_useful_pathkeys(). Doing it this way would open up more
> opportunities to use skip scans. For example, semi-joins and
> anti-joins could make use of them if the uniquekeys covered the entire
> join condition. With this idea, the code you've added in
> create_distinct_paths() can just search for the cheapest path that has
> the correct uniquekeys, or if none exist then just do the normal
> sort->unique or hash agg. I'm not entirely certain how we'd instruct
> a semi/anti joined relation to build such paths, but that seems like a
> problem that could be dealt with when someone does the work to allow
> skip scans to be used for those.
>
> Also, I'm not entirely sure that these UniqueKeys should make use of
> PathKey since there's no real need to know about pk_opfamily,
> pk_strategy, pk_nulls_first as those all just describe how the keys
> are ordered. We just need to know if they're distinct or not. All
> that's left after removing those fields is pk_eclass, so could
> UniqueKeys just be a list of EquivalenceClass? or perhaps even a
> Bitmapset with indexes into PlannerInfo->ec_classes (that might be
> premature for not since we've not yet got
> https://commitfest.postgresql.org/23/1984/ or
> https://commitfest.postgresql.org/23/2019/ ) However, if we did use
> PathKey, that does allow us to quickly check if the UniqueKeys are
> contained within the PathKeys, since pathkeys are canonical which
> allows us just to compare their memory address to know if two are
> equal, however, if you're storing eclasses we could probably get the
> same just by comparing the address of the eclass to the pathkey's
> pk_eclass.

Interesting, thanks for sharing this.

> I also agree with James that this should not be limited to Index Only
> Scans. From testing the patch, the following seems pretty strange to
> me:
> ...
> explain analyze select distinct on (a) a,b from abc order by a,b;
> explain analyze select distinct on (a) a,b,c from abc order by a,b;
> ...

Yes, but I believe this limitation is not intrinsic to the idea of the patch,
and the very same approach can be used for IndexScan in the second example.
I've already prepared a new version to enable it for IndexScan with minimal
modifications, just need to rebase it on top of the latest changes and then
can post it. Although still there would be some limitations I guess (e.g. the
first thing I've stumbled upon is that an index scan with a filter wouldn't
work well, because checking qual causes with a filter happens after
ExecScanFetch)

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Corey Huinker 2019-06-16 15:47:45 Re: Extracting only the columns needed for a query
Previous Message Bruce Momjian 2019-06-16 13:46:26 Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)