Re: WIP: Covering + unique indexes.

From: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Covering + unique indexes.
Date: 2016-01-12 16:59:40
Message-ID: 569530FC.2010900@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

08.01.2016 00:12, David Rowley:
> On 7 January 2016 at 06:36, Jeff Janes <jeff(dot)janes(at)gmail(dot)com
> <mailto:jeff(dot)janes(at)gmail(dot)com>> wrote:
>
> On Tue, Jan 5, 2016 at 11:55 PM, David Rowley
> <david(dot)rowley(at)2ndquadrant(dot)com
> <mailto:david(dot)rowley(at)2ndquadrant(dot)com>> wrote:
> > create table ab (a int,b int);
> > insert into ab select x,y from generate_series(1,20) x(x),
> > generate_series(10,1,-1) y(y);
> > create index on ab (a) including (b);
> > explain select * from ab order by a,b;
> > QUERY PLAN
> > ----------------------------------------------------------
> > Sort (cost=10.64..11.14 rows=200 width=8)
> > Sort Key: a, b
> > -> Seq Scan on ab (cost=0.00..3.00 rows=200 width=8)
> > (3 rows)
>
> If you set enable_sort=off, then you get the index-only scan with no
> sort. So it believes the index can be used for ordering (correctly, I
> think), just sometimes it thinks it is not faster to do it that way.
>
> I'm not sure why this would be a correctness problem. The covered
> column does not participate in uniqueness checks, but it still usually
> participates in index ordering. (That is why dummy op-classes are
> needed if you want to include non-sortable-type columns as being
> covered.)
>
>
> If that's the case, then it appears that I've misunderstood INCLUDING.
> From reading _bt_doinsert() it appeared that it'll ignore the
> INCLUDING columns and just find the insert position based on the key
> columns. Yet that's not the way that it appears to work. I was also a
> bit confused, as from working with another database which has very
> similar syntax to this, that one only includes the columns to allow
> index only scans, and the included columns are not indexed, therefore
> can't be part of index quals and the index only provides a sorted path
> for the indexed columns, and not the included columns.

Thank you for properly testing. Order by clause in this case definitely
doesn't work as expected.
The problem is fixed by patching a planner function
"build_index_pathkeys()'. It disables using of index if sorting of
included columns is required.
Test example works correctly now - it always performs seq scan and sort.

> Saying that, I'm now a bit confused to why the following does not
> produce 2 indexes which are the same size:
>
> create table t1 (a int, b text);
> insert into t1 select x,md5(random()::text) from
> generate_series(1,1000000) x(x);
> create index t1_a_inc_b_idx on t1 (a) including (b);
> create index t1_a_b_idx on t1 (a,b);
> select pg_relation_Size('t1_a_b_idx'),pg_relation_size('t1_a_inc_b_idx');
> pg_relation_size | pg_relation_size
> ------------------+------------------
> 59064320 | 58744832
> (1 row)

I suppose you've already found that in discussion above. Included
columns are stored only in leaf index pages. The difference is the size
of attributes 'b' which are situatedin inner pages of index "t1_a_b_idx".

> Also, if we want INCLUDING() to mean "uniqueness is not enforced on
> these columns, but they're still in the index", then I don't really
> think allowing types without a btree opclass is a good idea. It's
> likely too surprised filled and might not be what the user actually
> wants. I'd suggest that these non-indexed columns would be better
> defined by further expanding the syntax, the first (perhaps not very
> good) thing that comes to mind is:
>
> create unique index idx_name on table (unique_col) also index
> (other,idx,cols) including (leaf,onlycols);
>
> Looking up thread, I don't think I was the first to be confused by this.

Included columns are still in the index physically - they are stored in
the index relation. But they are not indexedin the true sense of the
word. It's impossible to use them for index scan or ordering. At the
beginning, I've got an idea that included columns are supposed to be
used for combination of unique index on one columns and covering on
others. In a very rare instances one could prefer a non-unique index
with included columns "t1_a_inc_b_idx"to a regular multicolumn index
"t1_a_b_idx". Frankly, I didn't see such use cases at all. Index size
reduction is not considerable, while we lose some useful index
functionality on included column. I think that it should be mentioned as
a note in documentation, but I need help to phrase it clear.

But now I see the reason to create non-unique index with included
columns - lack of suitable opclass on column "b".
It's impossible to add it into the index as a key column, but that's not
a problem with INCLUDING clause.
Look at example.

create table t1 (a int, b box);
create index t1_a_inc_b_idx on t1 (a) including (b);
create index on tbl (a,b);
ERROR: data type box has no default operator class for access method
"btree"
HINT: You must specify an operator class for the index or define a
default operator class for the data type.
create index on tbl (a) including (b);
CREATE INDEX

This functionality is provided by the attached patch "omit_opclass_4.0",
which must be applied over covering_unique_4.0.patch.

I see what you were confused about, I'd had the same question at the
very beginning of the discussion of this patch.
Now it seems a bit more clear to me. INCLUDING columns are not used for
the searching or ordering of records, so there is no need to check
whether they have an opclass. INCLUDING columns perform as expected and
it agrees with other database experience. And this patch is completed.

But it isn't perfect definitely... I found test case to explain that.
See below.
That's why we need optional_opclass functionality, which will use
opclass where possible and omit it in other cases.
This idea have been already described in a message Re: [PROPOSAL]
Covering + unique indexes
<http://www.postgresql.org/message-id/55F84DF4.5030207@postgrespro.ru>as
"partially unique index".
I suggest to separate optional_opclass task to ease syntax discussion
and following review. And I'll implement it in the next patch a bit later.

Test case:
1) patch covering_unique_4.0 + test_covering_unique_4.0
If included columns' opclasses are used, new query plan is the same with
the old one.
and have nearly the same execution time:

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Index Only Scan using oldcoveringidx on oldt (cost=0.43..301.72
rows=1 width=8) (actual time=0.021..0.676 rows=6 loops=1)
Index Cond: ((c1 < 10000) AND (c3 < 20))
Heap Fetches: 0
Planning time: 0.101 ms
Execution time: 0.697 ms
(5 rows)

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Index Only Scan using newidx on newt (cost=0.43..276.51 rows=1
width=8) (actual time=0.020..0.665 rows=6 loops=1)
Index Cond: ((c1 < 10000) AND (c3 < 20))
Heap Fetches: 0
Planning time: 0.082 ms
Execution time: 0.687 ms
(5 rows)

2) patch covering_unique_4.0 + patch omit_opclass_4.0 +
test_covering_unique_4.0
Otherwise, new query can not use included column in Index Cond and uses
filter instead. It slows down the query significantly.
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Index Only Scan using oldcoveringidx on oldt (cost=0.43..230.39
rows=1 width=8) (actual time=0.021..0.722 rows=6 loops=1)
Index Cond: ((c1 < 10000) AND (c3 < 20))
Heap Fetches: 0
Planning time: 0.091 ms
Execution time: 0.744 ms
(5 rows)

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Index Only Scan using newidx on newt (cost=0.43..374.68 rows=1
width=8) (actual time=0.018..2.595 rows=6 loops=1)
Index Cond: (c1 < 10000)
Filter: (c3 < 20)
Rows Removed by Filter: 9993
Heap Fetches: 0
Planning time: 0.078 ms
Execution time: 2.612 ms

--
Anastasia Lubennikova
Postgres Professional:http://www.postgrespro.com
The Russian Postgres Company

Attachment Content-Type Size
covering_unique_4.0.patch text/x-patch 44.7 KB
omit_opclass_4.0.patch text/x-patch 24.0 KB
test_covering_4.0.sql application/sql 3.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2016-01-12 17:05:32 Re: proposal: PL/Pythonu - function ereport
Previous Message Catalin Iacob 2016-01-12 16:59:30 Re: proposal: PL/Pythonu - function ereport