Re: [HACKERS] Secondary index access optimizations

From: Antonin Houska <ah(at)cybertec(dot)at>
To: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] Secondary index access optimizations
Date: 2018-01-09 16:48:25
Message-ID: 25337.1515516505@localhost
Lists: pgsql-hackers

Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> wrote:

> On 14.08.2017 19:33, Konstantin Knizhnik wrote:
>
> On 14.08.2017 12:37, Konstantin Knizhnik wrote:
>
> Hi hackers,
>
> I am trying to compare different ways of optimizing work with huge append-only tables in PostgreSQL, where the primary key is something like a timestamp and queries usually access the most recent data through some secondary keys. The size of the secondary index is one of the most
> critical factors limiting insert/search performance. Since data is inserted in ascending timestamp order, access to the primary key is well localized and the accessed pages stay in memory. But if we create a secondary index over the whole table, access to it requires random reads
> from disk and significantly decreases performance.
>
> There are two well-known solutions to this problem:
> 1. Table partitioning
> 2. Partial indexes
>
> These are the approaches I want to compare. First of all I want to check whether the optimizer is able to generate an efficient query execution plan covering different time intervals.
> Unfortunately, in both cases the generated plan is not optimal.
>
> 1. Table partitioning:
>
> create table base (k integer primary key, v integer);
> create table part1 (check (k between 1 and 10000)) inherits (base);
> create table part2 (check (k between 10001 and 20000)) inherits (base);
> create index pi1 on part1(v);
> create index pi2 on part2(v);
> insert into part1 values (generate_series(1,10000), random());
> insert into part2 values (generate_series(10001,20000), random());
> explain select * from base where k between 1 and 20000 and v = 100;
>                               QUERY PLAN
> -----------------------------------------------------------------------
>  Append  (cost=0.00..15.65 rows=3 width=8)
>    ->  Seq Scan on base  (cost=0.00..0.00 rows=1 width=8)
>          Filter: ((k >= 1) AND (k <= 20000) AND (v = 100))
>    ->  Index Scan using pi1 on part1  (cost=0.29..8.31 rows=1 width=8)
>          Index Cond: (v = 100)
>          Filter: ((k >= 1) AND (k <= 20000))
>    ->  Index Scan using pi2 on part2  (cost=0.29..7.34 rows=1 width=8)
>          Index Cond: (v = 100)
>          Filter: ((k >= 1) AND (k <= 20000))
>
> Questions:
> - Is there some way to avoid the sequential scan of the parent table? Yes, it is empty, so the sequential scan does not take much time, but it still requires some extra work and so increases query execution time.
> - Why does the index scan on the partitions include the filter condition when the partition's check constraint already guarantees that all records of the partition match the search predicate?
>
> 2. Partial indexes:
>
> create table t (k integer primary key, v integer);
> insert into t values (generate_series(1,20000),random());
> create index i1 on t(v) where k between 1 and 10000;
> create index i2 on t(v) where k between 10001 and 20000;
> postgres=# explain select * from t where k between 1 and 10000 and v = 100;
>                         QUERY PLAN
> ------------------------------------------------------------
>  Index Scan using i1 on t  (cost=0.29..7.28 rows=1 width=8)
>    Index Cond: (v = 100)
> (2 rows)
>
> Here we get a perfect plan. Let's try to extend the search interval:
>
> postgres=# explain select * from t where k between 1 and 20000 and v = 100;
>                            QUERY PLAN
> ------------------------------------------------------------------
>  Index Scan using t_pkey on t  (cost=0.29..760.43 rows=1 width=8)
>    Index Cond: ((k >= 1) AND (k <= 20000))
>    Filter: (v = 100)
> (3 rows)
>
> Unfortunately, in this case Postgres is not able to apply the partial indexes.
> This is what I expected to get:
>
> postgres=# explain select * from t where k between 1 and 10000 and v = 100 union all select * from t where k between 10001 and 20000 and v = 100;
>                              QUERY PLAN
> ----------------------------------------------------------------------
>  Append  (cost=0.29..14.58 rows=2 width=8)
>    ->  Index Scan using i1 on t  (cost=0.29..7.28 rows=1 width=8)
>          Index Cond: (v = 100)
>    ->  Index Scan using i2 on t t_1  (cost=0.29..7.28 rows=1 width=8)
>          Index Cond: (v = 100)
>
> I wonder if there are any fundamental problems with supporting these two things in the optimizer:
> 1. Removing a search condition on the primary key if it is fully implied by the derived table's check constraint.
> 2. Appending index scans of several partial indexes if the specified interval is covered by their conditions.
>
> I wonder if someone familiar with this part of the optimizer can easily fix it.
> Otherwise I am going to spend some time on solving these problems (if the community thinks such optimizations would be useful).
>
> Replying to myself: the following small patch removes redundant checks from index scans for derived tables:
>
> diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
> index 939045d..1f7c9cf 100644
> --- a/src/backend/optimizer/util/plancat.c
> +++ b/src/backend/optimizer/util/plancat.c
> @@ -1441,6 +1441,20 @@ relation_excluded_by_constraints(PlannerInfo *root,
>  	if (predicate_refuted_by(safe_constraints, rel->baserestrictinfo, false))
>  		return true;
> 
> +	/*
> +	 * Remove from restriction list items implied by table constraints
> +	 */
> +	safe_restrictions = NULL;
> +	foreach(lc, rel->baserestrictinfo)
> +	{
> +		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
> +		if (!predicate_implied_by(list_make1(rinfo->clause), safe_constraints, false)) {
> +			safe_restrictions = lappend(safe_restrictions, rinfo);
> +		}
> +	}
> +	rel->baserestrictinfo = safe_restrictions;
> +
> +
>  	return false;
>  }
>
> =================================================
>
> I am not sure whether this is the right place to adjust the restriction list, or whether such a transformation is always correct.
> But for the queries I have tested, the produced plans are correct:
>
> postgres=# explain select * from base where k between 1 and 20000 and v = 100;
>                               QUERY PLAN
> -----------------------------------------------------------------------
>  Append  (cost=0.00..15.64 rows=3 width=8)
>    ->  Seq Scan on base  (cost=0.00..0.00 rows=1 width=8)
>          Filter: ((k >= 1) AND (k <= 20000) AND (v = 100))
>    ->  Index Scan using pi1 on part1  (cost=0.29..8.30 rows=1 width=8)
>          Index Cond: (v = 100)
>    ->  Index Scan using pi2 on part2  (cost=0.29..7.34 rows=1 width=8)
>          Index Cond: (v = 100)
> (7 rows)
>
> postgres=# explain select * from base where k between 1 and 15000 and v = 100;
>                               QUERY PLAN
> -----------------------------------------------------------------------
>  Append  (cost=0.00..15.64 rows=3 width=8)
>    ->  Seq Scan on base  (cost=0.00..0.00 rows=1 width=8)
>          Filter: ((k >= 1) AND (k <= 15000) AND (v = 100))
>    ->  Index Scan using pi1 on part1  (cost=0.29..8.30 rows=1 width=8)
>          Index Cond: (v = 100)
>    ->  Index Scan using pi2 on part2  (cost=0.29..7.34 rows=1 width=8)
>          Index Cond: (v = 100)
>          Filter: (k <= 15000)
> (8 rows)
>
> There is one more problem with the predicate_implied_by function and standard Postgres partitions.
> Right now Postgres specifies partition constraints using intervals with an open upper boundary:
>
> postgres=# create table bt (k integer, v integer) partition by range (k);
> postgres=# create table dt1 partition of bt for values from (1) to (10001);
> postgres=# create table dt2 partition of bt for values from (10001) to (20001);
> postgres=# create index dti1 on dt1(v);
> postgres=# create index dti2 on dt2(v);
> postgres=# insert into bt values (generate_series(1,20000), random());
>
> postgres=# \d+ dt2
> Table "public.dt2"
> Column | Type | Collation | Nullable | Default | Storage | Stats target | De
> scription
> --------+---------+-----------+----------+---------+---------+--------------+---
> ----------
> k | integer | | | | plain | |
> v | integer | | | | plain | |
> Partition of: bt FOR VALUES FROM (10001) TO (20001)
> Partition constraint: ((k IS NOT NULL) AND (k >= 10001) AND (k < 20001))
> Indexes:
> "dti2" btree (v)
>
> If I now run the query with the predicate "between 1 and 20000", I still see an extra check in the plan:
>
> postgres=# explain select * from bt where k between 1 and 20000 and v = 100;
>                              QUERY PLAN
> ----------------------------------------------------------------------
>  Append  (cost=0.29..15.63 rows=2 width=8)
>    ->  Index Scan using dti1 on dt1  (cost=0.29..8.30 rows=1 width=8)
>          Index Cond: (v = 100)
>    ->  Index Scan using dti2 on dt2  (cost=0.29..7.33 rows=1 width=8)
>          Index Cond: (v = 100)
>          Filter: (k <= 20000)
> (6 rows)
>
> This is because operator_predicate_proof is not able to understand that k < 20001 and k <= 20000 are equivalent for the integer type.

> If I rewrite the query as (k >= 1 and k < 20001), the plan is optimal:
>
> postgres=# explain select * from bt where k >= 1 and k < 20001 and v = 100;
>                              QUERY PLAN
> ----------------------------------------------------------------------
>  Append  (cost=0.29..15.63 rows=2 width=8)
>    ->  Index Scan using dti1 on dt1  (cost=0.29..8.30 rows=1 width=8)
>          Index Cond: (v = 100)
>    ->  Index Scan using dti2 on dt2  (cost=0.29..7.33 rows=1 width=8)
>          Index Cond: (v = 100)
> (5 rows)
>
> It is fixed by one more patch to predtest.c:

Have you considered using range types (internally in
operator_predicate_proof()) instead of hard-coding operator OIDs? The range
types do have the knowledge that k < 20001 and k <= 20000 are equivalent for
the integer type:

postgres=# SELECT int4range '(, 20001)' = int4range '(, 20000]';
 ?column?
----------
 t
(1 row)
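
This works because int4range is a discrete range type with a canonical form:
both spellings collapse to the same '[)' representation. A continuous range
type such as numrange has no canonical form, so the same comparison returns
false there, which is why this trick is specific to discrete types like
integer. Two more examples to illustrate the point:

postgres=# SELECT int4range '[1, 10000]' = int4range '[1, 10001)';
 ?column?
----------
 t
(1 row)

postgres=# SELECT numrange '(, 20001)' = numrange '(, 20000]';
 ?column?
----------
 f
(1 row)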

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at
