Batch update of indexes

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Batch update of indexes
Date: 2016-01-20 09:28:38
Message-ID: 569F5346.1010005@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi hackers,

I want to know opinion of community about possible ways of solving quite
common problem: increasing insert speed while still providing indexes
for efficient execution of queries.

Many applications have to deal with high input stream of data. Most of
the time while record inserting in the database is taken for update of
indexes. And without indexes we are not able to efficiently execute
queries.
So in many cases it is desirable to have "batch or concurrent" index
update. And it is acceptable that an index is slightly behind current
state of the table.

One interesting approach of solving this problem is discussed in this
article:

https://mark.zealey.org/2016/01/08/how-we-tweaked-postgres-upsert-performance-to-be-2-3-faster-than-mongodb

Them are using materialized views to build indexes in background.
Interesting idea, but copying content of the whole table just to be able
to build index concurrently seems to be overkill.

I thought about more straightforward ways of solving this problem. It
will be nice if we can preserve of of them main postulates of Postgres
and other RDBMSes: indexes are just optimization and result of query
should not depend on presence of indexes.

First idea is to use inheritance. I have investigated different ways of
splitting table into "archival" and "operational" parts, but all of them
requiring physical copying of data from one table to another.

Another idea is to use partial indexes
(http://www.postgresql.org/docs/current/static/indexes-partial.html)
Assume that we have stream of input data where each record have
increased timestamp:

create table t(
ts timestamp primary key,
c1 real,
c2 integer,
c3 varchar,
...
cN char(5)
);

We want to provide the highest insert speed for "t" but provide indexes
for c1..cN fields.
We can declared partial indexes:

create index idx1 on t(c1) where ts < '20/01/2016';
create index idx2 on t(c2) where ts < '20/01/2016';
...
create index idxN on t(cN) where ts < '20/01/2016';

As far as this indexes do not cover current date, them will not be
affected during insert operations.
But we can still efficiently run queries like

select * from t where c1>100 and ts < '20/01/2016';

Then, in background, may be at night, we can do

alter index idx1 where ts < '21/01/2016';

Please notice that such alter table statement, changing condition for
partial index, is not supported now.
But I do not see any principle problems with supporting such construction.
We should just include in the index all records which match new
condition and do not match old condition:

ts < '21/01/2016' and not (ts < '20/01/2016')

If there is index for "ts" field it can be done quite efficiently.
This approach doesn't cause contradictions with concepts of indexes in
RDBMS.

But there is one more problem with this approach with I think should be
addressed.
Right now optimizer builds the following execution plan for query with
partial indexes:

postgres=# explain select * from t where c1 < 10 and ts <
'20/01/2016'::timestamp;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=7.20..732.14 rows=12263 width=12)
Recheck Cond: ((c1 < '10'::double precision) AND (ts < '2016-01-20
00:00:00'::timestamp without time zone))
-> Bitmap Index Scan on idx1 (cost=0.00..4.13 rows=12263 width=0)
Index Cond: (c1 < '10'::double precision)
(4 rows)

As you can see optimizer insert recheck in query execution plan while it
is not needed in this case: search condition is exactly the same as
partial index condition.
Optimal plan should be:

Index Scan using idx1 on t (cost=0.00..4.13 rows=12263 width=0)
Index Cond: (c1 < '10'::double precision)

What do you think about this approach? Will it be useful to work in this
direction?
Or there are some better solutions for the problem?

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Etsuro Fujita 2016-01-20 09:50:36 Re: Odd behavior in foreign table modification (Was: Re: Optimization for updating foreign tables in Postgres FDW)
Previous Message Pavel Stehule 2016-01-20 09:20:03 Re: Why format() adds double quote?