Re: Batch update of indexes

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Batch update of indexes
Date: 2016-01-26 15:13:26
Message-ID: 56A78D16.8010607@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi hackers,

I have implemented "ALTER INDEX ... WHERE ..." clause allowing to change
condition for partial index.
Actually it allows us to append index without fully rebuilding it.
As I explained in the previous mails, partial indexes can be used to
increase insert speed.
Right now I get the following results (with one insert stream):

Insert with 1 index (primary key, monotonically ascending):
324275 TPS
Insert with 9 indexes (primary key + 8 indexes with random keys):
52495 TPS
Insert with primary key and 8 concurrently updated partial indexes:
194458 TPS
Insert with primary key and 8 "frozen" partial
indexes: 278446 TPS

So, as you can see insert with indexes is about 6 times slower than
insert without indexes.
And partial indexes allows to eliminate this gap.
When partial indexes are not affected (assuming that them will be
reconstructed "at night"),
performance is almost the same, as without indexes.
And if "ALTER INDEX" is done concurrently with inserts, it certainly
decrease insert speed,
but still it is 4 times faster than with normal indexes.

Such high TPS values were obtained using "insert from select" to bypass
libpq overhead.
With libpq (when each insert is sent as independent statement) results
are less impressive:

Insert with 1 index (primary key, monotonically ascending):
37892 TPS
Insert with 9 indexes (primary key + 8 indexes with random keys):
20231 TPS
Insert with primary key and 8 concurrently updated partial indexes:
26934 TPS
Insert with primary key and 8 "frozen" partial
indexes: 28863 TPS

But still partial indexes allows to almost eliminate two times
differences...

This results can be reproduced using our public repository:
https://github.com/postgrespro/postgres_cluster

Most of the code related with support of "ALTER INDEX .. WHERE" is in
AlterIndex function in
postgres_cluster/src/backend/commands/indexcmds.c
I have also added insbench utility for measuring insert performance,
using which this results were obtained.
It is located in postgres_cluster/src/bin/insbench directory.

Known issues:
1. I do not handle case when new condition for partial index is more
restricted than original.
There is no way in Postgres to exclude records from index (except
VACUUM), so in this case index has to be reconstructed from scratch.
2. Currently I am using SPI to locate records which should be included
in index.
3. I am not completely sure that there are no
synchronization/isolation problems in AlterIndex function

If this approach is considered to be interesting by community, I will
try to address these issues.

On 20.01.2016 12:28, Konstantin Knizhnik wrote:
> Hi hackers,
>
> I want to know opinion of community about possible ways of solving
> quite common problem: increasing insert speed while still providing
> indexes for efficient execution of queries.
>
> Many applications have to deal with high input stream of data. Most of
> the time while record inserting in the database is taken for update of
> indexes. And without indexes we are not able to efficiently execute
> queries.
> So in many cases it is desirable to have "batch or concurrent" index
> update. And it is acceptable that an index is slightly behind current
> state of the table.
>
> One interesting approach of solving this problem is discussed in this
> article:
>
> https://mark.zealey.org/2016/01/08/how-we-tweaked-postgres-upsert-performance-to-be-2-3-faster-than-mongodb
>
> Them are using materialized views to build indexes in background.
> Interesting idea, but copying content of the whole table just to be
> able to build index concurrently seems to be overkill.
>
> I thought about more straightforward ways of solving this problem. It
> will be nice if we can preserve of of them main postulates of Postgres
> and other RDBMSes: indexes are just optimization and result of query
> should not depend on presence of indexes.
>
> First idea is to use inheritance. I have investigated different ways
> of splitting table into "archival" and "operational" parts, but all of
> them requiring physical copying of data from one table to another.
>
> Another idea is to use partial indexes
> (http://www.postgresql.org/docs/current/static/indexes-partial.html)
> Assume that we have stream of input data where each record have
> increased timestamp:
>
> create table t(
> ts timestamp primary key,
> c1 real,
> c2 integer,
> c3 varchar,
> ...
> cN char(5)
> );
>
> We want to provide the highest insert speed for "t" but provide
> indexes for c1..cN fields.
> We can declared partial indexes:
>
> create index idx1 on t(c1) where ts < '20/01/2016';
> create index idx2 on t(c2) where ts < '20/01/2016';
> ...
> create index idxN on t(cN) where ts < '20/01/2016';
>
> As far as this indexes do not cover current date, them will not be
> affected during insert operations.
> But we can still efficiently run queries like
>
> select * from t where c1>100 and ts < '20/01/2016';
>
> Then, in background, may be at night, we can do
>
> alter index idx1 where ts < '21/01/2016';
>
> Please notice that such alter table statement, changing condition for
> partial index, is not supported now.
> But I do not see any principle problems with supporting such construction.
> We should just include in the index all records which match new
> condition and do not match old condition:
>
> ts < '21/01/2016' and not (ts < '20/01/2016')
>
> If there is index for "ts" field it can be done quite efficiently.
> This approach doesn't cause contradictions with concepts of indexes in
> RDBMS.
>
> But there is one more problem with this approach with I think should
> be addressed.
> Right now optimizer builds the following execution plan for query with
> partial indexes:
>
> postgres=# explain select * from t where c1 < 10 and ts <
> '20/01/2016'::timestamp;
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------
> Bitmap Heap Scan on t (cost=7.20..732.14 rows=12263 width=12)
> Recheck Cond: ((c1 < '10'::double precision) AND (ts < '2016-01-20
> 00:00:00'::timestamp without time zone))
> -> Bitmap Index Scan on idx1 (cost=0.00..4.13 rows=12263 width=0)
> Index Cond: (c1 < '10'::double precision)
> (4 rows)
>
> As you can see optimizer insert recheck in query execution plan while
> it is not needed in this case: search condition is exactly the same as
> partial index condition.
> Optimal plan should be:
>
> Index Scan using idx1 on t (cost=0.00..4.13 rows=12263 width=0)
> Index Cond: (c1 < '10'::double precision)
>
>
> What do you think about this approach? Will it be useful to work in
> this direction?
> Or there are some better solutions for the problem?
>
> --
> Konstantin Knizhnik
> Postgres Professional:http://www.postgrespro.com
> The Russian Postgres Company

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Craig Ringer 2016-01-26 15:43:18 Re: pglogical - logical replication contrib module
Previous Message Yury Zhuravlev 2016-01-26 15:04:13 Proposal:Use PGDLLEXPORT for libpq