Re: Batch update of indexes

From: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Batch update of indexes
Date: 2016-01-20 14:22:02
Message-ID: 569F980A.6080305@postgrespro.ru
Lists: pgsql-hackers

On 20.01.2016 12:28, Konstantin Knizhnik wrote:
> Hi hackers,
>
> I want to know the community's opinion about possible ways of solving a
> quite common problem: increasing insert speed while still providing
> indexes for efficient execution of queries.
>
> Many applications have to deal with a high input stream of data. Most of
> the time spent inserting a record into the database goes to updating
> indexes, yet without indexes we are not able to execute queries
> efficiently.
> So in many cases it is desirable to have "batch or concurrent" index
> updates, and it is acceptable for an index to lag slightly behind the
> current state of the table.

Hi, I'm glad to see that you are interested in this too.
I think this is a good feature, and it would be very useful to have.
I have already mentioned some related problems and possible improvements
in my presentation:
http://www.slideshare.net/AnastasiaLubennikova/indexes-dont-mean-slow-inserts
The last two slides concern this thread. Briefly, I've suggested thinking
about an insertion buffer. Something very similar is already implemented
in BRIN: it does not index the latest heap data while the number of
trailing pages is less than pages_per_range.
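
For illustration, here is a minimal sketch of that BRIN behaviour (the
table and index names are mine, not from the thread):

create table events(ts timestamp, payload text);
-- Heap pages are summarized in ranges of pages_per_range pages; the
-- trailing, not-yet-summarized range costs nothing to maintain on insert.
create index events_brin on events using brin(ts)
    with (pages_per_range = 128);
-- The pending range can then be summarized later, in the background:
select brin_summarize_new_values('events_brin');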
The next point I've thought about is bulk updates. The problem is that an
update like "UPDATE mytable SET a = a+1;" causes N searches from the root
of the B-tree, one per updated tuple. That looks very strange to me, and
I'd like to fix it somehow. The obvious solution is to update all tuples
on a page at a time and to remember the number of the last updated page.
But maybe that's a bit off-topic here.

> One interesting approach to solving this problem is discussed in this
> article:
>
> https://mark.zealey.org/2016/01/08/how-we-tweaked-postgres-upsert-performance-to-be-2-3-faster-than-mongodb
>
> They are using materialized views to build indexes in the background.
> It is an interesting idea, but copying the content of the whole table
> just to be able to build an index concurrently seems to be overkill.

This approach seems like a tricky crutch to me, and I agree that it
requires a lot of extra work.
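
For reference, the scheme from the article is roughly the following (my
own reconstruction with made-up names, so take it only as a sketch):

-- the hot table carries no secondary indexes, so inserts stay cheap
create table t_hot(ts timestamp primary key, c1 real);
create materialized view t_search as select * from t_hot;
-- a unique index is required for REFRESH ... CONCURRENTLY
create unique index on t_search(ts);
create index on t_search(c1);
-- periodically, in the background:
refresh materialized view concurrently t_search;

All queries then have to read the (possibly stale) t_search instead of
t_hot, which is part of why it feels like a crutch.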

> I thought about more straightforward ways of solving this problem. It
> would be nice if we could preserve one of the main postulates of Postgres
> and other RDBMSes: indexes are just an optimization, and the result of a
> query should not depend on the presence of indexes.
>
> The first idea is to use inheritance. I have investigated different ways
> of splitting a table into "archival" and "operational" parts, but all of
> them require physically copying data from one table to another.
>
> Another idea is to use partial indexes
> (http://www.postgresql.org/docs/current/static/indexes-partial.html).
> Assume that we have a stream of input data where each record has an
> increasing timestamp:
>
> create table t(
> ts timestamp primary key,
> c1 real,
> c2 integer,
> c3 varchar,
> ...
> cN char(5)
> );
>
> We want to provide the highest insert speed for "t" while still
> providing indexes on the c1..cN fields.
> We can declare partial indexes:
>
> create index idx1 on t(c1) where ts < '20/01/2016';
> create index idx2 on t(c2) where ts < '20/01/2016';
> ...
> create index idxN on t(cN) where ts < '20/01/2016';
>
> Since these indexes do not cover the current date, they will not be
> affected by insert operations.
> But we can still efficiently run queries like
>
> select * from t where c1>100 and ts < '20/01/2016';
>
> Then, in the background, maybe at night, we can do
>
> alter index idx1 where ts < '21/01/2016';

This idea sounds very interesting to me.
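For what it's worth, until such an ALTER INDEX exists, the effect can be
emulated with today's tools, though at the cost of rebuilding the whole
index rather than just appending the missing day's rows (which is exactly
the waste your proposal would avoid):

-- build a replacement with the wider predicate, without blocking writes
create index concurrently idx1_new on t(c1) where ts < '21/01/2016';
drop index concurrently idx1;
alter index idx1_new rename to idx1;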
>
> Please note that such an ALTER INDEX statement, changing the condition
> of a partial index, is not supported now.

Don't you think that this feature could be used in a very wrong way?
Don't take it as criticism, just a bit of thought.

> But I do not see any fundamental problems with supporting such a
> construction. We should just add to the index all records which match
> the new condition and do not match the old one:
>
> ts < '21/01/2016' and not (ts < '20/01/2016')
>
> If there is an index on the "ts" field, this can be done quite
> efficiently.
> This approach doesn't contradict the concept of indexes in an RDBMS.
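
Just to spell out that catch-up step: since "ts" here is the primary key,
the rows to be added to idx1 can be found with a cheap range scan. A
sketch of the scan such an ALTER INDEX would perform internally:

select * from t
where ts >= '20/01/2016' and ts < '21/01/2016';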
>
> But there is one more problem with this approach which I think should
> be addressed.
> Right now the optimizer builds the following execution plan for a query
> with partial indexes:
>
> postgres=# explain select * from t where c1 < 10 and ts <
> '20/01/2016'::timestamp;
>                            QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on t  (cost=7.20..732.14 rows=12263 width=12)
>    Recheck Cond: ((c1 < '10'::double precision) AND (ts < '2016-01-20
> 00:00:00'::timestamp without time zone))
>    ->  Bitmap Index Scan on idx1  (cost=0.00..4.13 rows=12263 width=0)
>          Index Cond: (c1 < '10'::double precision)
> (4 rows)
>
> As you can see, the optimizer inserts a recheck into the query execution
> plan even though it is not needed in this case: the search condition is
> exactly the same as the partial index condition.
> The optimal plan should be:
>
>  Index Scan using idx1 on t  (cost=0.00..4.13 rows=12263 width=0)
>    Index Cond: (c1 < '10'::double precision)

There was a discussion of a patch for index-only scans with partial
indexes:
http://postgresql.nabble.com/PATCH-index-only-scans-with-partial-indexes-td5857568.html
I haven't followed it closely, but it seems to be still open. I think it
will be interesting to you.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
