Re: Batch update of indexes

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Batch update of indexes
Date: 2016-04-02 20:21:11
Message-ID: 570029B7.3020109@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 04/02/2016 09:57 PM, Tom Lane wrote:
> Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> writes:
>> Attached please find patch for "ALTER INDEX ... WHERE ..." clause.
>> It is now able to handle all three possible situations:
>> 1. Making index partial (add WHERE condition to the ordinary index)
>> 2. Extend partial index range (less restricted index predicate)
>> 3. Arbitrary change of partial index predicate
> I've not been following this thread previously, but this proposal
> scares me quite a lot. I am certain there are places in our code
> that assume that the properties of an index don't change after it's
> been created. One area that this almost certainly breaks is HOT updates:
> adding a previously-unindexed column to an index predicate might break
> existing HOT chains, and I see nothing in this patch that could deal
> with that. I seem to recall there are other places that would be
> broken by changing an index's DDL definition after creation, but can't
> recall specifics right now.
>
> I am also, frankly, not seeing a use-case for this functionality that
> would justify trying to find and remove those assumptions.
>
> There's a lot of things I don't care for about the way the patch is
> written, in particular its willingness to use SPI (which opens a lot of
> potential for search-path problems, failure to see uncommitted tuples,
> etc). But we need not get to that if we don't believe the functionality
> can work.

Thank you for review, Tom.

I completely agree with all your arguments against this patch.
I have proposed this patch mostly as prove of concept.
Yes, I have not take in account hot updates and may be there are other possible issues which I not considered.

The main question is whether the proposed way of batch update of indexes is viable or it is conceptually wrong approach
(because it beaks assumption that index properties can't be changed or because it is not convenient to use...).

I hope that everybody agree that maintaining of indexes is the main limiting factor for insert speed.
If table has no indexes, then insert speed can be as high as disk write speed (100Mb/sec or 1000 for SSD).
So if size of record is about 10 bytes, then we can get about 10 millions TPS.
But presence of indexes will dramatically change this picture: if database is large enough so that even index can not fit in memory
and records are inserted in random key order, then each insert in index will require reading of 3-4 pages from random locations on the disk.
With average HDD positioning time 10 msec, we get 100 reads per second and ... 20-30 TPS. It is just with one index.
If we have 10 indexes, then TPS can be less than fingers on a hand.

Certainly it is very pessimistic estimation.
But still it is true that we can not provide good insert speed if we have to update indexes immediately.
And without indexes we can not efficiently execute most of queries.

I do not see any way in Postgres to solve this problem now. The hack with creating materialized views requires a lot of extra time and space.
It will not work for really large table.

So we need some way to postpone insertion of new records in the index. Then we can do such insertion in background or in idle time (at night), try to use bulk insert if index implementation supports it (for example sorting records by key before insert can
significantly increase locality and so improve speed of insert in index). But the principle moment here is that such delayed update of index violates the main RDBMS rule that results of query execution with and without indexes should be the same. The trick
with partial indexes allows to eliminate this contradiction. But it requires more actions from user. So are users ready to do some exatra job just because of "idealogical" reasons? Because if user wants to have delayed update of indexes, then he actually
approves that it is ok for him that query results may not include some most recent updates.

Another aspect is which database objects are allowed to be altered and which not. Right now with tables we can alter almost everything.
With indexes - almost nothing. It is assumed that index can always be reconstructed. But for very big table reconstruction of indexes from scratch will take unacceptable amount of time. So should we make it possible to alter some index characteristics
which do not require to rebuild index from scratch (and it is definitely true for partial index predicate)? Or price of supporting it is so high, that it can not be compensated by obtained benefits?

So what do you think?
1. Should I continue work in this direction and fix all possible issues with hot updates,... to make it possible to alter partial index predicates and support batch inserts i this way?
2. Or it is better to just add extra option to the index, allowing it to be slightly out-of-sync? It will allow, for example, to eliminate pending list for GIN which can cause very significant degradation of query speed, while for most full-text search
engine it is acceptable that changes are not immediately visible.
Certainly much more work is required here except of just adding a new index option...
3. Or both of the approaches are wrong and we should leave everything as it is?
4. Or may be there is some other approach which is more acceptable?

>
>> This patch includes src/bin/insbench utility for testing insert
>> performance. It can be easily excluded from the patch to reduce it size.
> C++ is not likely to get accepted into our tree ...
>
> regards, tom lane

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2016-04-02 21:41:48 Re: Add schema-qualified relnames in constraint error messages.
Previous Message Shulgin, Oleksandr 2016-04-02 18:57:53 Re: More stable query plans via more predictable column statistics