Re: Batch update of indexes

From: Torsten Zühlsdorff <mailinglists(at)toco-domains(dot)de>
To: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Batch update of indexes
Date: 2016-01-25 07:28:36
Message-ID: 56A5CEA4.3040802@toco-domains.de
Lists: pgsql-hackers

On 21.01.2016 18:47, Konstantin Knizhnik wrote:

>
> On 21.01.2016 19:09, Anastasia Lubennikova wrote:
>> What I meant is more like a BRIN-like combination of an index scan and
>> heap scan.
>> Maybe it could be called "deferred inserts" or "temporary read-only
>> index"
>> Maybe it's similar to the MySQL insert buffer:
>> http://dev.mysql.com/doc/refman/5.7/en/innodb-insert-buffering.html
>> I think it'll be clearer with an example. Please don't mind the syntax.
>>
>> CREATE TABLE tbl (c1 int);
>> CREATE INDEX idx on tbl(c1);
>>
>> SET enable_deferred_insert(idx) = on;
>> At this moment, we save the last_indexed_item (its TID) somewhere in
>> the index metapage.
>>
>> Since that moment, the data inserted into the table doesn't touch the
>> index.
>> We perform some heavy insert and then go back to the normal index
>> behavior.
>>
>> SET enable_deferred_insert(idx) = off;
>> This command takes all the data between the last_indexed_item and the
>> end of the table, and inserts it into the index in one pass.
>>
>> Of course there are new problems to deal with, but it's really useful
>> for the use case to balance irregular heavy write load, isn't it?
>>
>> BTW, could you explain, what is the reason to copy data into the
>> pending list and then copy it again while flushing pending list into
>> the index? Why not read this data directly from the table? I feel that
>> I've missed something important here.
>>
> No, I do not think that inserted data should be placed in the pending
> list and then copied to the main table.
> It should be stored directly in the main table, and the "pending list" is
> just some fast, transient index.
>
> From my point of view there are two possibilities:
> 1. Preserve strict table-index consistency: query results should not
> depend on the presence of the index.
> 2. Support out-of-date or deferred indexes, which can be updated in the
> background.
>
> The second approach is certainly more efficient, and IMHO it is
> acceptable for most users.
> But it violates one of the fundamental properties of RDBMSes...
> So I am not sure which approach to choose.
>
> The first case is also harder to implement, because we have to somehow
> merge the two indexes during an index scan
> and provide proper recovery of the main index in case of failure
> (assuming the pending list is maintained in memory and is lost after a
> fault).

We already have unlogged tables, which lose their data after an unclean
shutdown. Would it be acceptable to create something like that for this
"deferred index"? When they are in sync, everything is fine. When there
was a crash and the index is no longer up to date, a reindex is needed.
That would be painful for very big indexes, but for most indexes it
could be fine. Thoughts?
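To make the idea concrete, the user-visible workflow might look like the
following. This is only a sketch: enable_deferred_insert is the proposed
syntax from upthread, not an existing GUC or command, and the file path
is purely illustrative; only REINDEX exists today.

```sql
CREATE TABLE tbl (c1 int);
CREATE INDEX idx ON tbl (c1);

SET enable_deferred_insert(idx) = on;   -- proposed: stop index maintenance
COPY tbl FROM '/tmp/bulk.csv';          -- heavy write load bypasses the index
SET enable_deferred_insert(idx) = off;  -- proposed: catch the index up in bulk

-- With unlogged-style behavior, after an unclean shutdown while deferred
-- inserts were pending, the index would be marked invalid and would need:
REINDEX INDEX idx;
```

This mirrors the existing workaround of dropping and recreating an index
around a bulk load, but without losing the index for concurrent readers.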

Greetings,
Torsten
