Write Ahead Logging for Hash Indexes

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Write Ahead Logging for Hash Indexes
Date: 2016-08-23 03:24:39
Message-ID: CAA4eK1JOBX=YU33631Qh-XivYXtPSALh514+jR8XeD7v+K3r_Q@mail.gmail.com
Lists: pgsql-hackers

$SUBJECT will make hash indexes reliable and usable on standby
servers. AFAIU, hash indexes are currently not recommended for
production use mainly because they are not crash-safe; with this
patch, I hope we can address that limitation and recommend them for
use in production.

This patch is built on my earlier patch [1] for making hash indexes
concurrent. The main reason for doing so is that the earlier patch
allows an incomplete split operation to be completed later and uses
light-weight locking, which makes it possible to log the operations at
a granular level.

WAL for different operations:

This is explained in the README as well, but I am repeating it here
for convenience.
=================================

Multiple WAL records are written for the create index operation: first
one for initializing the metapage, followed by one for each new bucket
created during the operation, followed by one for initializing the
bitmap page. If the system crashes partway through, the whole
operation is rolled back. I considered writing a single WAL record
for the whole operation, but for that we would need to limit the
number of initial buckets that can be created during the operation.
As only a fixed number of pages, XLR_MAX_BLOCK_ID (32), can be logged
with the current XLog machinery, it is better to write multiple WAL
records for this operation. The downside of restricting the number of
buckets is that we would need to perform split operations if the
number of tuples is more than what can be accommodated in the initial
set of buckets, and it is not unusual to have a large number of tuples
at create index time.
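
To make the logging machinery concrete, here is a minimal sketch (not
taken from the patch) of how the metapage-initialization step could be
written as its own WAL record with the generic XLog interfaces; the
record id, payload struct, and function name are mine, for
illustration only.

#include "postgres.h"
#include "access/hash.h"
#include "access/xloginsert.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

#define XLOG_HASH_INIT_META_PAGE 0x00   /* illustrative record id */

/* Illustrative payload; not necessarily the patch's record layout. */
typedef struct xl_hash_init_meta_page
{
    RegProcedure procid;
    uint16      ffactor;
} xl_hash_init_meta_page;

static void
log_hash_meta_page_init(Relation rel, Buffer metabuf)
{
    HashMetaPage metap = HashPageGetMeta(BufferGetPage(metabuf));

    if (RelationNeedsWAL(rel))
    {
        xl_hash_init_meta_page xlrec;
        XLogRecPtr  recptr;

        xlrec.procid = metap->hashm_procid;
        xlrec.ffactor = metap->hashm_ffactor;

        XLogBeginInsert();
        XLogRegisterData((char *) &xlrec, sizeof(xlrec));
        /* REGBUF_WILL_INIT: redo recreates the page, no full-page image */
        XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT);

        recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_META_PAGE);
        PageSetLSN(BufferGetPage(metabuf), recptr);
    }
}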

Ordinary item insertions (that don't force a page split or need a new
overflow page) are single WAL entries. They touch a single bucket
page and the metapage; the metapage is updated during replay just as
it is updated during the original operation.
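
As a rough illustration of such a single-entry insertion, the sketch
below registers both the modified bucket page and the metapage in one
record; the record id XLOG_HASH_INSERT and the function name are
assumptions, not necessarily what the patch uses.

#include "postgres.h"
#include "access/hash.h"
#include "access/itup.h"
#include "access/xloginsert.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

#define XLOG_HASH_INSERT 0x10       /* illustrative record id */

static void
log_hash_insert(Relation rel, Buffer bucketbuf, Buffer metabuf,
                OffsetNumber offnum, IndexTuple itup)
{
    if (RelationNeedsWAL(rel))
    {
        XLogRecPtr  recptr;

        XLogBeginInsert();
        /* offset at which the tuple was placed is the fixed payload */
        XLogRegisterData((char *) &offnum, sizeof(OffsetNumber));

        /* block 0: bucket (or overflow) page that received the tuple */
        XLogRegisterBuffer(0, bucketbuf, REGBUF_STANDARD);
        XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));

        /* block 1: metapage, whose tuple count is bumped at redo as well */
        XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);

        recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);
        PageSetLSN(BufferGetPage(bucketbuf), recptr);
        PageSetLSN(BufferGetPage(metabuf), recptr);
    }
}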

An insertion that requires adding an overflow page is logged as a
single WAL entry, preceded by a WAL entry for the new overflow page
required to insert the tuple. There is a corner case where, by the
time we try to use the newly allocated overflow page, it has already
been used by concurrent insertions; in that case another overflow page
is allocated and a separate WAL entry is written for it.

An insertion that causes a bucket split is logged as a single WAL
entry, followed by a WAL entry for allocating a new bucket, followed
by a WAL entry for each overflow bucket page in the new bucket to
which tuples are moved from the old bucket, followed by a WAL entry to
indicate that the split is complete for both the old and new buckets.

A split operation which requires overflow pages to complete will
write a WAL record for each new allocation of an overflow page. As
splitting involves multiple atomic actions, it is possible for the
system to crash while tuples are being moved from the pages of the old
bucket to the new bucket. After recovery, both the old and new
buckets will be marked with an incomplete-split flag. The reader
algorithm still works correctly, as it will scan both the old and new
buckets.
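
A minimal sketch of that reader-side rule, assuming the bucket-status
flag introduced by the concurrent hash index patch (the flag name
LH_BUCKET_BEING_POPULATED is my assumption; the actual patch may spell
the incomplete-split marker differently):

#include "postgres.h"
#include "access/hash.h"
#include "storage/bufmgr.h"

/*
 * Sketch only: should a scan of this (new) bucket also walk the old
 * bucket because the split that populates it never finished?
 */
static bool
scan_must_visit_old_bucket(Buffer bucket_buf)
{
    Page        page = BufferGetPage(bucket_buf);
    HashPageOpaque opaque = (HashPageOpaque) PageGetSpecialPointer(page);

    /*
     * If the split into this bucket is incomplete, some of its tuples
     * still live in the old bucket, so the reader has to scan both
     * buckets to avoid missing matches.
     */
    return (opaque->hasho_flag & LH_BUCKET_BEING_POPULATED) != 0;
}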

We finish the split at the next insert or split operation on the old
bucket. It could be done during searches, too, but it seems best not
to put any extra updates in what would otherwise be a read-only
operation (updating is not possible in hot standby mode anyway). It
would seem natural to complete the split in VACUUM, but since
splitting a bucket might require allocating a new page, it might fail
if you run out of disk space. That would be bad during VACUUM - the
reason for running VACUUM in the first place might be that you ran out
of disk space, and now VACUUM won't finish because you're out of disk
space. In contrast, an insertion can require enlarging the physical
file anyway.

Deletion of tuples from a bucket is performed for two reasons: to
remove dead tuples, and to remove tuples that were moved by a split.
A WAL entry is made for each bucket page from which tuples are
removed, followed by a WAL entry to clear the garbage flag if the
tuples being removed are the ones moved by a split. A separate WAL
entry is made for updating the metapage if the deletion is performed
by vacuum to remove dead tuples.
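
For illustration, the per-page deletion record could be logged roughly
as below; the record id XLOG_HASH_DELETE and the function name are
illustrative only, and the flag-clearing and metapage records
mentioned above would be separate, similar records.

#include "postgres.h"
#include "access/hash.h"
#include "access/xloginsert.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

#define XLOG_HASH_DELETE 0x30       /* illustrative record id */

static void
log_hash_delete_page(Relation rel, Buffer buf,
                     OffsetNumber *deletable, int ndeletable)
{
    if (RelationNeedsWAL(rel))
    {
        XLogRecPtr  recptr;

        XLogBeginInsert();
        XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
        /* the deleted offsets ride along with the block reference */
        XLogRegisterBufData(0, (char *) deletable,
                            ndeletable * sizeof(OffsetNumber));

        recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_DELETE);
        PageSetLSN(BufferGetPage(buf), recptr);
    }
}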

As deletion involves multiple atomic operations, it is quite possible
that the system crashes (a) after removing tuples from only some of
the bucket pages, (b) before clearing the garbage flag, or (c) before
updating the metapage. If the system crashes before completing (b),
the bucket will be cleaned again during the next vacuum or insert
after recovery, which can have some performance impact, but it will
work fine. If the system crashes before completing (c), there could
be some additional splits after recovery until the next vacuum updates
the metapage, but the other operations like insert, delete and scan
will work correctly. We could fix this by actually updating the
metapage while replaying the delete operation, but I am not sure it is
worth the complication.
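
As a sketch of the replay side (again with illustrative names), a
per-page delete record could be redone as below; extending this
routine to also adjust the metapage is the possible fix mentioned
above.

#include "postgres.h"
#include "access/hash.h"
#include "access/xlogutils.h"
#include "storage/bufmgr.h"

static void
hash_xlog_delete(XLogReaderState *record)
{
    XLogRecPtr  lsn = record->EndRecPtr;
    Buffer      buf;

    /* block 0: the bucket/overflow page the tuples were removed from */
    if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDED)
    {
        Page        page = BufferGetPage(buf);
        Size        len;
        OffsetNumber *deletable;

        deletable = (OffsetNumber *) XLogRecGetBlockData(record, 0, &len);
        PageIndexMultiDelete(page, deletable,
                             (int) (len / sizeof(OffsetNumber)));

        PageSetLSN(page, lsn);
        MarkBufferDirty(buf);
    }
    if (BufferIsValid(buf))
        UnlockReleaseBuffer(buf);
}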

The squeeze operation moves tuples from one of the bucket pages later
in the overflow chain to one of the pages earlier in the chain, and
writes a WAL record when either the page to which it is writing tuples
becomes full or the page from which it is removing tuples becomes
empty.

As the squeeze operation involves multiple atomic operations, it is
quite possible that the system crashes before the operation completes
for the entire bucket. After recovery, the operations will work
correctly, but the index will remain bloated, which can impact the
performance of read and insert operations until the next vacuum
squeezes the bucket completely.
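
The rule for when a squeeze step emits its WAL record can be sketched
as below; the function and buffer names are illustrative rather than
the patch's actual code.

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

/* 'wbuf'/'rbuf' are the write (destination) and read (source) ends of
   the overflow chain being squeezed. */
static bool
squeeze_should_log(Buffer wbuf, Buffer rbuf, Size next_itemsz)
{
    Page        wpage = BufferGetPage(wbuf);
    Page        rpage = BufferGetPage(rbuf);

    if (PageGetFreeSpace(wpage) < next_itemsz)
        return true;            /* destination page is full */

    if (PageGetMaxOffsetNumber(rpage) == 0)
        return true;            /* source page has been drained */

    return false;
}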

=====================================

One of the challenges in writing this patch was that the current code
was not written with the mindset that we need to write WAL for the
different operations. A typical example is _hash_addovflpage(), where
pages are modified across different function calls even though all the
modifications need to be done atomically, so I had to refactor some
code so that the operations can be logged sensibly.
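
The shape the refactoring aims for looks roughly like the sketch
below: every page touched by one logical action (here, adding an
overflow page) is modified inside a single critical section and
covered by one WAL record. The function name and record id are
illustrative.

#include "postgres.h"
#include "access/hash.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

#define XLOG_HASH_ADD_OVFL_PAGE 0x20    /* illustrative record id */

static void
add_ovfl_page_atomically(Relation rel, Buffer ovflbuf, Buffer prevbuf,
                         Buffer metabuf)
{
    START_CRIT_SECTION();

    /* ... modify ovflbuf, prevbuf and metabuf in shared buffers here ... */

    MarkBufferDirty(ovflbuf);
    MarkBufferDirty(prevbuf);
    MarkBufferDirty(metabuf);

    if (RelationNeedsWAL(rel))
    {
        XLogRecPtr  recptr;

        XLogBeginInsert();
        /* block 0 is the freshly initialized overflow page */
        XLogRegisterBuffer(0, ovflbuf, REGBUF_WILL_INIT);
        XLogRegisterBuffer(1, prevbuf, REGBUF_STANDARD);
        XLogRegisterBuffer(2, metabuf, REGBUF_STANDARD);

        recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE);
        PageSetLSN(BufferGetPage(ovflbuf), recptr);
        PageSetLSN(BufferGetPage(prevbuf), recptr);
        PageSetLSN(BufferGetPage(metabuf), recptr);
    }

    END_CRIT_SECTION();
}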

Thanks to Ashutosh Sharma, who has helped me complete the patch by
writing the WAL code for the create index and delete operations and by
doing detailed testing of the patch using the pg_filedump tool. I
think it is better if he explains himself the testing he has done to
ensure the correctness of the patch.

Thoughts?

Note - To use this patch, first apply the latest version of the
concurrent hash index patch [2].

[1] - https://commitfest.postgresql.org/10/647/
[2] - https://www.postgresql.org/message-id/CAA4eK1LkQ_Udism-Z2Dq6cUvjH3dB5FNFNnEzZBPsRjw0haFqA@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
wal_hash_index_v1.patch application/octet-stream 120.4 KB
