Re: Write Ahead Logging for Hash Indexes

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Write Ahead Logging for Hash Indexes
Date: 2017-03-07 23:41:05
Message-ID: CA+Tgmob=y1moWXXVAcWb1=pJJer8=ZvEEhojY=Y246DFoo=eUA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Mar 7, 2017 at 5:08 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Mar 1, 2017 at 4:18 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Tue, Feb 28, 2017 at 7:31 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>> Yeah, actually those were added later in Enable-WAL-for-Hash* patch,
>>> but I think as this patch is standalone, so we should not remove it
>>> from their existing usage, I have added those back and rebased the
>>> remaining patches.
>>
>> Great, thanks. 0001 looks good to me now, so committed.
>
> Committed 0002.

Here are some initial review thoughts on 0003 based on a first read-through.

+ affected by this setting. Example include system catalogs. For

Not grammatical. Perhaps "affected by this setting, such as system catalogs".

- mark meta page dirty and release buffer content lock and pin
+ mark meta page dirty and write WAL for update of metapage
+ release buffer content lock and pin

Was the change in the indentation level intentional?

+accomodated in the initial set of buckets and it is not unusual to have large

Spelling.

+As deletion involves multiple atomic operations, it is quite possible that
+system crashes after (a) removing tuples from some of the bucket pages
+(b) before clearing the garbage flag (c) before updating the metapage. If the

Need two commas and an "or": (a) removing tuples from some of the
bucket pages, (b) before clearing the garbage flag, or (c)

+quite possible, that system crashes before completing the operation on

No comma. Also, "that the system crashes".

+the index will remain bloated and can impact performance of read and

and this can impact

+ /*
+ * we need to release and reacquire the lock on overflow buffer to ensure
+ * that standby shouldn't see an intermediate state of it.
+ */
+ LockBuffer(ovflbuf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(ovflbuf, BUFFER_LOCK_EXCLUSIVE);

I still think this is a bad idea. Releasing and reacquiring the lock
on the master doesn't prevent the standby from seeing intermediate
states; the comment, if I understand correctly, is just plain wrong.
What it does do is make the code on the master more complicated, since
now the page returned by _hash_addovflpage might theoretically fill up
while you are busy releasing and reacquiring the lock.

+ * Add the tuples (itups) to wbuf in this function, we could do that in the
+ * caller as well. The advantage of doing it here is we can easily write

Change comma to period. Then: "We could easily do that in the caller,
but the advantage..."

+ * Initialise the freed overflow page, here we can't complete zeroed the

Don't use British spelling, and don't use a comma to join two
sentences. Also "here we can't complete zeroed" doesn't seem right; I
don't know what that means.

+ * To replay meta page changes, we can log the entire metapage which
+ * doesn't seem advisable considering size of hash metapage or we can
+ * log the individual updated values which seems doable, but we prefer
+ * to perform exact operations on metapage during replay as are done
+ * during actual operation. That looks straight forward and has an
+ * advantage of using lesser space in WAL.

This sounds like it is describing two alternatives that were not
selected and then the one that was selected, but I don't understand
the difference between the second non-selected alternative and what
was actually picked. I'd just write "We need only log the one field
from the metapage that has changed." or maybe drop the comment
altogether.

+ XLogEnsureRecordSpace(0, XLR_NORMAL_RDATAS + nitups);

The use of XLR_NORMAL_RDATAS here seems wrong. Presumably you should
pass the number of rdatas you will actually need, which is some
constant plus nitups. I think you should just write that constant
here directly, whether or not it happens to be equal to
XLR_NORMAL_RDATAS.

+ /*
+ * We need to release and if required reacquire the lock on
+ * rbuf to ensure that standby shouldn't see an intermediate
+ * state of it. If we don't release the lock, after replay of
+ * XLOG_HASH_MOVE_PAGE_CONTENTS on standby users will
be able to
+ * view the results of partial deletion on rblkno.
+ */
+ LockBuffer(rbuf, BUFFER_LOCK_UNLOCK);

If you DO release the lock, this will STILL be true, because what
matters on the standby is what the redo code does. This code doesn't
even execute on the standby, so how could it possibly affect what
intermediate states are visible there?

+ /*
+ * Here, we could have copied the entire metapage as well to WAL
+ * record and restore it as it is during replay. However the size of
+ * metapage is not small (more than 600 bytes), so recording just the
+ * information required to construct metapage.
+ */

Again, this seems sorta obvious. You've got a bunch of these comments
that say similar things. I'd either make them shorter or just remove
them.

+ * We can log the exact changes made to meta page, however as no
+ * concurrent operation could see the index during the replay of this
+ * record, we can perform the operations during replay as they are
+ * done here.

Don't use a comma to join two sentences. Also, I don't understand
anything up to the "however". I think you mean something like "We
don't need hash index creation to be atomic from the point of view of
replay, because until the creating transaction commits it's not
visible to anything on the standby anyway."

+ * Set pd_lower just past the end of the metadata. This is to log
+ * full_page_image of metapage in xloginsert.c.

Why the underscores?

+ * won't be a leak on standby, as next split will consume this space.

The master and the standby had better be the same, so I don't know
what it means to talk about a leak on the standby.

+ * Change the shared buffer state in critical section,
+ * otherwise any error could make it unrecoverable after
+ * recovery.

Uh, what? We only recover things during recovery. Nothing gets
recovered after recovery.

Maybe this could get some tests, via the isolation tester, for things
like snapshot too old?

I haven't gone through all of this in detail yet; I think most of it
is right on target. But there is a lot more to look at yet.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2017-03-07 23:43:10 Re: Bizarre choice of case for RELKIND_PARTITIONED_TABLE
Previous Message Tom Lane 2017-03-07 23:34:15 Re: cast result of copyNode()