Re: Write Ahead Logging for Hash Indexes

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Write Ahead Logging for Hash Indexes
Date: 2017-03-08 12:45:25
Message-ID: CAA4eK1KibVzgVETVay0+siVEgzaXnP5R21BdWiK9kg9wx2E40Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Mar 8, 2017 at 5:11 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> Here are some initial review thoughts on 0003 based on a first read-through.
>
>
> + /*
> + * we need to release and reacquire the lock on overflow buffer to ensure
> + * that standby shouldn't see an intermediate state of it.
> + */
> + LockBuffer(ovflbuf, BUFFER_LOCK_UNLOCK);
> + LockBuffer(ovflbuf, BUFFER_LOCK_EXCLUSIVE);
>
> I still think this is a bad idea. Releasing and reacquiring the lock
> on the master doesn't prevent the standby from seeing intermediate
> states; the comment, if I understand correctly, is just plain wrong.
>

I can remove this release and reacquire stuff, but first, let me try
to explain what intermediate state I am talking here. If we don't
have a release/reacquire lock here, standby could see an empty
overflow page at that location whereas master will be able to see the
new page only after the insertion of tuple/'s in that bucket is
finished. I understand that there is not much functionality impact of
this change but still wanted to make you understand why I have added
this code. I am okay with removing this code if you still feel this
is just a marginal case and we should not bother about it. Let me know
what you think?

>
> + * Initialise the freed overflow page, here we can't complete zeroed the
>
> Don't use British spelling, and don't use a comma to join two
> sentences. Also "here we can't complete zeroed" doesn't seem right; I
> don't know what that means.

What the comment means to say is that we can't initialize page as zero
(something like below)
MemSet(ovflpage, 0, BufferGetPageSize(ovflbuf));

typo in sentence
/complete/completely

>
> + * To replay meta page changes, we can log the entire metapage which
> + * doesn't seem advisable considering size of hash metapage or we can
> + * log the individual updated values which seems doable, but we prefer
> + * to perform exact operations on metapage during replay as are done
> + * during actual operation. That looks straight forward and has an
> + * advantage of using lesser space in WAL.
>
> This sounds like it is describing two alternatives that were not
> selected and then the one that was selected, but I don't understand
> the difference between the second non-selected alternative and what
> was actually picked.
>

Actually, the second alternative was picked.

> I'd just write "We need only log the one field
> from the metapage that has changed." or maybe drop the comment
> altogether.
>

I have added the comment so that others can understand why we don't
want to log the entire meta page (I think in some other indexes, we
always log the entire meta page, so such a question can arise).
However, after reading your comment, I think it is overkill and I
prefer to drop the entire comment unless you feel differently.

>
> + /*
> + * We need to release and if required reacquire the lock on
> + * rbuf to ensure that standby shouldn't see an intermediate
> + * state of it. If we don't release the lock, after replay of
> + * XLOG_HASH_MOVE_PAGE_CONTENTS on standby users will
> be able to
> + * view the results of partial deletion on rblkno.
> + */
> + LockBuffer(rbuf, BUFFER_LOCK_UNLOCK);
>
> If you DO release the lock, this will STILL be true, because what
> matters on the standby is what the redo code does.
>

That's right, what I intend to do here is to release the lock in
master where it will be released in standby. In this case, we can't
ensure what user can see in master is same as in standby after this
WAL record is replayed, because in master we have exclusive lock on
write buf, so no one can read the contents of read buf (the location
of read buf will be after write buf) whereas, in standby, it will be
possible to read the contents of the read buf. I think this is not a
correctness issue, so we might want to leave as it is, what do you
say?

>
> + * We can log the exact changes made to meta page, however as no
> + * concurrent operation could see the index during the replay of this
> + * record, we can perform the operations during replay as they are
> + * done here.
>
> Don't use a comma to join two sentences. Also, I don't understand
> anything up to the "however".
>

I mean the below changes made to meta buf (refer existing code):
metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;

metap->hashm_nmaps++;

I think it will be clear if you look both the DO and REDO operation of
XLOG_HASH_INIT_BITMAP_PAGE. Let me know if you still think that the
comment needs to be changed?

>
> + * Set pd_lower just past the end of the metadata. This is to log
> + * full_page_image of metapage in xloginsert.c.
>
> Why the underscores?
>
> + * won't be a leak on standby, as next split will consume this space.
>

No specific reason, just trying to resemble full_page_writes, but can
change if you feel it doesn't make much sense.

> + * won't be a leak on standby, as next split will consume this space.

> The master and the standby had better be the same,

That's right.

> so I don't know
> what it means to talk about a leak on the standby.
>

I just want to say that even if we fail after allocation of buckets
and before the split operation is completed, the newly allocated
buckets will neither be leaked on master nor on standby. Do you think
we should add a comment for same or is it too obvious? How about
changing it to something like:

"Even if we fail after this operation, it won't leak buckets, as next
split will consume this space. In any case, even without failure, we
don't use all the space in one split operation."

> + * Change the shared buffer state in critical section,
> + * otherwise any error could make it unrecoverable after
> + * recovery.
>
> Uh, what? We only recover things during recovery. Nothing gets
> recovered after recovery.
>

Right, how about "Change the shared buffer state in a critical
section, otherwise any error could make the page unrecoverable."

> Maybe this could get some tests, via the isolation tester, for things
> like snapshot too old?
>

Okay, I can try, but note that currently there is no test related to
"snapshot too old" for any other indexes.

> I haven't gone through all of this in detail yet; I think most of it
> is right on target. But there is a lot more to look at yet.
>

Thanks for your valuable comments.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2017-03-08 13:07:25 Re: Foreign Join pushdowns not working properly for outer joins
Previous Message Masahiko Sawada 2017-03-08 12:05:15 Re: Cannot shutdown subscriber after DROP SUBSCRIPTION