Re: Write Ahead Logging for Hash Indexes

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Write Ahead Logging for Hash Indexes
Date: 2017-02-16 14:46:39
Message-ID: CAA4eK1JCk-abtsVMMP8xqGZktUH73ZmLaZ6b_+-oCRtRkdqPrQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Feb 16, 2017 at 7:15 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Feb 13, 2017 at 10:22 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> As discussed, attached are refactoring patches and a patch to enable
>> WAL for the hash index on top of them.
>
> Thanks. I think that the refactoring patches shouldn't add
> START_CRIT_SECTION() and END_CRIT_SECTION() calls; let's leave that
> for the final patch. Nor should their comments reference the idea of
> writing WAL; that should also be for the final patch.
>

Okay, changed as per suggestion.

> PageGetFreeSpaceForMulTups doesn't seem like a great name.
> PageGetFreeSpaceForMultipleTuples?

Okay, changed as per suggestion.

> Or maybe just use
> PageGetExactFreeSpace and then do the account in the caller. I'm not
> sure it's really worth having a function just to subtract a multiple
> of sizeof(ItemIdData), and it would actually be more efficient to have
> the caller take care of this, since you wouldn't need to keep
> recalculating the value for every iteration of the loop.
>

I have tried this one, but I think even if track outside we need
something like PageGetFreeSpaceForMultipleTuples for the cases when we
have to try next write page/'s where data (set of index tuples) can be
accommodated.

> I think we ought to just rip (as a preliminary patch) out the support
> for freeing an overflow page in the middle of a bucket chain. That
> code is impossible to hit, I believe, and instead of complicating the
> WAL machinery to cater to a nonexistent case, maybe we ought to just
> get rid of it.
>

I also could not think of a case where we need to care about the page
in the middle of bucket chain as per it's current usage. In
particular, are you worried about the code related to nextblock number
in _hash_freeovflpage()? Surely, we can remove it, but what
complexity are you referring to here? There is some additional
book-keeping for primary bucket page, but there is nothing much about
maintaining a backward link. One more thing to note is that this is
an exposed API, so to avoid getting it used in some wrong way, we
might want to make it static.

> + /*
> + * We need to release and if required
> reacquire the lock on
> + * rbuf to ensure that standby
> shouldn't see an intermediate
> + * state of it.
> + */
>
> How does releasing and reacquiring the lock on the master affect
> whether the standby can see an intermediate state?

I am talking about the intermediate state of standby with respect to
the master. We will release the lock on standby after replaying the
WAL for moving tuples from read page to write page whereas master will
hold that for the much longer period (till we move all the tuples from
read page and free that page). I understand that there is no
correctness issue here, but was trying to make operations on master
and standby similar and another thing is that it will help us in
obeying the coding rule of releasing the lock on buffers after writing
WAL record. I see no harm in maintaining the existing coding pattern,
however, if you think that is not important in this case, then I can
change the code to avoid releasing the lock on read page.

> That sounds
> totally wrong to me (and also doesn't belong in a refactoring patch
> anyway since we're not emitting any WAL here).
>

Agreed that even if we want to do such a change, then also it should
be part of WAL patch. So for now, I have reverted that change.

> - Assert(PageIsNew(page));
>
> This worries me a bit. What's going on here?
>

This is related to below change.

+ /*
+ * Initialise the freed overflow page, here we can't complete zeroed the
+ *
page as WAL replay routines expect pages to be initialized. See
+ * explanation of RBM_NORMAL
mode atop XLogReadBufferExtended.
+ */
+ _hash_pageinit(ovflpage, BufferGetPageSize
(ovflbuf));

Basically, we need this routine to initialize freed overflow pages.
Also, if you see the similar function in btree (_bt_pageinit(Page
page, Size size)), it doesn't have any such Assertion.

Attached are refactoring patches. WAL patch needs some changes based
on above comments, so will post it later.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
0001-Expose-a-new-API-_hash_pgaddmultitup.patch application/octet-stream 15.5 KB
0002-Expose-an-API-hashinitbitmapbuffer.patch application/octet-stream 2.4 KB
0003-Restructure-_hash_addovflpage.patch application/octet-stream 9.3 KB
0004-Restructure-split-bucket-code.patch application/octet-stream 9.2 KB
0005-Restructure-hash-index-creation.patch application/octet-stream 14.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thom Brown 2017-02-16 14:54:00 Partitioning vs ON CONFLICT
Previous Message Robert Haas 2017-02-16 14:45:37 Re: Parallel Append implementation