Re: Write Ahead Logging for Hash Indexes

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Write Ahead Logging for Hash Indexes
Date: 2017-03-13 12:56:14
Message-ID: CAA4eK1+k5wR4-kAjPqLoKemuHayQd6RkQQT9gheTfpn+72o1UA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 9, 2017 at 3:11 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Mar 7, 2017 at 6:41 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>>> Great, thanks. 0001 looks good to me now, so committed.
>>>
>>> Committed 0002.
>>
>> Here are some initial review thoughts on 0003 based on a first read-through.
>
> More thoughts on the main patch:
>
> The text you've added to the hash index README throws around the term
> "single WAL entry" pretty freely. For example: "An insertion that
> causes an addition of an overflow page is logged as a single WAL entry
> preceded by a WAL entry for new overflow page required to insert a
> tuple." When you say a single WAL entry, you make it sound like
> there's only one, but because it's preceded by another WAL entry,
> there are actually two. I would rephrase this to just avoid using the
> word "single" this way. For example: "If an insertion causes the
> addition of an overflow page, there will be one WAL entry for the new
> overflow page and a second entry for the insert itself."
>

Okay, changed as per suggestion. I have also removed the second part
of the above comment, which stated that a newly allocated overflow page
can be used by concurrent transactions; that won't be true after fixing
the release/reacquire logic in _hash_addovflpage. I have also rephrased
another similar usage of "single WAL entry".

> +mode anyway). It would seem natural to complete the split in VACUUM, but since
> +splitting a bucket might require allocating a new page, it might fail if you
> +run out of disk space. That would be bad during VACUUM - the reason for
> +running VACUUM in the first place might be that you run out of disk space,
> +and now VACUUM won't finish because you're out of disk space. In contrast,
> +an insertion can require enlarging the physical file anyway.
>
> But VACUUM actually does try to complete the splits, so this is wrong, I think.
>

As discussed up thread, there is no need to change.

> +Squeeze operation moves tuples from one of the buckets later in the chain to
>
> A squeeze operation
>
> +As Squeeze operation involves writing multiple atomic operations, it is
>
> As a squeeze operation involves multiple atomic operations
>

Changed as per suggestion.

> + if (!xlrec.is_primary_bucket_page)
> + XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
>
> So, in hashbucketcleanup, the page is registered and therefore an FPI
> will be taken if this is the first reference to this page since the
> checkpoint, but hash_xlog_delete won't restore the FPI. I think that
> exposes us to a torn-page hazard.
> _hash_freeovflpage/hash_xlog_squeeze_page seems to have the same
> disease, as does _hash_squeezebucket/hash_xlog_move_page_contents.
> Not sure exactly what the fix should be.
>

I have used REGBUF_NO_IMAGE while registering the buffer to avoid this
hazard, and during replay I use XLogReadBufferForRedoExtended() to read
the block and take a cleanup lock on it.
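
To be concrete, the approach is roughly the following (just a sketch of
the registration and replay sides, not the exact patch hunks; block 0
is the primary bucket page, registered only for lock-ordering purposes):

    /* hashbucketcleanup: register the bucket page without forcing an FPI */
    if (!xlrec.is_primary_bucket_page)
        XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD | REGBUF_NO_IMAGE);

    /* hash_xlog_delete: the bucket page is read only to take a cleanup
     * lock on it, so the return value is not interesting */
    (void) XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true,
                                         &bucketbuf);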

> + /*
> + * As bitmap page doesn't have standard page layout, so this will
> + * allow us to log the data.
> + */
> + XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD);
> + XLogRegisterBufData(2, (char *) &bitmap_page_bit, sizeof(uint32));
>
> If the page doesn't have a standard page layout, why are we passing
> REGBUF_STANDARD? But I think you've hacked things so that bitmap
> pages DO have a standard page layout, per the change to
> _hash_initbitmapbuffer, in which case the comment seems wrong.
>

Yes, the comment is obsolete and I have removed it. We need the
standard page layout for the bitmap page so that full-page writes won't
exclude the bitmap page data.
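
The relevant part of _hash_initbitmapbuffer now looks roughly like this
(a sketch):

    /* set all the bits to "free" ... */
    freep = HashPageGetBitmap(pg);
    MemSet(freep, 0xFF, bmsize);

    /*
     * ... and pull pd_lower up past the bitmap words, so the page follows
     * the standard layout and a full-page image can't treat the bitmap
     * data as the hole between pd_lower and pd_upper.
     */
    ((PageHeader) pg)->pd_lower = ((char *) freep + bmsize) - (char *) pg;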

>
> + recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE);
> +
> + PageSetLSN(BufferGetPage(ovflbuf), recptr);
> + PageSetLSN(BufferGetPage(buf), recptr);
> +
> + if (BufferIsValid(mapbuf))
> + PageSetLSN(BufferGetPage(mapbuf), recptr);
> +
> + if (BufferIsValid(newmapbuf))
> + PageSetLSN(BufferGetPage(newmapbuf), recptr);
>
> I think you forgot to call PageSetLSN() on metabuf. (There are 5
> block references in the write-ahead log record but only 4 calls to
> PageSetLSN ... surely that can't be right!)
>

Right, fixed.
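
That is, the record now also stamps the metapage (sketch of the added
call):

    PageSetLSN(BufferGetPage(metabuf), recptr);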

> + /*
> + * We need to release the locks once the prev pointer of overflow bucket
> + * and next of left bucket are set, otherwise concurrent read might skip
> + * the bucket.
> + */
> + if (BufferIsValid(leftbuf))
> + UnlockReleaseBuffer(leftbuf);
> + UnlockReleaseBuffer(ovflbuf);
>
> I don't believe the comment. Postponing the lock release wouldn't
> cause the concurrent read to skip the bucket. It might cause it to be
> delayed unnecessarily, but the following comment already addresses
> that point. I think you can just nuke this comment.
>

Okay, removed the comment.

> + xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf) ? true : false;
> + xlrec.is_prev_bucket_same_wrt = (wbuf == prevbuf) ? true : false;
>
> Maybe just xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf); and similar?
>

Changed as per suggestion.

> + /*
> + * We release locks on writebuf and bucketbuf at end of replay operation
> + * to ensure that we hold lock on primary bucket page till end of
> + * operation. We can optimize by releasing the lock on write buffer as
> + * soon as the operation for same is complete, if it is not same as
> + * primary bucket page, but that doesn't seem to be worth complicating the
> + * code.
> + */
> + if (BufferIsValid(writebuf))
> + UnlockReleaseBuffer(writebuf);
> +
> + if (BufferIsValid(bucketbuf))
> + UnlockReleaseBuffer(bucketbuf);
>
> Here in hash_xlog_squeeze_page(), you wait until the very end to
> release these locks, but in e.g. hash_xlog_addovflpage you release
> them considerably earlier for reasons that seem like they would also
> be valid here. I think you could move this back to before the updates
> to the metapage and bitmap page.
>

Sure, I don't see any problem with releasing the locks early, so
changed as per suggestion.

> + /*
> + * Note: in normal operation, we'd update the metapage while still
> + * holding lock on the page we inserted into. But during replay it's
> + * not necessary to hold that lock, since no other index updates can
> + * be happening concurrently.
> + */
>
> This comment in hash_xlog_init_bitmap_page seems to be off-base,
> because we didn't just insert into a page. I'm not disputing that
> it's safe, though: not only can nobody else be modifying it because
> we're in recovery, but also nobody else can see it yet; the creating
> transaction hasn't yet committed.
>

Fixed the comment and mentioned the second reason as suggested by you
(that sounds appropriate here).

> + /*
> + * Note that we still update the page even if it was restored from a full
> + * page image, because the bucket flag is not included in the image.
> + */
>
> Really? Why not? (Because it's part of the special space?)
>

Added a comment as discussed above.

> Doesn't hash_xlog_split_allocpage need to acquire cleanup locks to
> guard against concurrent scans?

Changed and made it similar to normal operations by taking a cleanup lock.
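
So the old bucket page is now read with a cleanup lock during replay,
the same way hash_xlog_delete handles the primary bucket page (sketch):

    if (XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &oldbuf)
        == BLK_NEEDS_REDO)
    {
        /* mark the old bucket as being split, stamp its LSN, ... */
    }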

> + /*
> + * There is no harm in releasing the lock on old bucket before new bucket
> + * during replay as no other bucket splits can be happening concurrently.
> + */
> + if (BufferIsValid(oldbuf))
> + UnlockReleaseBuffer(oldbuf);
>
> And I'm not sure this is safe either. The issue isn't that there
> might be another bucket split happening concurrently, but that there
> might be scans that get confused by seeing the bucket split flag set
> before the new bucket is created or the metapage updated. Seems like
> it would be safer to keep all the locks for this one.
>

Changed as discussed.

> + * Initialise the new bucket page, here we can't complete zeroed the page
>
> I think maybe a good way to write these would be "we can't simply zero
> the page". And Initialise -> Initialize.
>

I have changed the comment; see if that looks okay to you.

> /*
> + * Change the shared buffer state in critical section,
> + * otherwise any error could make it unrecoverable after
> + * recovery.
> + */
> + START_CRIT_SECTION();
> +
> + /*
> * Insert tuple on new page, using _hash_pgaddtup to ensure
> * correct ordering by hashkey. This is a tad inefficient
> * since we may have to shuffle itempointers repeatedly.
> * Possible future improvement: accumulate all the items for
> * the new page and qsort them before insertion.
> */
> (void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
>
> + END_CRIT_SECTION();
>
> No way. You have to start the critical section before making any page
> modifications and keep it alive until all changes have been logged.
>

Changed as discussed up thread.
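
For reference, the shape it takes now is the usual WAL-logging protocol
(a generic sketch, not the exact split code; the record type and
registrations below are only illustrative):

    START_CRIT_SECTION();

    /* all page modifications happen inside the critical section */
    (void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
    MarkBufferDirty(nbuf);

    if (RelationNeedsWAL(rel))
    {
        XLogRecPtr  recptr;

        XLogBeginInsert();
        /* register whatever buffers and data the record needs */
        XLogRegisterBuffer(0, nbuf, REGBUF_STANDARD);

        recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);
        PageSetLSN(BufferGetPage(nbuf), recptr);
    }

    END_CRIT_SECTION();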

> hash_redo's mappings of record types to function names is a little
> less regular than seems desirable. Generally, the function name is
> the lower-case version of the record type with "hash" and "xlog"
> switched. But XLOG_HASH_ADD_OVFL_PAGE and
> XLOG_HASH_SPLIT_ALLOCATE_PAGE break the pattern.
>

Okay, changed as per suggestion, and I went ahead and changed the
names of the corresponding xlog record structures as well so that they
also match the function name pattern.
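
So hash_redo() now dispatches uniformly, roughly:

    switch (info)
    {
        case XLOG_HASH_ADD_OVFL_PAGE:
            hash_xlog_add_ovfl_page(record);
            break;
        case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
            hash_xlog_split_allocate_page(record);
            break;
            /* ... and so on for the other record types ... */
    }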

> It seems like a good test to do with this patch would be to set up a
> pgbench test on the master with a hash index replacing the usual btree
> index. Then, set up a standby and run a read-only pgbench on the
> standby while a read-write pgbench test runs on the master. Maybe
> you've already tried something like that?
>

I also think so, and apart from that I think it makes sense to perform
recovery tests with Jeff Janes's tool and probably tests with
wal_consistency_checking. These tests have already been running for the
past seven hours or so, and I will keep them running for the whole
night to see if there is any discrepancy. Thanks to Kuntal and Ashutosh
for helping me run these stress tests.

Note that I have fixed all the other comments raised by you in another
e-mail, including adding the test for "snapshot too old" (I have
prepared a separate patch for the test. I have added just one test to
keep the timing short, because each test takes a minimum of 6s to
complete; that holds for the existing "snapshot too old" tests and any
other such test). Let me know if I have missed anything.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
wal_hash_index_v10.patch application/octet-stream 91.3 KB
wal_hash_index_test_v1.patch application/octet-stream 2.7 KB
