Re: Page Scan Mode in Hash Index

From: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Scan Mode in Hash Index
Date: 2017-09-20 09:37:36
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

Thanks for all your review comments. Please find my comments in-line.

On Tue, Sep 19, 2017 at 9:49 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Aug 24, 2017 at 11:26 AM, Jesper Pedersen
> <jesper(dot)pedersen(at)redhat(dot)com> wrote:
>> Based on the feedback in this thread, I have moved the patch to "Ready for
>> Committer".
> Reviewing 0001:
> _hash_readpage gets the page LSN to see if we can apply LP_DEAD hints,
> but if the table is unlogged or temporary, the LSN will never change,
> so the test in _hash_kill_items will always think that it's OK to
> apply the hints. (This seems like it might be a pretty serious
> problem, because I'm not sure what would be a viable workaround.)

Amit has already replied to this query up-thread.

> The logic that tries to ensure that so->currPos.{buf,currPage,lsn} get
> updated is not, to my eyes, obviously correct. Generally, the logic
> for this stuff seems unnaturally spread out to me. For example,
> _hash_next() updates currPos.buf, but leaves it to _hash_readpage to
> set currPage and lsn. That function also sets all three fields when
> it advances to the next page by calling _hash_readnext(), but when it
> tries to advance to the next page and doesn't find one it sets buf but
> not currPage or lsn. It seems to me that this logic should be more
> centralized somehow. Can we arrange things so that at least buf,
> currPage, and lsn, and maybe also nextPage and prevPage, get updated
> at the same time and as soon after reading the buffer as possible?

Okay, I have tried to update currPos.{buf, currPage, lsn} in
_hash_readpage() at the same time. Please have a look into the
attached 0001*.patch.

When _hash_readpage() doesn't find any qualifying tuples i.e. when
_hash_readnext() returns Invalid buffer, we just update prevPage,
nextPage and buf in
currPos (not currPage or lsn) as currPage and lsn should point to last
page in the hash bucket so that we can mark the killed items as dead
at the end of scan (with the help of _hash_kill_items). Hence, we keep
the currpage and lsn as it is if no more valid hash pages are found.

> It would be bad if a primary bucket page's hasho_prevblkno field got
> copied into so->currPos.prevpage, because the value that appears for
> the primary bucket is not a real block number. But _hash_readpage
> seems like it can bring about this exact situation, because of this
> code:
> + if (!BlockNumberIsValid(opaque->hasho_nextblkno))
> + prev_blkno = opaque->hasho_prevblkno;
> ...
> + so->currPos.prevPage = prev_blkno;
> If we're reading the primary bucket page and there are no overflow
> pages, hasho_nextblkno will not be valid and hasho_prevblkno won't be
> a real block number.

Fixed. Thanks for putting that point.

> Incidentally, the "if" statement in the above block of code is
> probably not saving anything; I would suggest for clarity that you do
> the assignment unconditionally (but this is just a matter of style, so
> I don't feel super-strongly about it).
> + return (so->currPos.firstItem <= so->currPos.lastItem);
> Can this ever return false? It looks to me like we should never reach
> this code unless that condition holds, and that it would be a bug if
> we did. If that's correct, maybe this should be an Assert() and the
> return statement should just return true.

No, it will never return FALSE. I have changed it to Assert statement.

> + buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
> +
> + /* It might not exist anymore; in which case we can't hint it. */
> + if (!BufferIsValid(buf))
> + return;
> This is dead code, because _hash_getbuf always returns a valid buffer.
> If there's actually a risk of the buffer disappearing, then some other
> handling is needed for this case. But I suspect that, since a scan
> always holds a pin on the primary bucket, there's actually no way for
> this to happen and this is just dead code.

Removed the redundant code.

> The comment in hashgetbitmap claims that _hash_first() or _hash_next()
> never returns dead tuples. If that were true, it would be a bug,
> because then scans started during recovery would return wrong answers.
> A more accurate statement would be something like: _hash_first and
> _hash_next handle eliminate dead index entries whenever
> scan->ignored_killed_tuples is true. Therefore, there's nothing to do
> here except add the results to the TIDBitmap.


> _hash_readpage contains unnecessary "continue" statements inside the
> loops. The reason that they're unnecessary is that there's no code
> below that in the loop anyway, so the loop is already going to go back
> around to the top. Whether to change this is a matter of style, so I
> won't complain too much if you want to leave it the way it is, but
> personally I'd favor removing them.

Ohh no, that's a silly mistake. I have corrected it.

With Regards,
Ashutosh Sharma

Attachment Content-Type Size
0001-Rewrite-hash-index-scan-to-work-page-at-a-time_v15.patch text/x-patch 31.8 KB
0002-Remove-redundant-hash-function-_hash_step-and-do-som.patch text/x-patch 8.5 KB
0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index_v7.patch text/x-patch 7.8 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Andrey Borodin 2017-09-20 09:43:18 Re: Allow GiST opcalsses without compress\decompres functions
Previous Message Rushabh Lathia 2017-09-20 09:32:46 Re: Parallel tuplesort (for parallel B-Tree index creation)