Re: Parallel bitmap heap scan

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel bitmap heap scan
Date: 2017-01-09 07:35:24
Message-ID: CAFiTN-umB2eXQhzy1KqED=W+YLL=H8Sr+2FnJe24sNgEqqJqPw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jan 6, 2017 at 10:47 AM, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
> This looks good now.

Thanks..
>
>
> Further points below ....

Thanks for the review.

> ===== nodeBItmapHeapscan.c ======
>
>
> In BitmapHeapNext(), the following code is relevant only for tbm
> returned from MultiExecProcNode().
> if (!tbm || !IsA(tbm, TIDBitmap))
> elog(ERROR, "unrecognized result from subplan");

Fixed

> --------------
>
> BitmapHeapScanState->is_leader field looks unnecessary. Instead, a
> local variable is_leader in BitmapHeapNext() will solve the purpose.
> This is because is_leader is used only in BitmapHeapNext().

Fixed
>
> --------------
>
> In BitmapHeapNext(), just before tbm_restore_shared_members() is
> called, we create tbm using tbm_create(). I think tbm_create() does
> not make sense for shared tbm. Whatever fields are required, will be
> restored in tbm_restore_shared_members(). The other fields which do
> not make sense in a restored tbm are not required to be initialized
> using tbm_create(). So I think tbm_restore_shared_members() itself can
> call makeNode(TIDBitmap). (Also it is not required to initialize
> tbm->allocator; see note below in tidbitmap.c section). So
> tbm_restore_shared_members() itself can call tbm_set_parallel().
> Looking at all this, it looks better to have the function name changed
> to tbm_attach(parallel_tbm) or tbm_restore(parallel_tbm) rather than
> tbm_restore_shared_members(). The function header anyways (rightly)
> says : Attach worker to shared TID bitmap.

Fixed

>
> -------------
>
> From what I understand, the leader worker does not have to create its
> iterators before waking up the other workers, as long as it makes sure
> it copies tbm fields into shared memory before waking workers. But in
> the patch, tbm_begin_iterate() is called *before* the
> ConditionVariableBroadcast() is called. So I feel, we can shift the
> code inside the "if (node->is_leader)" to a place inside the "if
> (pbminfo == NULL || pbms_is_leader(pbminfo))" condition block, just
> after MultiExecProcNode() call. (And we won't even need is_leader
> local variable as well). This way now the other workers will start
> working sooner.

Correct, Fixed.

>
>
>
> ====== tidbitmap.c =======
>
>
> The new #include's are not in alphabetical order.

Done..
>
> --------------
>
> ParallelTIDBitmap.inited is unused, and I believe, not required.

Fixed
>
> --------------
>
> For leader worker, the new TIDBitmap fields added for parallel bitmap
> *are* valid while the tid is being built. So the below comment should
> be shifted accordingly :
> /* these are valid when iterating is true: */
> Better still, the shared tbm-related fields can be kept in the end,
> and a comment should be added that these are for shared tbm.
>

Done
> --------------
>
> It seems, the comment below the last ParallelTIDBitmap field is not relevant :
> /* table to dsa_pointer's array. */

Fixed..
>
> --------------
>
> tbm->allocator field does not seem to be required. A new allocator can
> be just palloc'ed in tbm_create_pagetable(), and passed to
> pagetable_create(). SH_CREATE() stores this allocator in the
> SH_TYPE->alloc field, and fetches the same whenever it needs it for
> calling any of the allocator functions. So we can remove the
> tbm->allocator field and shift "palloc(sizeof(pagetable_alloc))" call
> from tbm_create() to tbm_create_pagetable().

Done
>
> --------------
>
> In tbm_free() :
> I think we should call pagetable_destroy() even when tbm is shared, so
> that the hash implementation gets a chance to free the hash table
> structure. I understand that the hash table structure itself is not
> big, so it will only be a small memory leak, but it's better to not
> assume that. Instead, let SH_DESTROY() call HashFree(). Then, in
> tbm_free_shared(), we can skip the dsa_free() call if tbm->iterating
> is false. Basically, tbm bitmap implementation should deduce from the
> bitmap state whether it should free the shared data, rather than
> preventing a call to SH_DESTROY().

Fixed
>
> -----------
>
> In tbm_begin_iterate(), for shared tbm, internal structures from
> simplehash.h are assumed to be known. For e.g., the hash entries will
> always be present in one single array, and also the entry status is
> evaluated using pagetable_IN_USE. Is simplehash.h designed keeping in
> mind that these things are suppose to be exposed ?
>
> I understand that the hash table handle is local to the leader worker,
> and so it is not accessible to other workers. And so, we cannot use
> pagetable_iterate() to scan the hash table. So, how about copying the
> SH_TYPE structure and making it accessible to other workers ? If we
> have something like SH_ATTACH() or SH_COPY(), this will copy the
> relevant fields that are sufficient to restore the SH_TYPE structure,
> and other workers can start using this hash table after assigning dsa
> array back to tb->data. Something like HASH_ATTACH used in dynahash.c.

This looks cleaner, and also avoid processing the data of the hash
directly. I have changed as per the suggestion.

>
> --------------
>
> In the tbm_update_shared_members() header comments :
> * Restore leaders private tbm state to shared location. This must
> * be called before waking up the other worker.
>
> Above can be changed to:
> * Store leader's private tbm state to shared location. This must
> * be called before waking up other workers.

Done

>
> --------------
>
> To be consistent with other function header comments in this file, a
> blank line is required between these two lines :
>
> * tbm_estimate_parallel_tidbitmap
> * Estimate size of shared TIDBitmap related info.
>
Done
> -------------
>
> tbm->is_shared is set in both tbm_set_parallel() and
> tbm_restore_shared_members(). I think it is needed only in
> tbm_set_parallel().

Done
>
> --------------
>
> tbm_alloc_shared() Function header comments need some typo correction :
> * It allocated memory from DSA and also stores dsa_pointer in the memory
> allocated => allocates
>
> --------------
>
> We prepend the dsa_pointer value into the shared memory data allocated
> for the pagetable entries; and we save the address of the first page
> table entry in tbm->data. But the same dsa_pointer is also stored in
> tbm->dsa_data. And tbm is accessible in tbm_free_shared(). So it does
> not look necessary to prepend dsa_pointer before the page table
> entries. In tbm_free_shared(), we can do dsa_free(tbm->area,
> tbm->dsa_data).

tbm->dsa_data allocates the latest allocated dsa_pointer. And, hash
table first allocates the new memory then free the old pointer. So
whenever it allocates the new memory we will store latest in
tbm->dsa_data, and when the hash table will call free it will call
with the local pointer and we don't have any reference to dsa pointer
for that old memory, that's the reason we store dsa pointer as header.

>
> --------------
>
> Below are my suggested changes in the algorithm comments :
>
> @@ -103,27 +103,27 @@ BitmapHeapNext(BitmapHeapScanState *node)

Changed.

Apart from these changes, I have created a separate
patch(0001-opt-parallelcost-refactoring-v7.patch) for code refactoring
in optimizer required for parallelbitmap heap scan. (earlier this was
part of my main patch)

This patch can be used by 0003-parallel-bitmap-heap-scan-v7.patch
attached in the mail and also parallel-index-scan[2] can be rebased on
this, if this get committed,

[2]
https://www.postgresql.org/message-id/CAA4eK1KthrAvNjmB2cWuUHz%2Bp3ZTTtbD7o2KUw49PC8HK4r1Wg%40mail.gmail.com

I will send the performance data with the new patch in the separate mail.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
0001-opt-parallelcost-refactoring-v7.patch application/octet-stream 12.8 KB
0002-hash-support-alloc-free-v7.patch application/octet-stream 6.2 KB
0003-parallel-bitmap-heap-scan-v7.patch application/octet-stream 44.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Haribabu Kommi 2017-01-09 08:15:39 Re: macaddr 64 bit (EUI-64) datatype support
Previous Message Jim Nasby 2017-01-09 00:50:12 Re: PoC: Make it possible to disallow WHERE-less UPDATE and DELETE