Re: Parallel bitmap heap scan

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, tushar <tushar(dot)ahuja(at)enterprisedb(dot)com>, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel bitmap heap scan
Date: 2017-02-26 15:56:34
Message-ID: CA+TgmoZF6iswmB-2LX4ySgJ-_gnGqWPnh14D_xDOszkVAfNCjw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Feb 21, 2017 at 10:20 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> 0001:
> 1. I have got a new interface, "tbm_free_shared_area(dsa_area *dsa,
> dsa_pointer dp)" which will be responsible for freeing all the shared
> members (pagetable, ptpage and ptchunk). Actually, we can not do this
> in tbm_free itself because the only leader will have a local tbm with
> reference to all these pointers and our parallel bitmap leader may not
> necessarily be the actual backend.

I'm not entirely sure about the calling convention for
tbm_free_shared_area() but the rest seems OK.

> 2. Now tbm_free is not freeing any of the shared members which can be
> accessed by other worker so tbm_free is safe to call from
> ExecEndBitmapHeapScan without any safety check or ref count.

That also seems fine. We ended up with something very similar in the
Parallel Index Scan patch.

> 0002:
> 1. We don't need ExecShutdownBitmapHeapScan anymore because now we are
> not freeing any shared member in ExecEndBitmapHeapScan.
> 2. In ExecReScanBitmapHeapScan we will call tbm_free_shared_area to
> free the shared members of the TBM.
> 3. After that, we will free TBMSharedIteratorState what we allocated
> using tbm_prepare_shared_iterate.

Check. But I think tbm_free_shared_area() should also free the object
itself, instead of making the caller do that separately.

+ if (DsaPointerIsValid(node->pstate->tbmiterator))
+ {
+ /* First we free the shared TBM members using the shared state */
+ tbm_free_shared_area(dsa, node->pstate->tbmiterator);
+ dsa_free(dsa, node->pstate->tbmiterator);
+ }
+
+ if (DsaPointerIsValid(node->pstate->prefetch_iterator))
+ dsa_free(dsa, node->pstate->prefetch_iterator);

The fact that these cases aren't symmetric suggests that your
abstraction is leaky. I'm guessing that you can't call
tbm_free_shared_area because the two iterators share one copy of the
underlying iteration arrays, and the TBM code isn't smart enough to
avoid freeing them twice. You're going to have to come up with a
better solution to that problem; nodeBitmapHeapScan.c shouldn't know
about the way the underlying storage details are managed. (Maybe you
need to reference-count the iterator arrays?)

+ if (node->inited)
+ goto start_iterate;

My first programming teacher told me not to use goto. I've
occasionally violated that rule, but I need a better reason than you
have here. It looks very easy to avoid.

+ pbms_set_parallel(outerPlanState(node));

I think this should be a flag in the plan, and the planner should set
it correctly, instead of having it be a flag in the executor that the
executor sets. Also, the flag itself should really be called
something that involves the word "shared" rather than "parallel",
because the bitmap will not be created in parallel, but it will be
shared.

Have you checked whether this patch causes any regression in the
non-parallel case? It adds a bunch of "if" statements, so it might.
Hopefully not, but it needs to be carefully tested.

@@ -48,10 +48,11 @@
#include "utils/snapmgr.h"
#include "utils/tqual.h"

-
static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);

Unnecessary.

+static bool pbms_is_leader(ParallelBitmapState *pbminfo);
+static void pbms_set_parallel(PlanState *node);

I don't think this "pbms" terminology is very good. It's dissimilar
to the way other functions in this file are named. Maybe
BitmapShouldInitializeSharedState().

I think that some of the bits that this function makes conditional on
pstate should be factored out into inline functions. Like:

- if (node->prefetch_target >= node->prefetch_maximum)
- /* don't increase any further */ ;
- else if (node->prefetch_target >= node->prefetch_maximum / 2)
- node->prefetch_target = node->prefetch_maximum;
- else if (node->prefetch_target > 0)
- node->prefetch_target *= 2;
- else
- node->prefetch_target++;
+ if (!pstate)
+ {
+ if (node->prefetch_target >= node->prefetch_maximum)
+ /* don't increase any further */ ;
+ else if (node->prefetch_target >= node->prefetch_maximum / 2)
+ node->prefetch_target = node->prefetch_maximum;
+ else if (node->prefetch_target > 0)
+ node->prefetch_target *= 2;
+ else
+ node->prefetch_target++;
+ }
+ else if (pstate->prefetch_target < node->prefetch_maximum)
+ {
+ SpinLockAcquire(&pstate->prefetch_mutex);
+ if (pstate->prefetch_target >= node->prefetch_maximum)
+ /* don't increase any further */ ;
+ else if (pstate->prefetch_target >=
+ node->prefetch_maximum / 2)
+ pstate->prefetch_target = node->prefetch_maximum;
+ else if (pstate->prefetch_target > 0)
+ pstate->prefetch_target *= 2;
+ else
+ pstate->prefetch_target++;
+ SpinLockRelease(&pstate->prefetch_mutex);
+ }

I suggest creating an inline function like BitmapAdjustPrefetch() for
this logic, and letting the code call that. The function can look
something like this: if (pstate == NULL) { non-parallel stuff; return;
} parallel stuff follows...

And similarly for the other cases where you've made the logic
conditional. This will make it more clear what's happening
post-patch, I think, and will also help keep the level of indentation
from getting out-of-control in certain places. In fact, maybe you
should submit a preliminary refactoring patch that moves these chunks
of logic into functions and then the main patch can apply over top of
that.

+ bool inited;

Suggest: initialized

- * ----------------
+ * pscan_len size of the shared memory for parallel bitmap
+ * inited is node is ready to iterate
+ * stbmiterator shared iterator
+ * sprefetch_iterator shared iterator for prefetching
+ * pstate shared state for parallel bitmap scan
+ *----------------

No need to change number of dashes.

+ * PBMState information : Current status of the TIDBitmap creation during
+ * parallel bitmap heap scan.

If you look for existing places where comments are formatted like
this, I bet you won't find many. Copy the surrounding style more.

+ dsa_pointer tbmiterator;
+ dsa_pointer prefetch_iterator;
+ slock_t prefetch_mutex;
+ int prefetch_pages;
+ int prefetch_target;
+ bool prefetching;
+ slock_t state_mutex;
+ PBMState state;
+ ConditionVariable cv;
+ char phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER];

I think it is probably not a good idea to have two separate mutexes
here. They'll be in the same cache line, so it won't be much faster
than having one mutex, and the state mutex won't be acquired very
often so you won't really gain anything anyway. I think you can just
merge the mutexes into one called 'mutex'.

+ /* Allow only one process to prefetch */

If this is a good idea, there should be a comment explaining why.

pbms_is_leader() looks horrifically inefficient. Every worker will
reacquire the spinlock for every tuple. You should only have to enter
this spinlock-acquiring loop for the first tuple. After that, either
you became the leader, did the initialization, and set the state to
PBM_FINISHED, or you waited until the pre-existing leader did the
same. You should have a backend-local flag that keeps you from
touching the spinlock for every tuple. I wouldn't be surprised if
fixing this nets a noticeable performance increase for this patch at
high worker counts.

+ TBMSharedIterator *stbmiterator;
+ TBMSharedIterator *sprefetch_iterator;

Maybe shared_iterator and shared_prefetch_iterator.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2017-02-26 16:46:50 Re: Proposal : For Auto-Prewarm.
Previous Message Erik Rijkers 2017-02-26 15:05:28 Re: Logical replication existing data copy