Re: Parallel bitmap heap scan

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, tushar <tushar(dot)ahuja(at)enterprisedb(dot)com>, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel bitmap heap scan
Date: 2017-02-16 20:31:07
Message-ID: CA+Tgmoa6TobamkaDmjX=BSRZjJSofXFphivzHB9-y2P=ttDRFg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Feb 16, 2017 at 10:54 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> interface_dsa_allocate0.patch : Provides new interface dsa_allocate0,
> which is used by 0001

Committed. I moved the function prototype for dsa_allocate0() next to
the existing prototype for dsa_allocate(). Please try to think about
things like this when you add functions or prototypes or structure
members: put them in a natural place where they are close to related
things, not just wherever happens to be convenient.

> gather_shutdown_childeren_first.patch : Processing the child node

This same problem came up on the Parallel Hash thread. I mentioned
this proposed fix over there; let's see what he (or anyone else) has
to say about it.

> 0001: tidbimap change to provide the interfaces for shared iterator.

+ * in order to share relptrs of the chunk and the pages arrays and other
+ * TBM related information with other workers.

No relptrs any more.

+ dsa_pointer dsapagetable; /* dsa_pointer to the element array */
+ dsa_pointer dsapagetableold; /* dsa_pointer to the old element array */
+ dsa_area *dsa; /* reference to per-query dsa area */
+ dsa_pointer ptpages; /* dsa_pointer to the page array */
+ dsa_pointer ptchunks; /* dsa_pointer to the chunk array */

Let's put the DSA pointer first and then the other stuff after it.
That seems more logical.

+typedef struct TBMSharedIteratorState
+{
+ int spageptr; /* next spages index */
+ int schunkptr; /* next schunks index */
+ int schunkbit; /* next bit to check in current schunk */
+ LWLock lock; /* lock to protect above members */
+ dsa_pointer pagetable; /* dsa pointers to head of pagetable data */
+ dsa_pointer spages; /* dsa pointer to page array */
+ dsa_pointer schunks; /* dsa pointer to chunk array */
+ int nentries; /* number of entries in pagetable */
+ int maxentries; /* limit on same to meet maxbytes */
+ int npages; /* number of exact entries in
pagetable */
+ int nchunks; /* number of lossy entries in pagetable */
+} TBMSharedIteratorState;

I think you've got this largely backwards from the most logical order.
Let's order it like this: nentries, maxentries, npages, nchunks,
pagetable, spages, schunks, lock (to protect BELOW members), spageptr,
chunkptr, schunkbit.

+struct TBMSharedIterator
+{
+ PagetableEntry *ptbase; /* pointer to the pagetable element array */
+ dsa_area *dsa; /* reference to per-query dsa area */
+ int *ptpages; /* sorted exact page index list */
+ int *ptchunks; /* sorted lossy page index list */
+ TBMSharedIteratorState *state; /* shared members */
+ TBMIterateResult output; /* MUST BE LAST (because variable-size) */
+};

Do we really need "dsa" here when it's already present in the shared
state? It doesn't seem like we even use it for anything. It's needed
to create each backend's TBMSharedIterator, but not afterwards, I
think.

In terms of ordering, I'd move "state" to the top of the structure,
just like "tbm" comes first in a TBMIterator.

+ * total memory consumption. If dsa passed to this function is not NULL
+ * then the memory for storing elements of the underlying page table will
+ * be allocated from the DSA.

Notice that you capitalized "DSA" in two different ways in the same
sentence; I'd go for the all-caps version. Also, you need the word
"the" before the first one.

+ if (tbm->status == TBM_HASH && (tbm->iterating == TBM_NOT_ITERATING))

Excess parentheses.

+ * tbm_prepare_shared_iterate - prepare to iterator through a TIDBitmap
+ * by multiple workers using shared iterator.

Prepare to iterate, not prepare to iterator. I suggest rephrasing
this as "prepare shared iteration state for a TIDBitmap".

+ * The TBMSharedIteratorState will be allocated from DSA so that multiple
+ * worker can attach to it and iterate jointly.

Maybe change to "The necessary shared state will be allocated from the
DSA passed to tbm_create, so that multiple workers can attach to it
and iterate jointly".

+ * This will convert the pagetable hash into page and chunk array of the index
+ * into pagetable array so that multiple workers can use it to get the actual
+ * page entry.

I think you can leave off everything starting from "so that". It's
basically redundant with what you already said.

+ dsa_pointer iterator;
+ TBMSharedIteratorState *iterator_state;

These aren't really great variable names, because the latter isn't the
state associated with the former. They're both the same object.
Maybe just "dp" and "istate".

I think this function should Assert(tbm->dsa != NULL) and
Assert(tbm->iterating != TBM_ITERATING_PRIVATE), and similarly
tbm_begin_iterate should Assert(tbm->iterating != TBM_ITERATE_SHARED).

+ /*
+ * If we have a hashtable, create and fill the sorted page lists, unless
+ * we already did that for a previous iterator. Note that the lists are
+ * attached to the bitmap not the iterator, so they can be used by more
+ * than one iterator. However, we keep dsa_pointer to these in the shared
+ * iterator so that other workers can access them directly.
+ */

This is mostly cut-and-pasted from tbm_begin_iterate() but it's not
really correct here now because (1) we're no longer trying to fake up
a TIDBitmap proper in every backend and (2) this code runs even if we
don't have a hashtable. I think the comment should be something like
"If we're not already iterating, create and fill the sorted page
lists. (If we are, the sorted page lists are already stored in the
TIDBitmap, and we can just reuse them.)"

+ * Create page list and chunk list using relptr so that we can share
+ * this information across multiple workers.

No relptrs any more.

+ tbm->ptpages = dsa_allocate(tbm->dsa, tbm->npages * (sizeof(int)));

Extra parentheses.

+ tbm->nchunks * (sizeof(int)));

Extra parentheses.

+ * If TBM status is TBM_HASH then iterate over the pagetable and

"If the TBM status is"...

+ * directly store it's index i.e. 0 in page array

s/it's/its/

Don't you also need some code here to handle the TBM_EMPTY case?

+ /*
+ * Store the TBM member in the shared state so that we can share them
+ * across multiple workers.
+ */
+ iterator_state->maxentries = tbm->maxentries;
+ iterator_state->nchunks = tbm->nchunks;
+ iterator_state->nentries = tbm->nentries;
+ iterator_state->npages = tbm->npages;
+ iterator_state->pagetable = tbm->dsapagetable;
+ iterator_state->spages = tbm->ptpages;
+ iterator_state->schunks = tbm->ptchunks;
+
+ /* Initialize the shared iterator state */
+ iterator_state->schunkbit = 0;
+ iterator_state->schunkptr = 0;
+ iterator_state->spageptr = 0;
+
+ /* Initialize the iterator lock */
+ LWLockInitialize(&iterator_state->lock, LWTRANCHE_PARALLEL_TBM_ITERATOR);

Set all of the structure members in the same order that you declare them.

+ * tbm_extract_page_tuple - extract the tuple offsets from the page

s/the page/a page/

+ * Process the page bits to extract the tuple offsets and store them into
+ * TBMIterateResult.

This duplicates the preceding, a bit. Maybe just "The extracted
offsets are stored into the TBMIterateResult".

+/*
+ * tbm_advance_schunkbit
+ *
+ * Advance the chunkbit to get the next page entry from the chunk
+ */

The formatting of this header comment is randomly different from the
preceding and following header comments.

I would change the argument to schunkbitp, declare a local variable
named schunkbit, and do int schunkbit = *schunkbitp at the top and
*schunkbitp = schunkbit at the bottom.

+ * As above, but this will iterate using shared iterator which is shared
+ * across multiple workers. We need to acquire the iterator LWLock, before
+ * accessing the shared members.

"using shared iterator which is shared" -> "using an iterator which is shared"
"multiple workers" -> "multiple processes"

+ PagetableEntry *chunk = ptbase + ptchunks[state->schunkptr];

Maybe &ptbase[ptchunks[state->schunkptr]] would look a little nicer.

+ PagetableEntry *chunk = ptbase + ptchunks[state->schunkptr];
+ PagetableEntry *page = ptbase + ptpages[state->spageptr];
+ PagetableEntry *page = ptbase + ptpages[state->spageptr];

Similarly.

+ * tbm_end_shared_iterate - finish an iteration over a TIDBitmap

Maybe s/an iteration/a shared iteration/

+ * As above, but it frees the memory of TBMSharedIterator.

Instead of this, I'd write "This doesn't free any of the shared state
associated with the iterator, just our backend-private state."

+ PagetableEntry *lpage = ((PagetableEntry *) arg) + *((int *) left);
+ PagetableEntry *rpage = ((PagetableEntry *) arg) + *((int *) right);

Again, try to use array dereference notation rather than pointer
arithmetic where you can. Maybe:

PageTableEntry *base = (PageTableEntry *) arg;
PageTableEntry *lpage = &arg[*(int *) left];
etc.

+tbm_attach_shared_iterate(dsa_area *dsa, dsa_pointer iterator)
+{
+ TBMSharedIterator *shared_iterator;
+ TBMSharedIteratorState *shared_state;

Again, variable naming issues. Let's go with dsa_pointer dp,
TBMSharedIterator *iterator, TBMSharedIteratorState *istate. Please
try to go through and always use the same variable name for the same
kind of object. It's confusing when "iterator" sometimes means a
TBMSharedIterator * and sometimes a TBMSharedIteratorState * and
sometimes a dsa_pointer. Make it all consistent (and hopefully
logical).

+ shared_iterator = (TBMSharedIterator *) palloc(sizeof(TBMIterator) +
+ MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));

Wrong sizeof. (This isn't just a style issue - it's a bug!)

+ * Callback function for allocating the memory for hashtable elements.
+ * It allocates memory from DSA if tbm holds a reference to a dsa.

Inconsistent capitalization of "DSA" again.

Maybe change the whole comment to "Allocate memory for hashtable
elements, using DSA if available."

+ if (tbm->dsapagetable)

Test against InvalidDsaPointer. Or maybe just get rid of the "if" and
do it unconditionally; if dsapagetable is InvalidDsaPointer then
dsapagetableold will be too, so no harm done.

+ LWTRANCHE_PARALLEL_TBM_ITERATOR,

Change to LWLTRANCHE_TBM, add a call to LWLockRegisterTranche in
lwlock.c, and patch the docs with the new tranche name.

Despite this fairly extensive list of mostly-trivial cosmetic
problems, I think this is in fairly good shape now. I think the basic
design is right, so once we get these loose ends tied up this should
be good to go in.

> 0002: actual parallel bitmap heap scan by using interfaces of 0001.

This needs a lot more extensive review than I quite have the time and
energy for right now, but here are a few comments.

+ <entry><literal>ParallelBitmapScan</></entry>

ParallelBitmapPopulate, maybe? You're not waiting for the scan;
you're waiting for the bitmap to be populated.

+ <entry>Waiting for leader backend to complete the TidBitmap.</entry>

Likewise, complete -> populate?

+ /* If any limit was set to zero, the user doesn't want a parallel scan. */

If we got zero? There aren't multiple limits mentioned anywhere
nearby, so this reads oddly.

+ /*
+ * It may be possible to amortize some of the I/O cost, but probably
+ * not very much, because most operating systems already do aggressive
+ * prefetching. For now, we assume that the disk run cost can't be
+ * amortized at all.
+ */

This looks like it was cut-and-pasted from the sequential scan case,
but it doesn't really apply here. This presupposes we're relying on
OS prefetching, but in the case of a bitmap heap scan we do our own
prefetching. Maybe change to "Since we do prefetching in both the
parallel and non-parallel cases, we don't expect to get substantially
more I/O parallelism by spreading the scan across multiple backends.
So, assume the disk run cost can't be amortized at all."

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabien COELHO 2017-02-16 20:35:06 Re: Small issue in online devel documentation build
Previous Message Robert Haas 2017-02-16 20:25:54 Re: Avoiding OOM in a hash join with many duplicate inner keys