Re: Parallel bitmap heap scan

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, tushar <tushar(dot)ahuja(at)enterprisedb(dot)com>, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel bitmap heap scan
Date: 2017-02-18 17:15:40
Message-ID: CAFiTN-s5KuRuDrQCEpiHHzmVf7JTtbnb8eb10c-6AywJDxbWrA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Feb 17, 2017 at 2:01 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> + * in order to share relptrs of the chunk and the pages arrays and other
> + * TBM related information with other workers.
>
> No relptrs any more.

Fixed
>
> + dsa_pointer dsapagetable; /* dsa_pointer to the element array */
> + dsa_pointer dsapagetableold; /* dsa_pointer to the old element array */
> + dsa_area *dsa; /* reference to per-query dsa area */
> + dsa_pointer ptpages; /* dsa_pointer to the page array */
> + dsa_pointer ptchunks; /* dsa_pointer to the chunk array */
>
> Let's put the DSA pointer first and then the other stuff after it.
> That seems more logical.

Done that way
>
> +typedef struct TBMSharedIteratorState
> +{
> + int spageptr; /* next spages index */
> + int schunkptr; /* next schunks index */
> + int schunkbit; /* next bit to check in current schunk */
> + LWLock lock; /* lock to protect above members */
> + dsa_pointer pagetable; /* dsa pointers to head of pagetable data */
> + dsa_pointer spages; /* dsa pointer to page array */
> + dsa_pointer schunks; /* dsa pointer to chunk array */
> + int nentries; /* number of entries in pagetable */
> + int maxentries; /* limit on same to meet maxbytes */
> + int npages; /* number of exact entries in
> pagetable */
> + int nchunks; /* number of lossy entries in pagetable */
> +} TBMSharedIteratorState;
>
> I think you've got this largely backwards from the most logical order.
> Let's order it like this: nentries, maxentries, npages, nchunks,
> pagetable, spages, schunks, lock (to protect BELOW members), spageptr,
> chunkptr, schunkbit.

Done
>
> +struct TBMSharedIterator
> +{
> + PagetableEntry *ptbase; /* pointer to the pagetable element array */
> + dsa_area *dsa; /* reference to per-query dsa area */
> + int *ptpages; /* sorted exact page index list */
> + int *ptchunks; /* sorted lossy page index list */
> + TBMSharedIteratorState *state; /* shared members */
> + TBMIterateResult output; /* MUST BE LAST (because variable-size) */
> +};
>
> Do we really need "dsa" here when it's already present in the shared
> state? It doesn't seem like we even use it for anything. It's needed
> to create each backend's TBMSharedIterator, but not afterwards, I
> think.

Right, removed.

>
> In terms of ordering, I'd move "state" to the top of the structure,
> just like "tbm" comes first in a TBMIterator.

Yeah, that looks better. Done that way.
>
> + * total memory consumption. If dsa passed to this function is not NULL
> + * then the memory for storing elements of the underlying page table will
> + * be allocated from the DSA.
>
> Notice that you capitalized "DSA" in two different ways in the same
> sentence; I'd go for the all-caps version. Also, you need the word
> "the" before the first one.

Fixed all such instances.

>
> + if (tbm->status == TBM_HASH && (tbm->iterating == TBM_NOT_ITERATING))
>
> Excess parentheses.
Fixed
>
> + * tbm_prepare_shared_iterate - prepare to iterator through a TIDBitmap
> + * by multiple workers using shared iterator.
>
> Prepare to iterate, not prepare to iterator. I suggest rephrasing
> this as "prepare shared iteration state for a TIDBitmap".

Fixed.
>
> + * The TBMSharedIteratorState will be allocated from DSA so that multiple
> + * worker can attach to it and iterate jointly.
>
> Maybe change to "The necessary shared state will be allocated from the
> DSA passed to tbm_create, so that multiple workers can attach to it
> and iterate jointly".

Done.
>
> + * This will convert the pagetable hash into page and chunk array of the index
> + * into pagetable array so that multiple workers can use it to get the actual
> + * page entry.
>
> I think you can leave off everything starting from "so that". It's
> basically redundant with what you already said.

Done
>
> + dsa_pointer iterator;
> + TBMSharedIteratorState *iterator_state;
>
> These aren't really great variable names, because the latter isn't the
> state associated with the former. They're both the same object.
> Maybe just "dp" and "istate".

Done
>
> I think this function should Assert(tbm->dsa != NULL) and
> Assert(tbm->iterating != TBM_ITERATING_PRIVATE), and similarly
> tbm_begin_iterate should Assert(tbm->iterating != TBM_ITERATE_SHARED).

Done
>
> + /*
> + * If we have a hashtable, create and fill the sorted page lists, unless
> + * we already did that for a previous iterator. Note that the lists are
> + * attached to the bitmap not the iterator, so they can be used by more
> + * than one iterator. However, we keep dsa_pointer to these in the shared
> + * iterator so that other workers can access them directly.
> + */
>
> This is mostly cut-and-pasted from tbm_begin_iterate() but it's not
> really correct here now because (1) we're no longer trying to fake up
> a TIDBitmap proper in every backend and (2) this code runs even if we
> don't have a hashtable. I think the comment should be something like
> "If we're not already iterating, create and fill the sorted page
> lists. (If we are, the sorted page lists are already stored in the
> TIDBitmap, and we can just reuse them.)"

Done
>
> + * Create page list and chunk list using relptr so that we can share
> + * this information across multiple workers.
>
> No relptrs any more.

Done
>
> + tbm->ptpages = dsa_allocate(tbm->dsa, tbm->npages * (sizeof(int)));
>
> Extra parentheses.
>
> + tbm->nchunks * (sizeof(int)));
>
> Extra parentheses.
Fixed
>
> + * If TBM status is TBM_HASH then iterate over the pagetable and
>
> "If the TBM status is"...
>
> + * directly store it's index i.e. 0 in page array
>
> s/it's/its/
Done
>
> Don't you also need some code here to handle the TBM_EMPTY case?,

IMHO, we don't need to do anything for TBM_EMPTY: npages and nchunks
will both be zero, so the iterator will handle that case naturally
(the same as is done in the non-parallel case).

>
> + /*
> + * Store the TBM member in the shared state so that we can share them
> + * across multiple workers.
> + */
> + iterator_state->maxentries = tbm->maxentries;
> + iterator_state->nchunks = tbm->nchunks;
> + iterator_state->nentries = tbm->nentries;
> + iterator_state->npages = tbm->npages;
> + iterator_state->pagetable = tbm->dsapagetable;
> + iterator_state->spages = tbm->ptpages;
> + iterator_state->schunks = tbm->ptchunks;
> +
> + /* Initialize the shared iterator state */
> + iterator_state->schunkbit = 0;
> + iterator_state->schunkptr = 0;
> + iterator_state->spageptr = 0;
> +
> + /* Initialize the iterator lock */
> + LWLockInitialize(&iterator_state->lock, LWTRANCHE_PARALLEL_TBM_ITERATOR);
>
> Set all of the structure members in the same order that you declare them.

Done.
>
> + * tbm_extract_page_tuple - extract the tuple offsets from the page
>
> s/the page/a page/
>
> + * Process the page bits to extract the tuple offsets and store them into
> + * TBMIterateResult.
>
> This duplicates the preceding, a bit. Maybe just "The extracted
> offsets are stored into the TBMIterateResult".

Done
>
> +/*
> + * tbm_advance_schunkbit
> + *
> + * Advance the chunkbit to get the next page entry from the chunk
> + */
>
> The formatting of this header comment is randomly different from the
> preceding and following header comments.

It seems different functions in this file use different formatting for
their header comments, but I have made it match the nearby functions.
>
> I would change the argument to schunkbitp, declare a local variable
> named schunkbit, and do int schunkbit = *schunkbitp at the top and
> *schunkbitp = schunkbit at the bottom.
>

Fixed

> + * As above, but this will iterate using shared iterator which is shared
> + * across multiple workers. We need to acquire the iterator LWLock, before
> + * accessing the shared members.
>
> "using shared iterator which is shared" -> "using an iterator which is shared"
> "multiple workers" -> "multiple processes"

Done
>
> + PagetableEntry *chunk = ptbase + ptchunks[state->schunkptr];
>
> Maybe &ptbase[ptchunks[state->schunkptr]] would look a little nicer.
>
> + PagetableEntry *chunk = ptbase + ptchunks[state->schunkptr];
> + PagetableEntry *page = ptbase + ptpages[state->spageptr];
> + PagetableEntry *page = ptbase + ptpages[state->spageptr];
>
> Similarly.

Done
>
> + * tbm_end_shared_iterate - finish an iteration over a TIDBitmap
>
> Maybe s/an iteration/a shared iteration/
>
> + * As above, but it frees the memory of TBMSharedIterator.
>
> Instead of this, I'd write "This doesn't free any of the shared state
> associated with the iterator, just our backend-private state."

Done
>
> + PagetableEntry *lpage = ((PagetableEntry *) arg) + *((int *) left);
> + PagetableEntry *rpage = ((PagetableEntry *) arg) + *((int *) right);
>
> Again, try to use array dereference notation rather than pointer
> arithmetic where you can. Maybe:
>
> PageTableEntry *base = (PageTableEntry *) arg;
> PageTableEntry *lpage = &arg[*(int *) left];
> etc.
Done

>
> +tbm_attach_shared_iterate(dsa_area *dsa, dsa_pointer iterator)
> +{
> + TBMSharedIterator *shared_iterator;
> + TBMSharedIteratorState *shared_state;
>
> Again, variable naming issues. Let's go with dsa_pointer dp,
> TBMSharedIterator *iterator, TBMSharedIteratorState *istate. Please
> try to go through and always use the same variable name for the same
> kind of object. It's confusing when "iterator" sometimes means a
> TBMSharedIterator * and sometimes a TBMSharedIteratorState * and
> sometimes a dsa_pointer. Make it all consistent (and hopefully
> logical).
Done
>
> + shared_iterator = (TBMSharedIterator *) palloc(sizeof(TBMIterator) +
> + MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
>
> Wrong sizeof. (This isn't just a style issue - it's a bug!)

Right, it's a bug: the allocation was sized by sizeof(TBMIterator)
instead of sizeof(TBMSharedIterator). Fixed.
>
> + * Callback function for allocating the memory for hashtable elements.
> + * It allocates memory from DSA if tbm holds a reference to a dsa.
>
> Inconsistent capitalization of "DSA" again.
>
> Maybe change the whole comment to "Allocate memory for hashtable
> elements, using DSA if available."

Fixed.
>
> + if (tbm->dsapagetable)
>
> Test against InvalidDsaPointer. Or maybe just get rid of the "if" and
> do it unconditionally; if dsapagetable is InvalidDsaPointer then
> dsapagetableold will be too, so no harm done.

Right, fixed.
>
> + LWTRANCHE_PARALLEL_TBM_ITERATOR,
>
> Change to LWLTRANCHE_TBM, add a call to LWLockRegisterTranche in
> lwlock.c, and patch the docs with the new tranche name.

Fixed
>
> Despite this fairly extensive list of mostly-trivial cosmetic
> problems, I think this is in fairly good shape now. I think the basic
> design is right, so once we get these loose ends tied up this should
> be good to go in.
>
>> 0002: actual parallel bitmap heap scan by using interfaces of 0001.
>
> This needs a lot more extensive review than I quite have the time and
> energy for right now, but here are a few comments.
>
> + <entry><literal>ParallelBitmapScan</></entry>
>
> ParallelBitmapPopulate, maybe? You're not waiting for the scan;
> you're waiting for the bitmap to be populated.
Done
>
> + <entry>Waiting for leader backend to complete the TidBitmap.</entry>
>
> Likewise, complete -> populate?
Done
>
> + /* If any limit was set to zero, the user doesn't want a parallel scan. */
>
> If we got zero? There aren't multiple limits mentioned anywhere
> nearby, so this reads oddly.

Removed
>
> + /*
> + * It may be possible to amortize some of the I/O cost, but probably
> + * not very much, because most operating systems already do aggressive
> + * prefetching. For now, we assume that the disk run cost can't be
> + * amortized at all.
> + */
>
> This looks like it was cut-and-pasted from the sequential scan case,
> but it doesn't really apply here. This presupposes we're relying on
> OS prefetching, but in the case of a bitmap heap scan we do our own
> prefetching. Maybe change to "Since we do prefetching in both the
> parallel and non-parallel cases, we don't expect to get substantially
> more I/O parallelism by spreading the scan across multiple backends.
> So, assume the disk run cost can't be amortized at all."

Removed this comment.

Apart from these, there are some more changes.

in 0001:
- Improved the comments for TBMSharedIteratorState and TBMSharedIterator.
- Added error handling for the pointer returned by dsa_allocate.
- Moved function declarations in tidbitmap.h so that functionally
related ones are kept together.

in 0002:
- Improved comments.
- Code refactoring in BitmapHeapNext.
- Removed the local tbm creation in BitmapHeapNext: with the new
tidbitmap interfaces it is no longer needed.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
gather_shutdown_children_first.patch application/octet-stream 1.0 KB
0001-tidbitmap-support-shared-v4.patch application/octet-stream 23.8 KB
0002-parallel-bitmap-heapscan-v4.patch application/octet-stream 34.6 KB
