Re: Parallel bitmap heap scan

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel bitmap heap scan
Date: 2016-11-23 07:01:24
Message-ID: CAFiTN-ug-SmVrb_nF2s73WE0biNULsHXSU+s3kY2SCrQBdodRg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 23, 2016 at 7:24 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> So, I had a brief look at this tonight. This is not a full review,
> but just some things I noticed:

Thanks for the review.

>
> + * Update snpashot info in heap scan descriptor.
>
> Typo. Also, why should we have a function for this at all? And if we
> do have a function for this, why should it have "bm" in the name when
> it's stored in heapam.c?
We update the snapshot in the HeapScanDesc, which is why I put this in a
function and kept it in heapam.c, alongside the other helper,
heap_beginscan_bm.
But I agree the "bm" isn't needed in the name; the function doesn't do
anything bitmap-specific. I will rename it.
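For reference, the body of that helper is tiny; a rough sketch of what I have
in mind, assuming the existing HeapScanDesc fields (rs_snapshot, rs_temp_snap)
and with the final name still to be decided:

/*
 * Sketch only: install a new snapshot into an already-started heap scan.
 * Kept in heapam.c because the executor should not poke at HeapScanDesc
 * fields directly.
 */
void
heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot)
{
	Assert(IsMVCCSnapshot(snapshot));

	RegisterSnapshot(snapshot);
	scan->rs_snapshot = snapshot;
	scan->rs_temp_snap = true;	/* heap_endscan will unregister it */
}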

>
> + * [PARALLEL BITMAP HEAP SCAN ALGORITHM]
> + *
> + * #1. Shared TIDBitmap creation and initialization
> + * a) First worker to see the state as parallel bitmap info as
> + * PBM_INITIAL become leader and set the state to PBM_INPROGRESS
> + * All other workers see the state as PBM_INPROGRESS will wait for
> + * leader to complete the TIDBitmap.
> + *
> + * Leader Worker Processing:
> + * (Leader is responsible for creating shared TIDBitmap and create
> + * shared page and chunk array from TIDBitmap.)
> + * 1) Create TIDBitmap using DHT.
> + * 2) Begin Iterate: convert hash table into shared page and chunk
> + * array.
> + * 3) Restore local TIDBitmap variable information into
> + * ParallelBitmapInfo so that other worker can see those.
> + * 4) set state to PBM_FINISHED.
> + * 5) Wake up other workers.
> + *
> + * Other Worker Processing:
> + * 1) Wait until leader create shared TIDBitmap and shared page
> + * and chunk array.
> + * 2) Attach to shared page table, copy TIDBitmap from
> + * ParallelBitmapInfo to local TIDBitmap, we copy this to local
> + * TIDBitmap so that next level processing can read information
> + * same as in non parallel case and we can avoid extra changes
> + * in code.
> + *
> + * # At this level TIDBitmap is ready and all workers are awake #
> + *
> + * #2. Bitmap processing (Iterate and process the pages).
> + * . In this phase each worker will iterate over page and
> chunk array and
> + * select heap pages one by one. If prefetch is enable then there will
> + * be two iterator.
> + * . Since multiple worker are iterating over same page and chunk array
> + * we need to have a shared iterator, so we grab a spin lock and iterate
> + * within a lock.
>
> The formatting of this comment is completely haphazard.

I will fix this.

"Leader
> worker" is not a term that has any meaning. A given backend involved
> in a parallel query is either a leader or a worker, not both.

I agree this is confusing, but we can't simply call it the "leader", because
IMHO the leader is the backend that actually spawns the workers and gathers
the results. By "leader worker" I meant the worker that takes responsibility
for building the TIDBitmap, which can be any of the workers, hence the name.
Let me think about what else we could call it.
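Concretely, what I have in mind for that check is something like the sketch
below. This is only a sketch: the state_mutex and cv fields and the PBMState
typedef are provisional, and the wait could equally be done with a latch or
with Thomas's barrier instead of a condition variable.

/*
 * Sketch only: the first participant to see PBM_INITIAL claims the build
 * and becomes the "leader worker"; everyone else sleeps until the bitmap
 * reaches PBM_FINISHED.
 */
static bool
pbms_is_leader(ParallelBitmapInfo *pbminfo)
{
	PBMState	state;

	for (;;)
	{
		SpinLockAcquire(&pbminfo->state_mutex);
		state = pbminfo->state;
		if (state == PBM_INITIAL)
			pbminfo->state = PBM_INPROGRESS;
		SpinLockRelease(&pbminfo->state_mutex);

		/* We claimed the build, or the bitmap is already finished. */
		if (state != PBM_INPROGRESS)
			break;

		/* Somebody else is building the bitmap; wait to be woken up. */
		ConditionVariableSleep(&pbminfo->cv, WAIT_EVENT_PARALLEL_BITMAP_SCAN);
	}

	ConditionVariableCancelSleep();

	return (state == PBM_INITIAL);
}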

>
> + /* reset parallel bitmap scan info, if present */
> + if (node->parallel_bitmap)
> + {
> + ParallelBitmapInfo *pbminfo = node->parallel_bitmap;
> +
> + pbminfo->state = PBM_INITIAL;
> + pbminfo->tbmiterator.schunkbit = 0;
> + pbminfo->tbmiterator.spageptr = 0;
> + pbminfo->tbmiterator.schunkptr = 0;
> + pbminfo->prefetch_iterator.schunkbit = 0;
> + pbminfo->prefetch_iterator.spageptr = 0;
> + pbminfo->prefetch_iterator.schunkptr = 0;
> + pbminfo->prefetch_pages = 0;
> + pbminfo->prefetch_target = -1;
> + }
>
> This is obviously not going to work in the face of concurrent
> activity. When we did Parallel Seq Scan, we didn't make any changes
> to the rescan code at all. I think we are assuming that there's no
> way to cause a rescan of the driving table of a parallel query; if
> that's wrong, we'll need some fix, but this isn't it. I would just
> leave this out.

We have a similar change in heap_rescan:

	if (scan->rs_parallel != NULL)
	{
		parallel_scan = scan->rs_parallel;
		SpinLockAcquire(&parallel_scan->phs_mutex);
		parallel_scan->phs_cblock = parallel_scan->phs_startblock;
		SpinLockRelease(&parallel_scan->phs_mutex);
	}

And this is not for the driving table; it's required for the case where the
Gather is the inner node of a join. Consider the example below. I know it's
a bad plan, but we can produce it :)

	NestLoop
	  -> Seq Scan on t1                      (outer node)
	  -> Gather                              (inner node)
	        -> Parallel Seq Scan on t2

The reason is that during ExecReScanGather we don't create a new DSM;
instead we just reinitialize the existing one (ExecParallelReinitialize).
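That said, to address the concurrency point, the reset should at least be
done under the struct's mutex, the same way heap_rescan does it for
ParallelHeapScanDesc. Roughly (a sketch only, assuming ParallelBitmapInfo
carries a state_mutex spinlock):

	if (node->parallel_bitmap)
	{
		ParallelBitmapInfo *pbminfo = node->parallel_bitmap;

		SpinLockAcquire(&pbminfo->state_mutex);
		pbminfo->state = PBM_INITIAL;

		/* rewind both shared iterators */
		pbminfo->tbmiterator.spageptr = 0;
		pbminfo->tbmiterator.schunkptr = 0;
		pbminfo->tbmiterator.schunkbit = 0;
		pbminfo->prefetch_iterator.spageptr = 0;
		pbminfo->prefetch_iterator.schunkptr = 0;
		pbminfo->prefetch_iterator.schunkbit = 0;

		pbminfo->prefetch_pages = 0;
		pbminfo->prefetch_target = -1;
		SpinLockRelease(&pbminfo->state_mutex);
	}

By the time the Gather node gets re-scanned the workers should already have
been shut down, so nothing ought to be touching this concurrently, but taking
the mutex keeps it consistent with what heap_rescan does.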

>
> +static bool
> +pbms_is_leader(ParallelBitmapInfo *pbminfo)
>
> I think you should see if you can use Thomas Munro's barrier stuff for
> this instead.

Okay, I will think it over.

>
> + SerializeSnapshot(estate->es_snapshot, pbminfo->phs_snapshot_data);
> +
> + shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pbminfo);
> +
> + node->parallel_bitmap = pbminfo;
> + snapshot = RestoreSnapshot(pbminfo->phs_snapshot_data);
> +
> + heap_bm_update_snapshot(node->ss.ss_currentScanDesc, snapshot);
>
> This doesn't make any sense. You serialize the snapshot from the
> estate, then restore it, then shove it into the scan descriptor. But
> presumably that's already the snapshot the scan descriptor is using.
> The workers need to do this, perhaps, but not the leader!

Right, this is wrong; I will fix it.
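To be explicit about the fix: the leader should only serialize the snapshot
into the DSM, and only the workers should restore it into their scan
descriptors. A rough sketch of the split I intend (the two initialization
callbacks follow the parallel seq scan naming and may end up named
differently):

	/* In the DSM-initialization callback (leader): serialize and publish. */
	SerializeSnapshot(estate->es_snapshot, pbminfo->phs_snapshot_data);
	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pbminfo);
	node->parallel_bitmap = pbminfo;

	/* In the worker-initialization callback: look up, restore, install. */
	pbminfo = shm_toc_lookup(toc, node->ss.ps.plan->plan_node_id);
	node->parallel_bitmap = pbminfo;
	snapshot = RestoreSnapshot(pbminfo->phs_snapshot_data);
	heap_bm_update_snapshot(node->ss.ss_currentScanDesc, snapshot);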

>
> + dht_parameters params = {0};
>
> Not PostgreSQL style.

I will fix it.
>
> +#define TBM_IS_SHARED(tbm) (tbm)->shared
>
> Seems pointless.
Ok.
>
> + bool shared; /* need to build shared tbm if set*/
>
> Style.

Ok.
>
> + params.tranche_id = LWLockNewTrancheId();
>
> You absolutely, positively cannot burn through tranche IDs like this.
>
> + if (tbm->shared_pagetable)
> + dht_detach(tbm->shared_pagetable);
>
> Hmm, would we leak references if we errored out?

I will check this part. In any case, in my next version I am working on
making the patch independent of DHT, so this part will be removed.

>
> @@ -47,7 +47,6 @@ typedef enum
>
> static List *translate_sub_tlist(List *tlist, int relid);
>
> -
> /*****************************************************************************
> * MISC. PATH UTILITIES
> *****************************************************************************/
>
> Useless whitespace change.
>
> @@ -23,7 +23,6 @@
> #include "utils/relcache.h"
> #include "utils/snapshot.h"
>
> -
> /* "options" flag bits for heap_insert */
> #define HEAP_INSERT_SKIP_WAL 0x0001
> #define HEAP_INSERT_SKIP_FSM 0x0002
>
> Useless whitespace change.
>
> WAIT_EVENT_MQ_RECEIVE,
> WAIT_EVENT_MQ_SEND,
> WAIT_EVENT_PARALLEL_FINISH,
> + WAIT_EVENT_PARALLEL_BITMAP_SCAN,
> WAIT_EVENT_SAFE_SNAPSHOT,
> WAIT_EVENT_SYNC_REP
>
> Missing a documentation update.

I will fix these in the next version.
>
> In general, the amount of change in nodeBitmapHeapScan.c seems larger
> than I would have expected. My copy of that file has 655 lines; this
> patch adds 544 additional lines. I think/hope that some of that can
> be simplified away.

I will work on this.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
