Re: Parallel bitmap heap scan

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, tushar <tushar(dot)ahuja(at)enterprisedb(dot)com>, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel bitmap heap scan
Date: 2017-02-01 17:39:51
Message-ID: CA+TgmobBP7=Tnk8cL-mz4QZupwNbN3aFcY7N2eNZbrSBxDBK0Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 31, 2017 at 6:05 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>> 0002-hash-support-alloc-free-v14.patch:
>>
>>
>> + if (tb->alloc)
>> + {
>>
>> The memory for tb->alloc is allocated always, is the if check still
>> required?
>
> In parallel case, only first worker will call SH_CREATE, other worker
> will only do palloc for pagetable and copy the reference from main
> worker, so they will not have allocator.

When Andres wrote in
https://www.postgresql.org/message-id/20161017201558.cr34stc6zvxy3mst@alap3.anarazel.de
that you should try to share only the iteration arrays, I believe that
he meant the results of tbm_begin_iterate() -- that is,
iterator->spageptr, chunkptr, schunkbit, spages, and schunks. I'm
perhaps putting words into his mouth, but what he said was that we
could avoid sharing "the whole underlying hash". But the patch set
you've implemented here doesn't behave that way. Instead, you've got
the array containing the hash table elements shared (which is good)
plus there's a sort of dummy hash table in every worker which copies
some but not all of the members of the original hash table, leading to
the otherwise-unnecessary if-test that Haribabu is complaining about.
So the hash table is kinda-shared-kinda-not-shared, which I don't
*think* is really what Andres had in mind.

In short, I think SH_COPY (implemented here as pagetable_copy) needs
to go away. The work of tbm_begin_iterate() should be done before we
begin the shared scan and the results of that work should be shared.
What happens right now (it appears) is that every backend does that
work based on the same hash table and we just assume they all get the
same answer. And we really do assume that, because
pbms_parallel_iterate() assumes it can shuttle private state back and
forth between iterator in different backends and nothing will break;
but those backends aren't actually using the same iteration arrays.
They are using different iteration arrays that should have the same
contents because they were all derived from the same semi-shared hash
table. That's pretty fragile, and a waste of CPU cycles if the hash
table is large (every backend does its own sort).

On a related note, I think it's unacceptable to make the internal
details of TBMIterator public. You've moved it from tidbitmap.c to
tidbitmap.h so that nodeBitmapHeapScan.c can tinker with the guts of
the TBMIterator, but that's not OK. Those details need to remain
private to tidbitmap.c. pbms_parallel_iterate() is a nasty kludge; we
need some better solution. The knowledge of how a shared iterator
should iterate needs to be private to tidbitmap.c, not off in the
executor someplace. And I think the entries need to actually be
updated in shared memory directly, not copied back and forth between a
backend-private iterator and a shared iterator.

Also, pbms_parallel_iterate() can't hold a spinlock around a call to
tbm_iterate(). Note the coding rules for spinlocks mentioned in
spin.h and src/backend/storage/lmgr/README. I think the right thing
to do here is to use an LWLock in a new tranche (add an entry to
BuiltinTrancheIds).

In 0002, you have this:

+ tb->alloc = palloc(sizeof(SH_ALLOCATOR));

This should be using MemoryContextAlloc(ctx, ...) rather than palloc.
Otherwise the allocator object is in a different context from
everything else in the hash table. But TBH, I don't really see why we
want this to be a separate object. Why not just put
HashAlloc/HashFree/args into SH_TYPE directly? That avoids some
pointer chasing and doesn't really seem to cost anything (except that
SH_CREATE will grow a slightly longer argument sequence).

I think maybe we should rename the functions to element_allocate,
element_free, and element_allocator_ctx, or something like that. The
current names aren't capitalized consistently with other things in
this module, and putting the word "element" in there would make it
more clear what the purpose of this thing is.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dagfinn Ilmari =?utf-8?Q?Manns=C3=A5ker?= 2017-02-01 17:45:10 [PATCH] Add tab completion for DEALLOCATE
Previous Message REIX, Tony 2017-02-01 17:30:16 Re: Deadlock in XLogInsert at AIX