Re: Proposal: speeding up GIN build with parallel workers

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>
Cc: "Constantin S(dot) Pan" <kvapen(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: speeding up GIN build with parallel workers
Date: 2016-09-14 06:48:43
Message-ID: 0a0df4f6-66bb-aeaf-c7a7-60d6d4abaabb@iki.fi
Lists: pgsql-hackers

On 01/17/2016 10:03 PM, Jeff Janes wrote:
> On Fri, Jan 15, 2016 at 3:29 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>> On Fri, Jan 15, 2016 at 2:38 PM, Constantin S. Pan <kvapen(at)gmail(dot)com> wrote:
>>> I have a draft implementation which divides the whole process between
>>> N parallel workers, see the patch attached. Instead of a full scan of
>>> the relation, I give each worker a range of blocks to read.
>>
>> I am currently working on a patch that allows B-Tree index builds to
>> be performed in parallel. I think I'm a week or two away from posting
>> it.
>>
>> Even without parallelism, wouldn't it be better if GIN indexes were
>> built using tuplesort? I know way way less about the gin am than the
>> nbtree am, but I imagine that a prominent cost for GIN index builds is
>> constructing the main B-Tree (the one that's constructed over key
>> values) itself. Couldn't tuplesort.c be adapted to cover this case?
>> That would be much faster in general, particularly with the recent
>> addition of abbreviated keys, while also leaving a clear path forward
>> to performing the build in parallel.
>
> I think it would take a lot of changes to tuplesort to make this an
> almost-always win.
>
> In the general case each GIN key occurs in many tuples, and the
> in-memory rbtree is good at compressing the tid list for each key to
> maximize the amount of data that can be in memory. (Although perhaps
> it could be even better: it doesn't use varbyte encoding on the tid
> list the way the posting lists on disk do. It could do so in the bulk
> build case, where it receives the tids in order, but not feasibly in
> the pending-list clean-up case, where it doesn't get them in order.)
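
For reference, the varbyte delta encoding Jeff mentions boils down to
something like this. This is a simplified, standalone sketch of the
technique, not the exact on-disk format in ginpostinglist.c:

#include <stdint.h>

/*
 * Encode a sorted TID list as varbyte deltas: each TID is stored as
 * its difference from the previous one, 7 bits per byte, with the
 * high bit set on every byte except the last one of each delta.
 * Small deltas (dense TID lists) thus take a single byte.
 */
static unsigned char *
encode_varbyte_deltas(const uint64_t *tids, int ntids, unsigned char *dst)
{
    uint64_t    prev = 0;

    for (int i = 0; i < ntids; i++)
    {
        uint64_t    delta = tids[i] - prev;     /* requires sorted input */

        while (delta >= 0x80)
        {
            *dst++ = (unsigned char) ((delta & 0x7F) | 0x80);
            delta >>= 7;
        }
        *dst++ = (unsigned char) delta;
        prev = tids[i];
    }
    return dst;     /* points just past the encoded output */
}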
>
> When I was testing building an index on a column of text identifiers
> where there were a couple million identifiers occurring a few dozen
> times each, building it as a gin index (using btree_gin so that plain
> text columns could be indexed) was quite a bit faster than building it
> as a regular btree index using tuplesort. I didn't really
> investigate this difference, but I assume it was due to the better
> memory usage leading to less I/O.

I've been wondering about this too, for a long time: using tuplesort
to build GIN indexes. Surely the highly-optimized quicksort+merge code
in tuplesort.c is faster than maintaining a red-black tree? Or so I
thought.

I wrote a quick prototype of using tuplesort.c for GIN index build. I
tested it with a 500 MB table of integer arrays, so that the sort fit
completely in memory. It's a lot slower than the current code. It turns
out eliminating the duplicates early is really really important.
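
To put numbers on that: with the current build code, a key that occurs
N times costs one tree entry plus N TIDs, while a naive tuplesort feed
costs N full sort tuples, each carrying its own copy of the key. Here
is a rough sketch of the accumulation step; the real code lives in
ginbulk.c and uses a red-black tree, for which a linear array stands
in here:

#include <stdint.h>
#include <stdlib.h>

/* One entry per distinct key, holding every heap TID seen for it. */
typedef struct
{
    int64_t     key;        /* stand-in for an extracted GIN key */
    uint64_t   *tids;       /* heap TIDs, arriving in order in bulk build */
    int         ntids;
    int         maxtids;
} KeyEntry;

static void
accumulate(KeyEntry *entries, int *nentries, int64_t key, uint64_t tid)
{
    KeyEntry   *e = NULL;

    for (int i = 0; i < *nentries; i++)     /* an rbtree lookup, really */
    {
        if (entries[i].key == key)
        {
            e = &entries[i];
            break;
        }
    }

    if (e == NULL)                          /* first occurrence of the key */
    {
        e = &entries[(*nentries)++];
        e->key = key;
        e->ntids = 0;
        e->maxtids = 8;
        e->tids = malloc(e->maxtids * sizeof(uint64_t));
    }

    if (e->ntids == e->maxtids)
    {
        e->maxtids *= 2;
        e->tids = realloc(e->tids, e->maxtids * sizeof(uint64_t));
    }
    e->tids[e->ntids++] = tid;  /* stays sorted: bulk build sees TIDs in order */
}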

So we want to keep the red-black tree, to eliminate the duplicates. Or
add that capability to tuplesort.c, which might speed up Sort+Unique and
aggregates in general, but that's a big effort.
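
If tuplesort.c did grow that capability, one natural place for it
would be right after quicksorting each run: collapse adjacent tuples
with equal keys into one tuple before the run is written out. A rough
sketch of the idea; GinSortTuple is a made-up type, nothing like it
exists in tuplesort.c today:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* A sort tuple carrying a GIN key and the TIDs collected for it. */
typedef struct
{
    int64_t     key;
    uint64_t   *tids;
    int         ntids;
} GinSortTuple;

/*
 * Collapse adjacent equal-key tuples in a run that has just been
 * sorted on (key, first TID), and return the new run length. That
 * sort order guarantees that concatenating the TID lists keeps them
 * sorted.
 */
static int
merge_duplicates(GinSortTuple *run, int ntuples)
{
    int         out = 0;

    if (ntuples == 0)
        return 0;

    for (int i = 1; i < ntuples; i++)
    {
        if (run[i].key == run[out].key)
        {
            GinSortTuple *dst = &run[out];

            dst->tids = realloc(dst->tids,
                                (dst->ntids + run[i].ntids) * sizeof(uint64_t));
            memcpy(dst->tids + dst->ntids,
                   run[i].tids, run[i].ntids * sizeof(uint64_t));
            dst->ntids += run[i].ntids;
            free(run[i].tids);
        }
        else
            run[++out] = run[i];
    }
    return out + 1;
}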

However, I still wonder if we shouldn't use a merge approach, like
tuplesort.c does, when the tree doesn't fit in memory. Currently, when
the tree is full, we flush it out to the index by inserting all the
entries. That can get quite expensive, I/O-wise. It also generates
more WAL, compared to writing each page only once.

If we flushed the tree to a tape instead, then we could perhaps use the
machinery that Peter's parallel B-tree patch is adding to tuplesort.c,
to merge the tapes. I'm not sure if that works out, but I think it's
worth some experimentation.
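
The merge step would then look like an ordinary tape merge, except
that entries with equal keys coming from different runs get their
posting lists combined instead of being emitted separately. A sketch
for just two in-memory runs (the real thing would be a k-way merge
over tapes, and emit() here just stands in for writing one index
entry):

#include <stdint.h>
#include <stdio.h>

typedef struct
{
    int64_t     key;
    uint64_t   *tids;
    int         ntids;
} RunEntry;

/* Stand-in for writing one (key, posting list) entry to the index. */
static void
emit(const RunEntry *first, const RunEntry *second)
{
    int         ntids = first->ntids + (second ? second->ntids : 0);

    printf("key %lld: %d tids\n", (long long) first->key, ntids);
}

static void
merge_two_runs(RunEntry *a, int na, RunEntry *b, int nb)
{
    int         i = 0,
                j = 0;

    while (i < na && j < nb)
    {
        if (a[i].key < b[j].key)
            emit(&a[i++], NULL);
        else if (b[j].key < a[i].key)
            emit(&b[j++], NULL);
        else
        {
            /*
             * Same key in both runs. Each flushed run covers a later
             * range of the heap than the previous one, so run a's TIDs
             * all precede run b's and the lists simply concatenate.
             */
            emit(&a[i++], &b[j++]);
        }
    }
    while (i < na)
        emit(&a[i++], NULL);
    while (j < nb)
        emit(&b[j++], NULL);
}

Each output entry is written once, which is where the I/O and WAL
savings over re-inserting into the index would come from.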

> But I do think this, with aggregation, would be worth it despite the
> amount of work involved. In addition to automatically benefiting from
> parallel sorts and any other future sort improvements, I think it
> would also generate better indexes. I have a theory that a problem
> with very large gin indexes is that repeatedly building
> maintenance_work_mem worth of rbtree entries and then merging them to
> disk yields highly fragmented btrees, in which logically adjacent
> key-space does not occupy physically nearby leaf pages, which can
> then cause problems either with access that follows a pattern (like
> range scans, except gin indexes can't really do those anyway) or
> further bulk operations.

Yeah, there's that too.

- Heikki
