Re: Proposal: speeding up GIN build with parallel workers

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: "Constantin S(dot) Pan" <kvapen(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: speeding up GIN build with parallel workers
Date: 2016-01-17 21:38:45
Message-ID: CAM3SWZT5LPih7a3VZFeDO+VeU5N8zTEgEVQ7rDxRaan2eofXzA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Jan 17, 2016 at 12:03 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> I think it would take a lot of changes to tuple sort to make this be a
> almost-always win.
>
> In the general case each GIN key occurs in many tuples, and the
> in-memory rbtree is good at compressing the tid list for each key to
> maximize the amount of data that can be in memory (although perhaps it
> could be even better, as it doesn't use varbyte encoding on the tid
> list the way the posting lists on disk do--it could do so in the bulk
> build case, where it receives the tid in order, but not feasibly in
> the pending-list clean-up case, where it doesn't get them in order)

Interesting analysis.

> When I was testing building an index on a column of text identifiers
> where there were a couple million identifiers occurring a few dozen
> times each, building it as a gin index (using btree_gin so that plain
> text columns could be indexed) was quite a bit faster than building it
> as a regular btree index using tuple sort. I didn't really
> investigate this difference, but I assume it was due to the better
> memory usage leading to less IO.
>
> (Disclaimer: this was on those identifiers which had little difference
> in the first 8 bytes, so they didn't really benefit from the
> abbreviated keys.)

Sorry for going on about it, but I think you'd be surprised how
effective abbreviated keys are even in seemingly marginal cases.

> So I agree with that a better long term approach would probably be to
> make gin index builds (at least the bulk ones, I don't know about the
> pending-list-cleanup) to use a tuple sort. But it looks like
> Constantin has already done most of the work on his current proposal,
> and no one has volunteered to do the work on converting it to tuple
> sort instead.

I'm not going to stand in the way of incremental progress,
particularly given that this looks to be a simple patch that doesn't
commit us to anything. I suspect that we should consolidate GIN and
nbtree at some point, though. I think that there are some useful
general consequences for performance that come from consolidation. For
example, my ongoing work on external sorting makes it use much of the
same infrastructure as internal sorting. Now, external sorts
automatically benefit from optimizations to internal sorting, like the
"onlyKey" quicksort optimization.

--
Peter Geoghegan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2016-01-17 22:02:34 Re: WIP: Covering + unique indexes.
Previous Message Tom Lane 2016-01-17 20:13:30 Do we need SQL-level access to amoptions functions?