Re: [WIP] speeding up GIN build with parallel workers

From: Oleg Bartunov <obartunov(at)gmail(dot)com>
To: "Constantin S(dot) Pan" <kvapen(at)gmail(dot)com>
Cc: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] speeding up GIN build with parallel workers
Date: 2016-02-17 20:01:47
Message-ID: CAF4Au4yhYB1wUtBa=ETDcNi2FzKgC8X8NOGRbqVhaY9hFDCUdQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 17, 2016 at 6:55 PM, Constantin S. Pan <kvapen(at)gmail(dot)com> wrote:

> On Sat, 16 Jan 2016 01:38:39 +0300
> "Constantin S. Pan" <kvapen(at)gmail(dot)com> wrote:
>
> > Building a GIN index can take a lot of time, and it saturates a single
> > CPU, but we could easily make it use more than one core, especially
> > since we now have parallel workers in Postgres.
> >
> > The process of building a GIN index looks like this (sketched in
> > pseudo-C after the steps):
> >
> > 1. Accumulate a batch of index records into an rbtree in maintenance
> > work memory.
> >
> > 2. Dump the batch to disk.
> >
> > 3. Repeat.
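> >
> > A rough pseudo-C sketch of that serial loop (a sketch of the idea,
> > not the actual source):
> >
> >     while ((htup = heap_getnext(scan, ForwardScanDirection)) != NULL)
> >     {
> >         /* 1. add the tuple's keys to the in-memory rbtree */
> >         ginHeapTupleBulkInsert(buildstate, htup);
> >
> >         /* 2. when the accumulator fills maintenance work memory,
> >          *    dump the batch to disk, then 3. keep scanning */
> >         if (buildstate->accum.allocatedMemory >=
> >             maintenance_work_mem * 1024L)
> >             ginDumpAccumulator(buildstate);
> >     }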
> >
> > I have a draft implementation that divides the whole process among N
> > parallel workers; see the attached patch. Instead of one full scan of
> > the relation, I give each worker its own range of blocks to read.
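> >
> > The range assignment can be as simple as this (a sketch; nworkers and
> > worker_id are names assumed for the illustration):
> >
> >     BlockNumber nblocks = RelationGetNumberOfBlocks(heap);
> >     BlockNumber chunk   = nblocks / nworkers;
> >     BlockNumber first   = worker_id * chunk;
> >     /* the last worker also picks up the remainder */
> >     BlockNumber last    = (worker_id == nworkers - 1)
> >                             ? nblocks
> >                             : first + chunk;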
> >
> > This speeds up the first step N times but slows down the second one:
> > when multiple workers dump item pointers for the same key, each of
> > them has to read and decode the results of the previous one. That is
> > a huge waste, but there is an idea for how to eliminate it.
> >
> > When it comes to dumping the next batch, a worker does not do it
> > independently. Instead, it (and every other worker) sends its
> > accumulated index records to the parent backend in ascending key
> > order. The backend, which receives the records from the workers
> > through shared memory, can merge them and dump each key exactly once,
> > with no need to reread the records N-1 times.
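> >
> > In effect, the backend performs an N-way merge of the per-worker
> > streams. A sketch using the existing binaryheap API (compare_keys,
> > get_record_from_worker, merge_and_dump and the Record struct are
> > assumptions of the illustration):
> >
> >     binaryheap *h = binaryheap_allocate(nworkers, compare_keys, NULL);
> >
> >     /* prime the heap with the first record from each worker */
> >     for (i = 0; i < nworkers; i++)
> >         if ((rec = get_record_from_worker(i)) != NULL)
> >             binaryheap_add(h, PointerGetDatum(rec));
> >
> >     while (!binaryheap_empty(h))
> >     {
> >         rec = (Record *) DatumGetPointer(binaryheap_remove_first(h));
> >         int src = rec->worker_id;
> >
> >         /* records with equal keys come out adjacent, so each key is
> >          * merged and written to the index exactly once */
> >         merge_and_dump(rec);
> >
> >         /* refill the heap from the stream we just consumed */
> >         if ((rec = get_record_from_worker(src)) != NULL)
> >             binaryheap_add(h, PointerGetDatum(rec));
> >     }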
> >
> > In its current state the implementation is just a proof of concept:
> > all the configuration is hardcoded. But it already works as is,
> > though it does not speed up the build more than 4x on my machine
> > (12 CPUs). There is also a problem with temporary tables, for which
> > parallel mode does not work.
>
> Hey Hackers!
>
> I have made some progress on the proposal (see the attached patch):
>
> 0. Moved some repeated code to functions (e.g. ginDumpAccumulator,
> ginbuildCommon).
>
> 1. Implemented merging of the results on the backend.
>
> 2. Disabled parallel mode when creating an index on a temporary table.
> A temporary table lives in backend-local buffers, so workers could not
> read it anyway; no point in using parallel mode there, right?
>
> 3. Added a GUC parameter (gin_parallel_workers) to control the number
> of workers for GIN building; there is a sketch of it after this list.
>
> 4. Hit a speedup limit of 8x and analyzed the reasons (see the
> attached plot or the data file).
>
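> A sketch of the GUC registration in guc.c (the config group, limits
> and default shown here are assumptions; only the parameter name is
> taken from the patch):
>
>     /* in the ConfigureNamesInt[] table */
>     {
>         {"gin_parallel_workers", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
>             gettext_noop("Number of parallel workers for GIN build."),
>             NULL
>         },
>         &gin_parallel_workers,
>         0, 0, MAX_BACKENDS,
>         NULL, NULL, NULL
>     },
>
> With that in place, one can "set gin_parallel_workers = 4;" before
> running CREATE INDEX.
>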
> To analyze the performance issues, I did the following:
>
> create table t (k int, v int[]);
>
> create or replace
> function randarray(width int, low int, high int)
> returns int[] as
> $$
> select array(select (random()*(high-low) + low)::int
> from generate_series(1,width))
> $$ language sql;
>
> insert into t select k, randarray(3000, 0, 100000)
> from generate_series(1, 100000) as k;
>
> create index t_v_idx on t using gin (v);
>
> This creates 100000 arrays of 3000 random numbers each, with the
> numbers drawn from the range [0, 100000]. Then I measure how long the
> GIN build steps take. There are two steps: scan and merge.
>
> The results show that the 'scan' step speeds up perfectly, but the
> 'merge' step takes longer as the number of workers increases. The
> profiler shows that the bottleneck is ginMergeItemPointers(), which I
> use to merge the results.
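>
> For reference, ginMergeItemPointers() (declared in gin_private.h)
> merges two sorted TID arrays into a newly palloc'd one:
>
>     ItemPointer ginMergeItemPointers(ItemPointerData *a, uint32 na,
>                                      ItemPointerData *b, uint32 nb,
>                                      int *nmerged);
>
> Folding the workers' result lists together one at a time with it is
> quadratic: for N lists of length L it copies roughly 2L + 3L + ... +
> NL item pointers, where a single N-way merge pass would copy NL.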
>
> Also, I encountered a problem with workers deadlocking during
> heap_open, but that seems to have been resolved by Robert Haas's
> commit on group locking.
>
> Please leave your feedback!
>

My feedback (Mac OS X 10.11.3):

set gin_parallel_workers=2;
create index message_body_idx on messages using gin(body_tsvector);
LOG: worker process: parallel worker for PID 5689 (PID 6906) was
terminated by signal 11: Segmentation fault
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: FATAL: the
database system is in recovery mode
Failed.

>
> Regards,
>
> Constantin S. Pan
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
