Re: Parallel Seq Scan

From: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>, Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>, Jeff Davis <pgsql(at)j-davis(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Amit Langote <amitlangote09(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Fabrízio Mello <fabriziomello(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Seq Scan
Date: 2015-09-22 07:14:11
Message-ID: CAJrrPGfvkpMqXcOr-xnWSCX4pbVVDaVY_c_R8H2UcRODagwgMg@mail.gmail.com

On Sat, Sep 19, 2015 at 1:45 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Sep 18, 2015 at 4:03 AM, Haribabu Kommi
> <kommi(dot)haribabu(at)gmail(dot)com> wrote:
>> On Thu, Sep 3, 2015 at 8:21 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>>
>>> Attached, find the rebased version of patch.
>>
>> Here are the performance test results:
>
> Thanks, this is really interesting. I'm very surprised by how much
> kernel overhead this shows. I wonder where that's coming from. The
> writes to and reads from the shm_mq shouldn't need to touch the kernel
> at all except for page faults; that's why I chose this form of IPC.
> It could be that the signals which are sent for flow control are
> chewing up a lot of cycles, but if that's the problem, it's not very
> clear from here. copy_user_generic_string doesn't sound like
> something related to signals. And why all the kernel time in
> _spin_lock? Maybe perf -g would help us tease out where this kernel
> time is coming from.

The copy_user_generic_string kernel time comes from file read
operations. In my test, I set shared_buffers to 12GB with a table
size of 18GB.

To reduce the copy_user_generic_string usage, I tried loading all the
pages into shared buffers, testing with both 12GB and 20GB
shared_buffers settings.

The _spin_lock calls come from the signals generated by the workers.
Increasing the tuple queue size changes the kernel call usage.

I have attached the perf reports, collected with the -g option, for
your reference.

> Some of this may be due to rapid context switching. Suppose the
> master process is the bottleneck. Then each worker will fill up the
> queue and go to sleep. When the master reads a tuple, the worker has
> to wake up and write a tuple, and then it goes back to sleep. This
> might be an indication that we need a bigger shm_mq size. I think
> that would be worth experimenting with: if we double or quadruple or
> increase by 10x the queue size, what happens to performance?

I tried multiply factors of 1, 2, 4, 8 and 10 for the tuple queue
size and collected performance readings. A summary of the results:

- There is not much change in the low selectivity cases as the tuple
queue size increases.

- Up to 1.5 million selectivity, the query execution time with any
tuple queue size is ordered 8 workers < 4 workers < 2 workers.

- With a tuple queue multiply factor of 4 (i.e. 4 * tuple queue size),
for selectivity greater than 1.5 million the ordering is
4 workers < 2 workers < 8 workers.

- With a tuple queue multiply factor of 8 or 10, for selectivity
greater than 1.5 million the ordering is
2 workers < 4 workers < 8 workers.

- From the above readings, increasing the tuple queue size benefits
smaller numbers of workers more than larger numbers of workers.

- Maybe the tuple queue size can be calculated automatically based on
the selectivity, average tuple width and number of workers; a rough
sketch of that idea follows this list.

- When the buffers are loaded into shared_buffers using the prewarm
utility, not much scaling is visible as the number of workers
increases.
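
To make that idea concrete, here is a rough C sketch of deriving the
queue size from the planner's estimates instead of a fixed multiply
factor. The 64kB base size, the cap and all names here are my
assumptions, not taken from the patch:

#include <stddef.h>

/* base assumed to match the patch's fixed tuple queue size */
#define TUPLE_QUEUE_BASE_SIZE   65536
/* cap corresponding to the largest multiply factor tested (10x) */
#define TUPLE_QUEUE_MAX_SIZE    (65536 * 10)

static size_t
choose_tuple_queue_size(double est_rows, int avg_tuple_width, int nworkers)
{
    /* bytes each worker is expected to push through its queue */
    double  per_worker_bytes = (est_rows / nworkers) * avg_tuple_width;
    size_t  queue_size = TUPLE_QUEUE_BASE_SIZE;

    /* grow the queue for high-selectivity scans, but stay bounded */
    while (queue_size < per_worker_bytes &&
           queue_size * 2 <= TUPLE_QUEUE_MAX_SIZE)
        queue_size *= 2;

    return queue_size;
}

Since the workers' queues are fixed-size shm_mq ring buffers living in
the DSM segment, any such computation would have to happen before the
segment is sized.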

Performance report is attached for your reference.

Apart from performance, I have the following observations.

Workers are started irrespective of the system load. Suppose the user
configures 16 workers, but because of a sudden increase in system load
only 2 or 3 CPUs are idle. If a query eligible for parallel seq scan
is executed, the backend may still start 16 workers, which can
increase overall system usage and may decrease the performance of the
other backend sessions.
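
One possible mitigation, purely as an illustration (nothing like this
is in the patch), is to clamp the worker count against the CPUs that
are actually idle at launch time:

#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical clamp: never launch more workers than there are idle
 * CPUs right now.  getloadavg() is available on glibc and the BSDs.
 */
static int
clamp_workers_by_load(int requested_workers)
{
    double  load;
    long    ncpus = sysconf(_SC_NPROCESSORS_ONLN);

    if (getloadavg(&load, 1) == 1 && ncpus > 0)
    {
        int     idle_cpus = (int) ncpus - (int) load;

        if (idle_cpus < 1)
            idle_cpus = 1;      /* always allow at least one worker */
        if (requested_workers > idle_cpus)
            return idle_cpus;   /* e.g. 16 requested, 3 idle -> 3 */
    }
    return requested_workers;   /* load unknown: keep configured value */
}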

If a query has two parallel seq scan plan nodes, how will the workers
be distributed across the two nodes? Currently
parallel_seqscan_degree is used per plan node; even if we change it to
per query, I think we need worker distribution logic rather than
letting a single plan node use all the workers.
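
As a strawman for that distribution logic, something along these lines
could split a per-query budget across the parallel scan nodes in
proportion to the amount of data each one scans. All structures and
names here are invented for illustration:

/*
 * Strawman worker distribution: give each parallel scan node a share
 * of the query-wide worker budget proportional to the pages it must
 * scan.  Rounding is ignored for brevity.
 */
typedef struct ParallelScanNode
{
    double  rel_pages;      /* estimated pages in the scanned relation */
    int     nworkers;       /* workers assigned to this node */
} ParallelScanNode;

static void
distribute_workers(ParallelScanNode *nodes, int nnodes, int query_budget)
{
    double  total_pages = 0.0;
    int     i;

    for (i = 0; i < nnodes; i++)
        total_pages += nodes[i].rel_pages;

    for (i = 0; i < nnodes; i++)
    {
        int     share = (int) (query_budget * nodes[i].rel_pages /
                               total_pages);

        /* every parallel node keeps at least one worker */
        nodes[i].nworkers = (share > 0) ? share : 1;
    }
}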

SELECT with a LIMIT clause has a performance drawback with parallel
seq scan in some scenarios: because of the very low selectivity, it
can be slower than a plain seq scan. It would be better if we document
this, so users can take the necessary actions for queries with a LIMIT
clause.

Regards,
Hari Babu
Fujitsu Australia

Attachment Content-Type Size
Parallel_seqscan_perf_test.xlsx application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 19.7 KB
perf_reports.zip application/zip 517.6 KB
