Re: Parallel Seq Scan

From: Thom Brown <thom(at)linux(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Seq Scan
Date: 2015-11-11 23:57:09
Message-ID: CAA-aLv5ykvV=gxi9Xv25sx3aVkWKtHL3NLxZ8v2OS0fUPndgAw@mail.gmail.com
Lists: pgsql-hackers

On 11 November 2015 at 19:51, Thom Brown <thom(at)linux(dot)com> wrote:
> On 11 November 2015 at 19:26, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Wed, Nov 11, 2015 at 12:59 PM, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> wrote:
>>> I have a first query
>>>
>>> I looked at the EXPLAIN ANALYZE output and the numbers of filtered rows
>>> are different.
>>
>> Hmm, I see I was right about people finding more bugs once this was
>> committed. That didn't take long.
>>
>> There's supposed to be code to handle this - see the
>> SharedPlanStateInstrumentation stuff in execParallel.c - but it's
>> evidently a few bricks shy of a load.
>> ExecParallelReportInstrumentation is supposed to transfer the counts
>> from each worker to the DSM:
>>
>> ps_instrument = &instrumentation->ps_instrument[i];
>> SpinLockAcquire(&ps_instrument->mutex);
>> InstrAggNode(&ps_instrument->instr, planstate->instrument);
>> SpinLockRelease(&ps_instrument->mutex);
>>
>> And ExecParallelRetrieveInstrumentation is supposed to slurp those
>> counts back into the leader's PlanState objects:
>>
>> /* No need to acquire the spinlock here; workers have exited already. */
>> ps_instrument = &instrumentation->ps_instrument[i];
>> InstrAggNode(planstate->instrument, &ps_instrument->instr);
>>
>> This might be a race condition, or it might be just wrong logic.
>> Could you test what happens if you insert something like a 1-second
>> sleep in ExecParallelFinish just after the call to
>> WaitForParallelWorkersToFinish()? If that makes the results
>> consistent, this is a race. If it doesn't, something else is wrong:
>> then it would be useful to know whether the workers are actually
>> calling ExecParallelReportInstrumentation, and whether the leader is
>> actually calling ExecParallelRetrieveInstrumentation, and if so
>> whether they are doing it for the correct set of nodes.
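
To recap the mechanism described above: each worker folds its per-node
counters into a shared slot under a spinlock, and the leader folds the
slots back into its own PlanState tree after the workers exit. A
simplified sketch (not the exact execParallel.c source; "shared" stands
in for the instrumentation->ps_instrument array in the quoted snippets):

    /*
     * One shared slot per plan node lives in dynamic shared memory.
     */
    typedef struct SharedNodeInstr
    {
        slock_t         mutex;   /* serialises concurrent worker updates */
        Instrumentation instr;   /* aggregated counters for one plan node */
    } SharedNodeInstr;

    /*
     * Worker side (ExecParallelReportInstrumentation): fold this worker's
     * counters for plan node i into the shared slot, under the spinlock.
     */
    SpinLockAcquire(&shared[i].mutex);
    InstrAggNode(&shared[i].instr, planstate->instrument);
    SpinLockRelease(&shared[i].mutex);

    /*
     * Leader side (ExecParallelRetrieveInstrumentation): after the workers
     * have exited, fold each shared slot back into the leader's own
     * PlanState instrumentation; no lock is needed at that point.
     */
    InstrAggNode(planstate->instrument, &shared[i].instr);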
>
> Hmm... I made the change, but clearly it's not sleeping properly
> (I'm expecting a total runtime in excess of 1 second):
>
> max_parallel_degree = 4:
>
> # explain (analyse, buffers, timing, verbose, costs) select count(*)
> from js where content->'tags'->0->>'term' like 'design%' or
> content->'tags'->0->>'term' like 'web%';
>
>                                                                      QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=49578.18..49578.19 rows=1 width=0) (actual time=797.518..797.518 rows=1 loops=1)
>    Output: count(*)
>    Buffers: shared hit=174883 read=540
>    ->  Gather  (cost=1000.00..49546.93 rows=12500 width=0) (actual time=0.245..784.959 rows=55151 loops=1)
>          Output: content
>          Number of Workers: 4
>          Buffers: shared hit=174883 read=540
>          ->  Parallel Seq Scan on public.js  (cost=0.00..47296.93 rows=12500 width=0) (actual time=0.019..6153.679 rows=94503 loops=1)
>                Output: content
>                Filter: (((((js.content -> 'tags'::text) -> 0) ->> 'term'::text) ~~ 'design%'::text) OR ((((js.content -> 'tags'::text) -> 0) ->> 'term'::text) ~~ 'web%'::text))
>                Rows Removed by Filter: 2051330
>                Buffers: shared hit=299224 read=907
>  Planning time: 0.086 ms
>  Execution time: 803.026 ms
>
>
> max_parallel_degree = 0:
>
> # explain (analyse, buffers, timing, verbose, costs) select count(*)
> from js where content->'tags'->0->>'term' like 'design%' or
> content->'tags'->0->>'term' like 'web%';
>
>                                                                      QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=212867.43..212867.44 rows=1 width=0) (actual time=1278.717..1278.717 rows=1 loops=1)
>    Output: count(*)
>    Buffers: shared hit=174671 read=572
>    ->  Seq Scan on public.js  (cost=0.00..212836.18 rows=12500 width=0) (actual time=0.018..1272.030 rows=55151 loops=1)
>          Output: content
>          Filter: (((((js.content -> 'tags'::text) -> 0) ->> 'term'::text) ~~ 'design%'::text) OR ((((js.content -> 'tags'::text) -> 0) ->> 'term'::text) ~~ 'web%'::text))
>          Rows Removed by Filter: 1197822
>          Buffers: shared hit=174671 read=572
>  Planning time: 0.064 ms
>  Execution time: 1278.741 ms
> (10 rows)
>
> Time: 1279.145 ms
>
>
> I did, however, notice that repeated runs of the query with
> max_parallel_degree = 4 yield different counts of rows removed by
> filter:
>
> Run 1: 2051330
> Run 2: 2081252
> Run 3: 2065112
> Run 4: 2022045
> Run 5: 2025384
> Run 6: 2059360
> Run 7: 2079620
> Run 8: 2058541
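
For reference, Robert's suggested diagnostic amounts to something like
this (a sketch only, not committed code; pg_usleep() takes microseconds):

    /* In ExecParallelFinish(), immediately after waiting for the workers */
    WaitForParallelWorkersToFinish(pei->pcxt);

    /*
     * Diagnostic only: give any in-flight instrumentation updates from
     * the workers a full second to land before the leader reads them back.
     */
    pg_usleep(1000000L);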

Here's another oddity, with max_parallel_degree = 1:

# explain (analyse, buffers, timing, verbose, costs) select count(*)
from js where content->'tags'->>'title' like '%design%';
                                                              QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=132489.34..132489.35 rows=1 width=0) (actual time=382.987..382.987 rows=1 loops=1)
   Output: count(*)
   Buffers: shared hit=175288
   ->  Gather  (cost=1000.00..132488.34 rows=401 width=0) (actual time=382.983..382.983 rows=0 loops=1)
         Output: content
         Number of Workers: 1
         Buffers: shared hit=175288
         ->  Parallel Seq Scan on public.js  (cost=0.00..131448.24 rows=401 width=0) (actual time=379.407..1141.437 rows=0 loops=1)
               Output: content
               Filter: (((js.content -> 'tags'::text) ->> 'title'::text) ~~ '%design%'::text)
               Rows Removed by Filter: 1724810
               Buffers: shared hit=241201
 Planning time: 0.104 ms
 Execution time: 403.045 ms
(14 rows)

Time: 403.596 ms

The actual time reported for the Parallel Seq Scan was 1141.437 ms, yet
the total execution time was only 403.045 ms.

And successive runs with max_parallel_degree = 1 also yield a
different number of rows removed by the filter, as well as a different
number of buffers being hit:

Run: rows removed / buffers hit
1: 1738517 / 243143
2: 1729361 / 241900
3: 1737168 / 242974
4: 1734440 / 242591

Thom
