Re: Why JIT speed improvement is so modest?

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: guangnan he <gnhe2009(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why JIT speed improvement is so modest?
Date: 2019-11-28 09:25:14
Message-ID: df689f9a-cc13-3dfe-35af-1a6b99c1ac60@postgrespro.ru
Lists: pgsql-hackers

Hi,

On 28.11.2019 10:36, guangnan he wrote:
> Hi,
> you mean that if we don't add new compiler options, the compiler will
> do the loop unrolling and use SIMD automatically?

Yes, most modern compilers do this.
GCC requires the -O3 option (-O2 is not enough), but clang applies such
optimizations even at -O2.
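
For example, a trivial accumulation loop like the one below can be
auto-vectorized (just a minimal illustration, not VOPS code; note that
for floating-point reductions the compiler additionally needs
-ffast-math or similar to be allowed to reorder the additions):

    #include <stddef.h>

    /* gcc -O3 (or clang -O2) can unroll this loop and use SIMD;
     * -ffast-math additionally permits reassociating the float adds. */
    float sum_floats(const float *v, size_t n)
    {
        float sum = 0.0f;
        size_t i;

        for (i = 0; i < n; i++)
            sum += v[i];
        return sum;
    }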

But Postgres uses the more sophisticated Youngs-Cramer algorithm for
calculating SUM/AVG aggregates, and here SIMD instructions do not help much.
My original assumption was that the huge difference in speed between
VOPS/ISPRAS JIT and vanilla JIT could be explained by this difference in
the accumulation algorithm.
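
For reference, the per-row transition step of the Youngs-Cramer
accumulation (a simplified sketch of what float8_accum does, not the
exact Postgres code; the state is (N, Sx, Sxx)) looks like this:

    /* The loop-carried dependencies through Sx and Sxx and the
     * per-row division are what make this hard to speed up with SIMD. */
    static void
    youngs_cramer_accum(double *N, double *Sx, double *Sxx, double newval)
    {
        double n = *N + 1.0;
        double sx = *Sx + newval;

        if (n > 1.0)
        {
            double tmp = newval * n - sx;

            *Sxx += tmp * tmp / (n * (n - 1.0));
        }
        *N = n;
        *Sx = sx;
    }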

This is why I implemented the calculation of AVG in VOPS using the
Youngs-Cramer algorithm.
And it certainly affects performance: Q1 with SUM aggregates is executed
by VOPS almost three times faster than with AVG aggregates (700 msec vs.
2000 msec).
But even with the Youngs-Cramer algorithm VOPS is 6 times faster than
standard Postgres with JIT and 5 times faster than my in-memory storage.

> Besides the function calls, cache misses, etc., I think that with VOPS
> the call stack is also squeezed, but the JIT-optimized code still
> processes rows one by one.
If we do not take into account the overhead of heap traversal and tuple
packing, then the amount of computation doesn't depend on the data
model, whether it is vertical or horizontal.
By implementing an in-memory storage which just keeps unpacked tuples in
an L2 list in the backend's private memory, and so spends no time on
unpacking or visibility checks,
I wanted to exclude this overhead and reach almost the same speed as VOPS.
But that didn't happen.
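
Conceptually this storage is as trivial as the following (a hypothetical
sketch with illustrative names, not the actual table access method code):

    #include <stddef.h>

    /* Tuples are kept already unpacked in a list in backend-private
     * memory, so a sequential scan is just pointer chasing: no page
     * access, no tuple deforming, no MVCC visibility checks. */
    typedef struct InmemTuple
    {
        struct InmemTuple *next;
        double values[];        /* already-unpacked column values */
    } InmemTuple;

    static const InmemTuple *
    inmem_getnext(const InmemTuple *cur)
    {
        return cur ? cur->next : NULL;  /* the whole "scan" step */
    }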

>
> Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> wrote on
> Thu, Nov 28, 2019 at 3:08 PM:
>
>
>
> On 27.11.2019 19:05, Tomas Vondra wrote:
> > On Wed, Nov 27, 2019 at 06:38:45PM +0300, Konstantin Knizhnik wrote:
> >>
> >>
> >> On 25.11.2019 18:24, Merlin Moncure wrote:
> >>> On Mon, Nov 25, 2019 at 9:09 AM Konstantin Knizhnik
> >>> <k(dot)knizhnik(at)postgrespro(dot)ru> wrote:
> >>>> Why was JIT not able to significantly (several times) increase
> >>>> speed on the Q1 query?
> >>>> The experiment with VOPS shows that the aggregation algorithm
> >>>> itself is not a bottleneck.
> >>>> The profile also gives no answer to this question.
> >>>> Any ideas?
> >>> Well, in the VOPS variant around 2/3 of the time is spent in
> >>> routines that are obviously aggregation.  In the JIT version, it's
> >>> around 20%.  So this suggests that the replacement execution
> >>> engine is more invasive.  I would also guess (!) that the VOPS
> >>> engine optimizes fewer classes of query plan.  ExecScan, for
> >>> example, looks to be completely optimized out in VOPS but is still
> >>> utilized in the JIT engine.
> >>
> >> The difference in the fraction of time spent in aggregate
> >> calculation is not so large (2 times vs. 10 times).
> >> I suspected that a lot of time is spent in the relation traversal
> >> code, tuple unpacking and visibility checks.
> >> To check this hypothesis I have implemented an in-memory table
> >> access method which stores tuples in unpacked form and doesn't
> >> perform any visibility checks at all.
> >> Results were not so exciting. I had to disable parallel execution
> >> (because it is not possible for tuples stored in backend-private
> >> memory).
> >> Results are the following:
> >>
> >> lineitem:        13736 msec
> >> inmem_lineitem:  10044 msec
> >> vops_lineitem:    1945 msec
> >>
> >> The profile of inmem_lineitem is the following:
> >>
> >>   16.79%  postgres  postgres             [.] float4_accum
> >>   12.86%  postgres  postgres             [.] float8_accum
> >>    5.83%  postgres  postgres             [.] TupleHashTableHash.isra.8
> >>    4.44%  postgres  postgres             [.] lookup_hash_entries
> >>    3.37%  postgres  postgres             [.] check_float8_array
> >>    3.11%  postgres  postgres             [.] tuplehash_insert
> >>    2.91%  postgres  postgres             [.] hash_uint32
> >>    2.83%  postgres  postgres             [.] ExecScan
> >>    2.56%  postgres  postgres             [.] inmem_getnextslot
> >>    2.22%  postgres  postgres             [.] FunctionCall1Coll
> >>    2.14%  postgres  postgres             [.] LookupTupleHashEntry
> >>    1.95%  postgres  postgres             [.] TupleHashTableMatch.isra.9
> >>    1.76%  postgres  postgres             [.] pg_detoast_datum
> >>    1.58%  postgres  postgres             [.] AggCheckCallContext
> >>    1.57%  postgres  postgres             [.] tts_minimal_clear
> >>    1.35%  postgres  perf-3054.map        [.] 0x00007f558db60010
> >>    1.23%  postgres  postgres             [.] fetch_input_tuple
> >>    1.15%  postgres  postgres             [.] SeqNext
> >>    1.06%  postgres  postgres             [.] ExecAgg
> >>    1.00%  postgres  postgres             [.] tts_minimal_store_tuple
> >>
> >> So now the fraction of time spent in aggregation has increased to
> >> 30% (vs. 20% for lineitem and 42% for vops_lineitem).
> >> Looks like the main bottleneck now is hashagg. VOPS accesses the
> >> hash about 10 times less often (because it accumulates values for
> >> the whole tile), and that explains the still large difference
> >> between vops_lineitem and inmem_lineitem.
> >>
> >> If we remove grouping (the source of the hashagg overhead) and
> >> rewrite the Q1 query as:
> >> select
> >>     avg(l_quantity) as sum_qty,
> >>     avg(l_extendedprice) as sum_base_price,
> >>     avg(l_extendedprice*(1-l_discount)) as sum_disc_price,
> >>     avg(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
> >>     avg(l_quantity) as avg_qty,
> >>     avg(l_extendedprice) as avg_price,
> >>     avg(l_discount) as avg_disc,
> >>     count(*) as count_order
> >> from
> >>     inmem_lineitem
> >> where
> >>     l_shipdate <= '1998-12-01';
> >>
> >> then the results are the following:
> >> lineitem:        9805 msec
> >> inmem_lineitem:  6257 msec
> >> vops_lineitem:   1865 msec
> >>
> >> and now the profile of inmem_lineitem is:
> >>
> >>   25.27%  postgres  postgres           [.] float4_accum
> >>   21.86%  postgres  postgres           [.] float8_accum
> >>    5.49%  postgres  postgres           [.] check_float8_array
> >>    4.57%  postgres  postgres           [.] ExecScan
> >>    2.61%  postgres  postgres           [.] AggCheckCallContext
> >>    2.30%  postgres  postgres           [.] pg_detoast_datum
> >>    2.10%  postgres  postgres           [.] inmem_getnextslot
> >>    1.81%  postgres  postgres           [.] SeqNext
> >>    1.73%  postgres  postgres           [.] fetch_input_tuple
> >>    1.61%  postgres  postgres           [.] ExecAgg
> >>    1.23%  postgres  postgres           [.] MemoryContextReset
> >>
> >> But there is still a more than 3x difference with VOPS!
> >> Something is wrong here...
> >>
> >
> > I have no idea what VOPS does, but IIRC one of the bottlenecks
> > compared to various column stores is our iterative execution model,
> > which makes it difficult/impossible to vectorize operations. That's
> > likely why the accum functions are so high in the CPU profile.
> >
> > regards
> >
>
> VOPS does a very simple thing: it replaces scalar types with vector
> types (tiles) and defines all standard operations for them.
> It also provides Postgres aggregates for these types.
> So while for a normal Postgres table the query
>
> select sum(x) from T;
>
> calls float4_accum for each row of T, the same query in VOPS will call
> vops_float4_avg_accumulate for each tile, which contains 64 elements.
> So vops_float4_avg_accumulate is called 64 times less often than
> float4_accum.
> And internally it contains a straightforward loop:
>
>              for (i = 0; i < TILE_SIZE; i++) {
>                  sum += opd->payload[i];
>              }
>
> which can be optimized by the compiler (loop unrolling, use of SIMD
> instructions, ...).
> So it is no wonder that VOPS is faster than the Postgres executor.
> But Postgres now contains JIT, and it is used in this case, so the
> interpretation overhead of the executor should be mostly eliminated
> by JIT.
> In theory, perfect JIT code should process rows of the horizontal data
> model at the same speed as a vector executor processing columns of the
> vertical data model.
> The vertical model provides significant advantages when a query
> affects only a small fraction of columns.
> But in the case of Q1 we are calculating 8 aggregates over just 4
> columns, and inmem_lineitem is actually a projection of the original
> lineitem table containing only the columns needed for this query.
> So the amount of fetched data in this case is almost the same for the
> horizontal and vertical data models, and CPU cache effects should not
> play a significant role either.
> That is why it is not quite clear to me why there is still a big
> difference (3 times) between VOPS and the in-memory table, and not so
> large a difference between normal and in-memory tables.
>
> Concerning the large percentage of time spent in the accumulate
> functions - I do not agree with you. What this query is actually doing
> is just calculating aggregates.
> The smaller the interpretation overhead, the larger the percentage of
> time we should spend in the aggregate functions.
> Maybe the whole infrastructure of Postgres aggregates adds too much
> overhead (check_float8_array, function calls, ...), and in the case of
> VOPS this overhead is divided by 64.
>
>
> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
> --
> Guang-Nan He
>

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
