Re: POC: converting Lists into arrays

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: POC: converting Lists into arrays
Date: 2019-03-04 19:01:33
Message-ID: 20190304190133.vtv7vifuhkaqwh67@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2019-03-02 18:11:43 -0500, Tom Lane wrote:
> On test cases like "pg_bench -S" it seems to be pretty much within the
> noise level of being the same speed as HEAD.

I think that might be because its bottleneck is simply elsewhere
(e.g. the workload is very context-switch heavy, with very few lists of
any length).

FWIW, even just taking context switches out of the equation yields a
~5-6% benefit for a simple statement:

DO $f$BEGIN FOR i IN 1..500000 LOOP EXECUTE $s$SELECT aid, bid, abalance, filler FROM pgbench_accounts WHERE aid = 2045530;$s$;END LOOP;END;$f$;

master:
+ 6.05% postgres postgres [.] AllocSetAlloc
+ 5.52% postgres postgres [.] base_yyparse
+ 2.51% postgres postgres [.] palloc
+ 1.82% postgres postgres [.] hash_search_with_hash_value
+ 1.61% postgres postgres [.] core_yylex
+ 1.57% postgres postgres [.] SearchCatCache1
+ 1.43% postgres postgres [.] expression_tree_walker.part.4
+ 1.09% postgres postgres [.] check_stack_depth
+ 1.08% postgres postgres [.] MemoryContextAllocZeroAligned

patch v3:
+ 5.77% postgres postgres [.] base_yyparse
+ 4.88% postgres postgres [.] AllocSetAlloc
+ 1.95% postgres postgres [.] hash_search_with_hash_value
+ 1.89% postgres postgres [.] core_yylex
+ 1.64% postgres postgres [.] SearchCatCache1
+ 1.46% postgres postgres [.] expression_tree_walker.part.0
+ 1.45% postgres postgres [.] palloc
+ 1.18% postgres postgres [.] check_stack_depth
+ 1.13% postgres postgres [.] MemoryContextAllocZeroAligned
+ 1.04% postgres libc-2.28.so [.] _int_malloc
+ 1.01% postgres postgres [.] nocachegetattr

And even just pgbenching the EXECUTEd statement above gives me a
reproducible ~3.5% gain when using -M simple, and ~3% when using -M
prepared.

Note that when not using prepared statements (a pretty important
workload, especially as long as we don't have a pooling solution that
actually allows using prepared statements across connections), even
after the patch most of the allocator overhead still comes from list
allocations, but it's now near exclusively just the "create a new list"
case:

+ 5.77% postgres postgres [.] base_yyparse
- 4.88% postgres postgres [.] AllocSetAlloc
- 80.67% AllocSetAlloc
- 68.85% AllocSetAlloc
- 57.65% palloc
- 50.30% new_list (inlined)
- 37.34% lappend
+ 12.66% pull_var_clause_walker
+ 8.83% build_index_tlist (inlined)
+ 8.80% make_pathtarget_from_tlist
+ 8.73% get_quals_from_indexclauses (inlined)
+ 8.73% distribute_restrictinfo_to_rels
+ 8.68% RewriteQuery
+ 8.56% transformTargetList
+ 8.46% make_rel_from_joinlist
+ 4.36% pg_plan_queries
+ 4.30% add_rte_to_flat_rtable (inlined)
+ 4.29% build_index_paths
+ 4.23% match_clause_to_index (inlined)
+ 4.22% expression_tree_mutator
+ 4.14% transformFromClause
+ 1.02% get_index_paths
+ 17.35% list_make1_impl
+ 16.56% list_make1_impl (inlined)
+ 15.87% lcons
+ 11.31% list_copy (inlined)
+ 1.58% lappend_oid
+ 12.90% expression_tree_mutator
+ 9.73% get_relation_info
+ 4.71% bms_copy (inlined)
+ 2.44% downcase_identifier
+ 2.43% heap_tuple_untoast_attr
+ 2.37% add_rte_to_flat_rtable (inlined)
+ 1.69% btbeginscan
+ 1.65% CreateTemplateTupleDesc
+ 1.61% core_yyalloc (inlined)
+ 1.59% heap_copytuple
+ 1.54% text_to_cstring (inlined)
+ 0.84% ExprEvalPushStep (inlined)
+ 0.84% ExecInitRangeTable
+ 0.84% scanner_init
+ 0.83% ExecInitRangeTable
+ 0.81% CreateQueryDesc
+ 0.81% _bt_search
+ 0.77% ExecIndexBuildScanKeys
+ 0.66% RelationGetIndexScan
+ 0.65% make_pathtarget_from_tlist

Given how hard it is to improve performance when costs are as flatly
distributed as in the above profiles, I actually think these are quite
promising results.

I'm not even convinced that it makes all that much sense to measure
end-to-end performance here; it might be worthwhile to measure with a
debugging function that allows exercising parsing, parse analysis,
rewriting etc. at configurable loop counts. Given the relatively evenly
distributed profiles, we're going to have to make a few different
improvements to make headway, and it's hard to see the benefits of
individual ones if you only look at the overall numbers.

Greetings,

Andres Freund
