Fix for PL/Python slow input arrays traversal issue

From: Alexey Grishchenko <agrishchenko(at)pivotal(dot)io>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Alexey Grishchenko <programmerag(at)gmail(dot)com>
Subject: Fix for PL/Python slow input arrays traversal issue
Date: 2016-07-28 12:55:30
Message-ID: CAH38_tkwA5qgLV8zPN1OpPzhtkNKQb30n3xq-2NR9jUfv3qwHA@mail.gmail.com
Lists: pgsql-hackers

Hi

The following issue exists in PL/Python: when a function takes an array as
an input parameter, processing an array of fixed-size elements that contains
null values is many times slower than processing the same array without
nulls. Here is an example:

-- Function

create or replace function test(a int8[]) returns int8 as $BODY$
return sum([x for x in a if x is not None])
$BODY$ language plpythonu volatile;

pl_regression=# select test(array_agg(a)::int8[])
pl_regression-# from (
pl_regression(# select generate_series(1,100000) as a
pl_regression(# ) as q;
test
------------
5000050000
(1 row)

Time: 22.248 ms
pl_regression=# select test(array_agg(a)::int8[])
pl_regression-# from (
pl_regression(# select generate_series(1,100000) as a
pl_regression(# union all
pl_regression(# select null::int8 as a
pl_regression(# ) as q;
test
------------
5000050000
(1 row)

Time: 7179.921 ms

As you can see, a single null in the array introduces a roughly 320x
slowdown. The reason is as follows.
The original implementation calls array_ref for each element of the array.
Each call to array_ref leads to a call to array_seek. The array_seek function
has a shortcut for arrays of fixed-size elements with no nulls. But if the
array's elements are not fixed-size, or if the array contains nulls, each
call to array_seek computes the offset of the Kth element by walking from
the first element. Fetching all N elements this way is an O(N^2) algorithm,
which results in high processing time for arrays of variable-size elements
and for arrays with nulls.
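
To illustrate the cost described above, here is a minimal Python sketch. It
is a toy model, not PostgreSQL code: the names ELEM_SIZE, offset_by_seek and
total_walk_per_element_seek are made up for this example, and the array is
modelled only as a list of null flags with 8-byte elements.

```python
ELEM_SIZE = 8  # assumed size in bytes of one int8 element in this toy model


def offset_by_seek(is_null, k):
    """Find the byte offset of element k by walking from element 0.

    Returns (offset, elements_walked). Nulls occupy no space in the data
    area, so every earlier element has to be inspected.
    """
    offset = 0
    for i in range(k):
        if not is_null[i]:
            offset += ELEM_SIZE
    return offset, k


def total_walk_per_element_seek(is_null):
    """Total elements walked when every element is located independently."""
    return sum(offset_by_seek(is_null, k)[1] for k in range(len(is_null)))
```

In this model, locating every element of an N-element array walks
N*(N-1)/2 elements in total, so 100001 elements cost about 5 billion steps.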

The fix I propose applies the same logic that the array_out function uses
for efficient array traversal: it keeps a pointer to the offset of the last
fetched element, which results in a dramatic performance improvement for
the affected cases. With this implementation, arrays of fixed-size elements
without nulls, arrays of fixed-size elements with nulls, and arrays of
variable-size elements are all processed at the same speed. Here is the test
after the fix is applied:

pl_regression=# select test(array_agg(a)::int8[])
pl_regression-# from (
pl_regression(# select generate_series(1,100000) as a
pl_regression(# ) as q;
test
------------
5000050000
(1 row)

Time: 21.056 ms
pl_regression=# select test(array_agg(a)::int8[])
pl_regression-# from (
pl_regression(# select generate_series(1,100000) as a
pl_regression(# union all
pl_regression(# select null::int8 as a
pl_regression(# ) as q;
test
------------
5000050000
(1 row)

Time: 22.839 ms
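
The idea behind the fix can be sketched in Python as a single pass that
carries the offset forward. This is only an illustration under simplifying
assumptions (a list of null flags, 8-byte elements), not the code in the
attached patch; ELEM_SIZE and iter_offsets are hypothetical names.

```python
ELEM_SIZE = 8  # assumed size in bytes of one int8 element in this toy model


def iter_offsets(is_null):
    """Yield (index, offset) for each element in one pass.

    The offset of the next non-null element is remembered between
    iterations, so no element is ever visited twice. Null elements
    yield None because they occupy no space in the data area.
    """
    offset = 0
    for i, null in enumerate(is_null):
        if null:
            yield i, None
        else:
            yield i, offset
            offset += ELEM_SIZE
```

Each element is visited exactly once, so the total work is O(N) whether or
not the array contains nulls.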

--
Best regards,
Alexey Grishchenko

Attachment: 0001-Fix-for-PL-Python-slow-input-arrays-traversal-issue.patch (application/octet-stream, 3.1 KB)
