Reducing the runtime of the core regression tests

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Reducing the runtime of the core regression tests
Date: 2019-04-10 22:35:15
Message-ID: 735.1554935715@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Back in [1] I wrote

> I've wondered for some time whether we couldn't make a useful
> reduction in the run time of the PG regression tests by looking
> for scripts that run significantly longer than others in their
> parallel groups, and making an effort to trim the runtimes of
> those particular scripts.

I finally got some time to pursue that, and attached is a proposed patch
that moves some tests around and slightly adjusts some other ones.
To cut to the chase: on my workstation, this cuts the time for
"make installcheck-parallel" from 21.9 sec to 13.9 sec, or almost 40%.
I think that's a worthwhile improvement, considering how often all of us
run those tests.

Said workstation is an 8-core machine, so an objection could made
that maybe I'm optimizing too much for multicore. But even laptops
have multiple cores these days. To check ostensibly-worse cases,
I also tried this patch on dromedary's host (old dual-core Intel), and
found that installcheck-parallel went from about 92 seconds to about 82.
On gaur's host (single-core HPPA), the time went from 840 sec to 774.
So there's close to 10% savings even on very lame machines.

In no particular order, here's what I did:

* Move the strings and numerology tests to be part of the second
parallel test group; there is no reason to run them serially.

* Move the insert and insert_conflict tests to be part of the "copy"
parallel group. There is no reason to run them serially, plus they
were obviously placed with the aid of a dartboard, or at least without
concern for fixing comments one line away.

* Move the select and errors tests into the preceding parallel group
instead of running them serially. (This required adjusting the
constraints test, which uses a table named "tmp" as select also does.
I fixed that by making it a temp table in the constraints test.)

* create_index.sql ran much longer than other tests in its parallel
group, so I split out the SP-GiST-related tests into a new file
create_index_spgist.sql, and moved the delete_test_table test case
to btree_index.sql.

* Likewise, join.sql needed to be split up, so I moved the "exercises
for the hash join code" portion into a new file join_hash.sql.

* Likewise, I split up indexing.sql by moving the "fastpath" test into
a new file index_fastpath.sql.

* psql and stats_ext both ran considerably longer than other tests
in their group. I fixed that by moving them into the next parallel
group, where the rules test has a similar runtime. (To make it
safe to run stats_ext in parallel with rules, I adjusted the latter
to only dump views/rules from the pg_catalog and public schemas,
which was what it was doing anyway. stats_ext makes some views in
a transient schema, which now will not affect rules.)

* The plpgsql test ran much longer than others, which turns out to be
largely due to the 2-second timeout in its test of statement_timeout.
In view of the experience reflected in commit f1e671a0b, just
reducing that timeout seems unsafe. What I did instead was to shove
that test case and some related ones into a new plpgsql test file,
src/pl/plpgsql/src/sql/plpgsql_trap.sql, so that it's not part of the
core regression tests at all. (We've talked before about moving
chunks of plpgsql.sql into the plpgsql module, so this is sort of a
down payment on that.) Now, if you think about the time to do
check-world rather than just the core regression tests, this isn't
obviously a win, and in fact it might be a loss because the plpgsql
tests run serially not in parallel with anything else. However,
by that same token, the parallel-testing overload we were concerned
about in f1e671a0b should be much less bad in the plpgsql context.
I therefore took a chance on reducing the timeout down to 1 second.
If the buildfarm doesn't like that, we can change it back to 2 seconds
again. It should still be a net win because of the fact that
check-world runs the core tests more than once.

* Another thing I changed in the SP-GiST tests was to adjust the tests
that are trying to verify that KNN indexscan gives the same ordering
as seqscan-and-sort. Those were using FULL JOIN to match up rank()
results, which is horribly inefficient on this data set, because there
are 1000 duplicate entries in quad_point_tbl and hence 1000 rows with
the same rank; we proceed to form 1000000 join rows that we then have
to filter away again. What I did about that was to replace rank()
with row_number() so that the primary join key is unique, shaving well
over a second off the test's runtime. There is a small problem, namely
that the data set has two points that are different but yet have exactly
the same distance to the origin, causing their sort ordering to be
underdetermined. I think however that it's okay to simplify the queries
so that they just verify that we get the same values and ordering of the
distance results. The purpose of this test is not to see whether <->
gets the right answer, it is to see whether SP-GiST can return results
in the correct order according to <->, so I think it's okay to compare
only the distances and not the underlying points.

* Also, in polygon.sql, I removed quad_poly_tbl_ord_seq1 and
quad_poly_tbl_ord_idx1; the related queries are very expensive and
it's not clear what coverage they provide that isn't provided by
the near-duplicate tests involving quad_poly_tbl_ord_seq2 and
quad_poly_tbl_ord_idx2. (Note: polygon.sql seems to run proportionally
much slower on some machines than others. Unpatched, on my workstation
it's 3x slower than timestamptz, whereas on say longfin it's a good bit
faster. It might be interesting to look into why that is. But anyway,
this part of the patch benefits machines where it's slower.)

There are still a few tests that seem like maybe it'd be worth trimming,
but I felt like I'd hit a point of diminishing returns, so I stopped
here.

Thoughts? Anyone object to making these sorts of changes
post-feature-freeze?

regards, tom lane

[1] https://postgr.es/m/16646.1549770618@sss.pgh.pa.us

Attachment Content-Type Size
rearrange-regression-tests-for-more-parallelism.patch text/x-diff 292.6 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2019-04-10 22:48:30 Re: Reducing the runtime of the core regression tests
Previous Message Thomas Munro 2019-04-10 22:14:29 Re: B-tree cache prefetches