Re: Running tests under valgrind is getting slower at an alarming pace

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Running tests under valgrind is getting slower at an alarming pace
Date: 2021-10-06 16:47:54
Message-ID: 20211006164754.7fytp2r3va6sh66m@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2021-10-06 12:09:36 -0400, Robert Haas wrote:
> Is the problem here that we're adding a lot of new test cases? Or
> is the problem that valgrind runs are getting slower for the same
> number of test cases?

I don't know precisely. It's probably a combination of several factors. I do
think something regressed around valgrind specifically - leak origin tracking
seems to work better in the older branches than in the newer ones. But I don't
know whether that affects performance.

> If it's taking longer because we have more test cases, I'm honestly
> not sure that's really something we should try to fix.

I'm not arguing for having fewer tests. But clearly, executing them serially
is problematic when the times are going up like this. Skink is hosted on a
machine whose CPU clocks around ~3.9GHz for most of the test run - getting a
faster machine won't help much, because most of the time only a few cores are
active.

This isn't just a problem with valgrind; the reporting times for other
animals aren't getting shorter either...

It takes my workstation 2min20s to execute check-world in parallel,
but > 16min sequentially. The BF executes TAP tests sequentially...

> I mean, I'm sure we have some bad test cases here and there, but overall I
> think we still have too little test coverage, not too much. The recent
> discovery that recovery_end_command had zero test coverage is one fine
> example of that.
>
> But if we've done something that increases the relative cost of
> valgrind, maybe we can fix that in a centralized way.

There's probably some of that.

The fact that the TAP test infrastructure does all communication with the
server via psql invocations that each execute only a single query is a
problem - connection startup is expensive. I think this is particularly a
problem for things like PostgresNode::poll_query_until(), which is also used
by ::wait_for_catchup(), because a) process creation is more expensive under
valgrind and b) things take longer under valgrind, so we pay that price many
more times.
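
Roughly - and this is only a sketch of the pattern, not the actual
PostgresNode.pm code; run_psql_once() below is a made-up placeholder for
whatever spawns a fresh psql - the cost structure looks like this:

    use strict;
    use warnings;

    # Hypothetical helper: every call pays a fork/exec plus backend
    # connection startup, both of which are far more expensive under
    # valgrind.
    sub run_psql_once
    {
        my ($connstr, $query) = @_;
        my $out = `psql -XAt -d "$connstr" -c "$query"`;
        return undef if $? != 0;
        chomp $out;
        return $out;
    }

    # Shape of a poll_query_until()-style loop: a new psql process and a new
    # connection per iteration, plus a fixed sleep whenever the condition
    # isn't met yet.
    sub poll_until_sketch
    {
        my ($connstr, $query, $expected) = @_;
        for my $attempt (1 .. 180)
        {
            my $out = run_psql_once($connstr, $query);
            return 1 if defined $out && $out eq $expected;
            sleep 1;
        }
        return 0;
    }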

At the same time, increasing the wait between poll attempts also makes the
tests slower, because all the waits for things that have already finished add
up.

I'd guess that, at the very least, driving individual poll_query_until()
calls via a psql session that stays running across queries would reduce this
substantially, and perhaps allow us to shorten the polling interval. But there
are probably some nontrivial challenges around recognizing query boundaries :/
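
As a rough sketch of that idea - not an existing PostgresNode API; IPC::Run is
what the TAP framework already uses for process control, and the
QUERY_BOUNDARY marker is made up - one long-lived psql could be driven like
this, using \echo to recognize query boundaries:

    use strict;
    use warnings;
    use IPC::Run qw(start pump finish);

    my ($in, $out) = ('', '');
    # one long-lived psql, so connection startup is paid only once
    my $psql = start([ 'psql', '-XAt', '-d', 'postgres' ], \$in, \$out);

    sub persistent_query
    {
        my ($query) = @_;
        $out = '';
        # \echo emits a sentinel after the query's output; seeing it tells us
        # the query is done.  Choosing a marker that can never occur in real
        # output (and handling errors) is the nontrivial part.
        $in .= "$query\n\\echo QUERY_BOUNDARY\n";
        pump $psql until $out =~ /^QUERY_BOUNDARY$/m;
        (my $result = $out) =~ s/QUERY_BOUNDARY\n\z//;
        chomp $result;
        return $result;
    }

    # e.g. polling without a new process/connection per attempt
    while (persistent_query('SELECT pg_is_in_recovery()') ne 'f')
    {
        sleep 1;
    }
    finish $psql;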

Briefly looking at a valgrind profile, it looks like a lot of the CPU time is
spent in syscalls related to logging. So far skink had log_statement=all,
log_connections=on, log_disconnections=on - I've turned them off for the next
runs. We'll see if that helps.
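
For context, the knob lives in the buildfarm client's config; this is an
assumed sketch of the relevant build-farm.conf fragment, not skink's actual
configuration:

    # Hypothetical fragment: extra_config entries get appended to
    # postgresql.conf for the test clusters.  These were the settings
    # generating the logging syscalls; leaving them out (or using the values
    # below) avoids that overhead.
    extra_config => {
        DEFAULT => [
            "log_statement = 'none'",
            "log_connections = off",
            "log_disconnections = off",
        ],
    },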

I'll also try to print a bit more timing detail for each TAP test; looks like
I need to figure out how to pass PROVE_TEST='--timer' through the buildfarm.
Shouldn't be too hard.
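
The PostgreSQL makefiles forward extra prove options via the PROVE_FLAGS make
variable, so - as a guess at the buildfarm side, not a tested configuration -
exporting it from build-farm.conf's build_env might be all that's needed:

    # Hypothetical build-farm.conf fragment: build_env sets environment
    # variables for the build/test steps; whether every TAP-running step
    # picks PROVE_FLAGS up is an assumption to verify.
    build_env => {
        PROVE_FLAGS => '--timer',
    },

With --timer, prove prints the elapsed time for each test file, which is the
per-test detail wanted here.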

One thing I think would really help is having the total time for each run
visible in an animal's run history. That way we could pinpoint regressions
reasonably efficiently; right now that's not easily possible without writing
nontrivial queries against the buildfarm database...

Greetings,

Andres Freund
