POC: Better infrastructure for automated testing of concurrency issues

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: POC: Better infrastructure for automated testing of concurrency issues
Date: 2020-11-25 14:10:54
Message-ID: CAPpHfdtSEOHX8dSk9Qp+Z++i4BGQoffKip6JDWngEA+g7Z-XmQ@mail.gmail.com
Lists: pgsql-hackers

Hackers,

PostgreSQL is a complex multi-process system, and we periodically face
complicated concurrency issues. While the postgres community does a
great job of investigating and fixing these problems, our ability to
reproduce concurrency issues in the source code test suites is limited.

I think we currently have two general ways to reproduce concurrency
issues.
1. A text scenario for manual reproduction of the issue, which could
involve psql sessions, gdb sessions etc. A couple of examples are [1] and
[2]. This method provides reliable reproduction of concurrency issues. But
it's hard to automate, because it requires external instrumentation
(a debugger), and it's not stable across postgres code changes (that is,
particular line numbers for breakpoints could change). I think this is
why we currently don't have such scenarios among the postgres test suites.
2. Another way is to reproduce the concurrency issue without directly
touching the database internals, using pgbench or another way to simulate
the workload (see [3] for example). This way is easier to automate, because
it doesn't need external instrumentation and it's not so sensitive to source
code changes. But at the same time this way is not reliable and is
resource-consuming.

In view of the above, I'd like to propose a POC patch, which implements new
built-in infrastructure for reproducing concurrency issues in automated
test suites. The general idea is so-called "stop events", which are
special places in the code where execution can be stopped on some
condition. A stop event also exposes a set of parameters, encapsulated in
a jsonb value. The condition over the stop event parameters is defined
using the jsonpath language.

The following functions control the behavior:
* pg_stopevent_set(stopevent_name, jsonpath_condition) - sets the condition
for the stop event. Once the function is executed, all backends that reach
the given stop event with parameters satisfying the given jsonpath
condition will be stopped.
* pg_stopevent_reset(stopevent_name) - resets the stop event. All
backends previously stopped on the given stop event will continue
execution.
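
To make the set/reset semantics above concrete, here is a minimal
single-process sketch in C. It is not taken from the patch: every toy_*
name and the "page_visit" event are hypothetical, and the jsonpath
condition over jsonb parameters is replaced by a plain predicate over one
integer. It only models the decision "would a backend reaching this event
stop?", not the actual blocking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a jsonpath condition: a predicate over one int. */
typedef bool (*toy_condition) (int param);

#define TOY_MAX_EVENTS 8

typedef struct
{
    const char   *name;       /* stop event name; caller keeps it alive */
    toy_condition condition;  /* NULL means no condition is currently set */
} ToyStopEvent;

static ToyStopEvent toy_events[TOY_MAX_EVENTS];
static size_t toy_nevents = 0;

/* Find a registered event by name, or return NULL. */
static ToyStopEvent *
toy_lookup(const char *name)
{
    for (size_t i = 0; i < toy_nevents; i++)
        if (strcmp(toy_events[i].name, name) == 0)
            return &toy_events[i];
    return NULL;
}

/* Models pg_stopevent_set(): attach a condition to a named stop event. */
static bool
toy_stopevent_set(const char *name, toy_condition cond)
{
    ToyStopEvent *ev;

    assert(name != NULL && cond != NULL);
    ev = toy_lookup(name);
    if (ev == NULL)
    {
        if (toy_nevents == TOY_MAX_EVENTS)
            return false;       /* registry is full */
        ev = &toy_events[toy_nevents++];
        ev->name = name;
    }
    ev->condition = cond;
    return true;
}

/* Models pg_stopevent_reset(): clear the condition so nothing stops here. */
static void
toy_stopevent_reset(const char *name)
{
    ToyStopEvent *ev = toy_lookup(name);

    if (ev != NULL)
        ev->condition = NULL;
}

/* Would a backend reaching this event with this parameter be stopped? */
static bool
toy_would_stop(const char *name, int param)
{
    ToyStopEvent *ev = toy_lookup(name);

    return ev != NULL && ev->condition != NULL && ev->condition(param);
}

/* Example condition: stop only when the visited block number is 42. */
static bool
toy_block_is_42(int blkno)
{
    return blkno == 42;
}
```

With this model, toy_stopevent_set("page_visit", toy_block_is_42) makes
toy_would_stop("page_visit", 42) true and toy_would_stop("page_visit", 7)
false; after toy_stopevent_reset("page_visit") both are false again.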

Of course, evaluating stop events causes CPU overhead. This is why
it's controlled by the enable_stopevents GUC, which is off by default. I
expect the overhead with enable_stopevents = off shouldn't be observable.
Even if it is observable, we could make stop events available only with a
specific configure parameter. There is also the trace_stopevents GUC, which
traces all stop events to the log at debug2 level.

In the code, stop events are defined using the macro STOPEVENT(event_id,
params). The 'params' argument should be a function call, and it's
evaluated only if stop events are enabled.
pg_isolation_test_session_is_blocked() takes stop events into account, so
stop events are suitable for isolation tests.
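
The point that 'params' is evaluated only when stop events are enabled
follows from how C macros expand their arguments. A rough standalone
sketch of that idea is below. It is not the macro from the patch: the
toy_* names are hypothetical, a plain bool stands in for the
enable_stopevents GUC, and an int stands in for the jsonb parameters.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the enable_stopevents GUC, which is off by default. */
static bool toy_enable_stopevents = false;

/* Counters that let us observe what actually ran. */
static int toy_params_built = 0;
static int toy_events_handled = 0;

/* Hypothetical parameter builder; the real one would produce jsonb. */
static int
toy_build_params(int blkno)
{
    toy_params_built++;
    return blkno;
}

/* Hypothetical handler; the real one would check the condition and wait. */
static void
toy_handle_stopevent(int event_id, int params)
{
    assert(event_id > 0);
    (void) params;
    toy_events_handled++;
}

/*
 * Because 'params' is a macro argument, the expression is pasted inside
 * the if-branch and is never evaluated while stop events are disabled.
 */
#define TOY_STOPEVENT(event_id, params) \
    do { \
        if (toy_enable_stopevents) \
            toy_handle_stopevent((event_id), (params)); \
    } while (0)

/* A code path that contains one stop event. */
static void
toy_scan_page(int blkno)
{
    TOY_STOPEVENT(1, toy_build_params(blkno));
}
```

Calling toy_scan_page() with the flag off leaves both counters at zero, so
the disabled cost is a single branch; with the flag on, each call builds
the parameters once and hands them to the handler.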

The POC patch comes with a sample isolation test in
src/test/isolation/specs/gin-traverse-deleted-pages.spec, which reproduces
the issue described in [2] (a gin scan steps onto a page concurrently
deleted by vacuum).

From my point of view, stop events would open great possibilities to
improve coverage of concurrency issues. They allow us to reliably test
concurrency issues in both isolation and TAP test suites. And such test
suites don't take extraordinary resources to run. The main cost
here is maintaining a set of stop events across the codebase. But I think
this cost is justified by much better coverage of concurrency issues.

Feedback is welcome.

Links.
1. https://www.postgresql.org/message-id/4E1DE580.1090905%40enterprisedb.com
2. https://www.postgresql.org/message-id/CAPpHfdvMvsw-NcE5bRS7R1BbvA4BxoDnVVjkXC5W0Czvy9LVrg%40mail.gmail.com
3. https://www.postgresql.org/message-id/BF9B38A4-2BFF-46E8-BA87-A2D00A8047A6%40hintbits.com

------
Regards,
Alexander Korotkov

Attachment Content-Type Size
0001-Stopevents-v1.patch application/octet-stream 27.4 KB
