Re: Adding facility for injection points (or probe points?) for more advanced tests

From: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Adding facility for injection points (or probe points?) for more advanced tests
Date: 2023-11-20 11:23:45
Message-ID: CAExHW5vZXq8WiTFiCfzk-f7Sqi+TMZ0Ty6XxehUZ9o91GP_AJA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Oct 25, 2023 at 9:43 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> Hi all,
>
> I don't remember how many times in the last few years when I've had to
> hack the backend to produce a test case that involves a weird race
> condition across multiple processes running in the backend, to be able
> to prove a point or just test a fix (one recent case: 2b8e5273e949).
> Usually, I come to hardcoding stuff for the following situations:
> - Trigger a PANIC, to force recovery.
> - A FATAL, to take down a session, or just an ERROR.
> - palloc() failure injection.
> - Sleep to slow down a code path.
> - Pause and release with condition variable.
>
> And, while that's helpful to prove a point on a thread, nothing comes
> out of it in terms of regression test coverage in the tree because
> these tests are usually too slow and expensive, as they usually rely
> on hardcoded timeouts. So that's pretty much attempting to emulate
> what one would do with a debugger in a predictable way, without the
> manual steps because human hands don't scale well.
>
> The reason behind that is of course more advanced testing, to be able
> to expand coverage when we have weird and non-deterministic race
> issues to deal with, and the code becoming more complex every year
> makes that even harder. Fault and failure injection in specific paths
> comes into mind, additionally, particularly if you manage complex
> projects based on Postgres.
>
> So, please find attached a patch set that introduces an in-core
> facility to be able to set what I'm calling here an "injection point",
> that consists of being able to register in shared memory a callback
> that can be run within a defined location of the code. It means that
> it is not possible to trigger a callback before shared memory is set,
> but I've faced far more the case where I wanted to trigger something
> after shmem is set anyway. Persisting an injection point across
> restarts is also possible by adding some through an extension's shmem
> hook, as we do for custom LWLocks for example, as long as a library is
> loaded.
>
> This will remind a bit of what Alexander Korotkov has proposed here:
> https://www.postgresql.org/message-id/CAPpHfdtSEOHX8dSk9Qp%2BZ%2B%2Bi4BGQoffKip6JDWngEA%2Bg7Z-XmQ%40mail.gmail.com
> Also, this is much closee to what Craig Ringer is mentioning here,
> where it is named probe points, but I am using a minimal design that
> allows to achieve the same:
> https://www.postgresql.org/message-id/CAPpHfdsn-hzneYNbX4qcY5rnwr-BA1ogOCZ4TQCKQAw9qa48kA%40mail.gmail.com
>
> A difference is that I don't really see a point in passing to the
> callback triggered an area of data coming from the hash table itself,
> as at the end a callback could just refer to an area in shared memory
> or a static set of variables depending on what it wants, with one or
> more injection points (say a location to set a state, and a second to
> check it). So, at the end, the problem comes down in my opinion to
> two things:
> - Possibility to trigger a condition defined by some custom code, in
> the backend (core code or even out-of-core).
> - Possibility to define a location in the code where a named point
> would be checked.
>
> 0001 introduces three APIs to create, run, and drop injection points:
> +extern void InjectionPointCreate(const char *name,
> + InjectionPointCallback callback);
> +extern void InjectionPointRun(const char *name);
> +extern void InjectionPointDrop(const char *name);
>
> Then one just needs to add a macro like that to trigger the callback
> registered in the code to test:
> INJECTION_POINT_RUN("String");
> So the footprint in the core tree is not zero, but it is as minimal as
> it can be.
>
> I have added some documentation to explain that, as well. I am not
> wedded to the name proposed in the patch, so if you feel there is
> better, feel free to propose ideas.

Actually with Attach and Detach terminology, INJECTION_POINT becomes
the place where we "declare" the injection point. So the documentation
needs to first explain INJECTION_POINT and then explain the other
operations.

>
> This facility is hidden behind a specific configure/Meson switch,
> making it a no-op by default:
> --enable-injection-points
> -Dinjection_points={ true | false }

That's useful, but we will also see demand to enable those in
production (under controlled circumstances). But let the functionality
mature under a separate flag and developer builds before used in
production.

>
> 0002 is a test module to test these routines, that I have kept a
> maximum simple to ease review of the basics proposed here. This could
> be extended further to propose more default modes with TAP tests on
> its own, as I don't see a real point in having the SQL bits or some
> common callbacks (like for the PANIC or the FATAL cases) in core.
>
> Thoughts and comments are welcome.

I think this is super useful functionality. Some comments here.

+/*
+ * Local cache of injection callbacks already loaded, stored in
+ * TopMemoryContext.
+ */
+typedef struct InjectionPointArrayEntry
+{
+ char name[INJ_NAME_MAXLEN];
+ InjectionPointCallback callback;
+} InjectionPointArrayEntry;
+
+static InjectionPointArrayEntry *InjectionPointCacheArray = NULL;

Initial size of this array is small, but given the size of code in a
given path to be tested, I fear that this may not be sufficient. I
feel it's better to use hash table whose APIs are already available.

+ test_injection_points_attach
+------------------------------
+
+(1 row)
+
+SELECT test_injection_points_run('TestInjectionBooh'); -- nothing

I find it pretty useless to expose that as a SQL API. Also adding it
in tests would make it usable as an example. In this patchset
INJECTION_POINT should be the only way to trigger an injection point.

That also brings another question. The INJECTION_POINT needs to be added in the
core code but can only be used through an extension. I think there should be
some rudimentary albeit raw test in core for this. Extension does some good
things like provides built-in actions when the injection is triggered. So, we
should keep those too.

I am glad that it covers the frequently needed injections error, notice and
wait. If someone adds multiple injection points for wait and all of them are
triggered (in different backends), test_injection_points_wake() would
wake them all. When debugging cascaded functionality it's not easy to
follow sequence add, trigger, wake for multiple injections. We should
associate a conditional variable with the required injection points. Attach
should create conditional variable in the shared memory for that injection
point and detach should remove it. I see this being mentioned in the commit
message, but I think it's something we need in the first version of "wait" to
be useful.

At some point we may want to control how many times an injection is
triggered by using a count. Often, I have wanted an ERROR to be thrown
in a code path once or twice and then stop triggering it. For example
to test WAL sender restart after a certain event. We can't really time
Detach correctly to avoid multiple restarts. A count is useful is such
a case.

+/*
+ * Attach a new injection point.
+ */
+void
+InjectionPointAttach(const char *name,
+ const char *library,
+ const char *function)
+{
+#ifdef USE_INJECTION_POINTS

In a non-injection-point build this function would be compiled and a call to
this function would throw an error. This means that any test we write which
uses injection points can not do so optionally. Tests which can be run with and
without injection points built will reduce duplication. We should define this
function as no-op in non-injection-point build to faciliate such tests.

Those tests need to also install extension. That's another pain point.
So anyone wants to run the tests needs to compile the extension too. I
am wondering whether we should have this functionality in the core
itself somewhere and will be only useful when built with injection.

Many a times it's only a single backend which needs to be subjected to
an injection. For inducing ERROR and NOTICE, many a times it's also
the same backend attached the client session. For WAIT, however we
need a way to inject from some other session. We might be able to use
current signalling mechanism for that (wake sends SIGUSR1 with
reason). Leaving aside WAIT for a moment when the same backend's
behaviour is being controlled, do we really need shared memory and
also affect all the running backends. I see some discussion about
being able to trigger only for a given PID, but when that PID of the
current backend itself shared memory is not required.

--
Best Wishes,
Ashutosh Bapat

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Matthias van de Meent 2023-11-20 11:32:47 Re: Inquiry on Generating Bitmaps from Filter Conditions in Index Scans
Previous Message Dilip Kumar 2023-11-20 11:12:43 Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock