Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Amir Rohan <amir(dot)rohan(at)mail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, gsmith(at)gregsmith(dot)com
Subject: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.
Date: 2015-10-02 12:33:41
Message-ID: CAB7nPqRKvdvVLMivT35ySPG8arYx8hh9USn9y_Hogx8iGGOYZw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Sep 26, 2015 at 10:25 AM, Amir Rohan wrote:
> On 09/25/2015 01:47 PM, Michael Paquier wrote:
>> offer a way to set up a cluster ready for hacking in a single
>> command, and being able to switch among them easily. I am not sure we
>> would really find a cross-platform script generic enough to satisfy
>> all the needs of folks here and integrate it directly in the tree :)
>>
> Yes, perl/bash seems to be the standard and both have their shorcomings,
> in either convenience or familiarity...

Note as well that bash does not run on Windows, so this should be in
perl for the cross-platform requirements.

> Coming back to the recovery testing package...

Thanks for providing input on this patch!

> I was investigating some behaviour having to do with recovery
> and tried your new library to write a repro case. This uncovers some
> implicit assumptions in the package that can make things diffficult
> when violated. I had to rewrite nearly every function I used.

OK. Noted. Could you be more explicit here with for example an example
of script or a patch?

> Major pain points:
> 1) Lib stores metadata (ports,paths, etc') using ports as keys.
> -> Assumes ports aren't reused.

What would be the point of reusing the same port number in a test for
different nodes? Ports are assigned automatically depending on their
availability looking at if a server is listening to it so they are
never reused as long as the node is running so that's a non-issue IMO.
Any server instances created during the tests should never use a
user-defined port for portability. Hence using those ports as keys
just made sense. We could have for example custom names, that have
port values assigned to them, but that's actually an overkill and
complicates the whole facility.

> -> Because assumes server keep running until the tear down.

Check. Locking down the port number is the problem here.

> And
> 2) Behaviour (paths in particular) is hardwired rather then overridable
> defaults.

This is the case of all the TAP tests. We could always use the same
base directory for all the nodes and then embed a sub-directory whose
name is decided using the port number. But I am not really sure if
that's a win.

> This is exactly what I needed to test, problems:
> 3) Can't stop server without clearing its testing data (the maps holding
> paths and things). But that data might be specifically
> needed, in particular the backup shouldn't disappear when the
> server melts down or we have a very low-grade DBA on our hands.

OK, you have a point here. You may indeed want routines for to enable
and disable a node completely decoupled from start and stop, with
something like enable_node and disable_node that basically registers
or unregisters it from the list of active nodes. I have updated the
patch this way.

> 4) Port assignment relies on liveness checks on running servers.
> If a server is shut down and a new instantiated, the port will get
> reused, data will get trashed, and various confusing things can happen.

Right. The safest way to do that is to check in get_free_port if a
port number is used by a registered node, and continue to loop in if
that's the case. So done.

> 5) Servers are shutdown with -m 'immediate', which can lead to races
> in the script when archiving is turned on. That may be good for some
> tests, but there's no control over it.

I hesitated with fast here actually. So changed this way. We would
want as wall a teardown command to stop the node with immediate and
unregister the node from the active list.

> Other issues:
> 6. Directory structure, used one directory per thing but more logical
> to place all things related to an instance under a single directory,
> and name them according to role (57333_backup, and so on).

Er, well. The first version of the patch did so, and then I switched
to an approach closer to what the existing TAP facility is doing. But
well let's simplify things a bit.

> 7. enable_restoring() uses "cp -i" 'archive_command', not a good fit
> for an automated test.

This seems like a good default to me, and actually that's portable on
Windows easily. One could always append a custom archive_command in a
test when for example testing conflicting archiving when archive_mode
= always.

> Aside from running the tests, the convenience of writing them
> needs to be considered. My perl is very weak, it's been at least
> a decade, but It was difficult to make progress because everything
> is geared toward a batch "pass/fail" run . Stdout is redirected,
> and the log files can't be 'tail --retry -f' in another terminal,
> because they're clobbered at every run.

This relies on the existing TAP infrastructure and this has been
considered as the most portable way of doing by including Windows.
This patch is not aiming at changing that, and will use as much as
possible the existing infrastructure.

> 8. No canned way to output a pprinted overview of the running system
> (paths, ports, for manual checking).

Hm. Why not... Are you suggesting something like print_current_conf
that goes through all the registered nodes and outputs that? How would
you use it?

> 9. Finding things is difficult, See 6.

See my comment above.

> 10. If a test passes/fails or dies due to a bug, everything is cleaned.
> Great for testing, bad for postmortem.

That's something directly related to TestLib.pm where
File:Temp:tempdir creates a temporary path with CLEANUP = 1. We had
discussions regarding that actually...

> 11. a canned "server is responding to queries" helper would be convenient.

Do you mean a wrapper on pg_isready? Do you have use cases in mind for it?

> It might be a good idea to:
> 1) Never reuse ports during a test. Liveness checking is used
> to avoid collisions, but should not determine order of assignment.

Agreed. As far as I can see the problem here is related to the fact
that the port of non-running server may be fetched by another one.
That's a bug of my patch.

> 2) Decouple cleanup from server shutdown. Do the cleanup as the end of
> test only, and allow the user to keep things around.

Agreed here.

> 3) Adjust the directory structure to one top directory per server with
> (PGDATA, backup, archive) subdirs.

Hm. OK. The first version of the patch actually did so.

> 4) Instead of passing ports around as keys, have _explicit functions
> which can be called directly by the user (I'd like the backup *HERE*
> please), with the current functions refactored to merely invoke them
> by interpolating in the values associated with the port they were given.

I don't really see in what this would be a win. We definitely should
have all the data depending on temporary paths during the tests to
facilitate the cleanup wrapper work.

> 4b) server shutdown should perhaps be "smart" by default, or segmented
> into calmly_bring_to_a_close(), pull_electric_plug() and
> drop_down_the_stairs_into_swimming_pool().

Nope, not agreeing here. "immediate" is rather violent to stop a node,
hence I have switched it to use "fast" and there is now a
teardown_node routine that uses immediate, that's more aimed at
cleanup up existing things fiercely.

I have as well moved RecoveryTest.pm to src/test/perl so as all the
consumers of prove_check can use it by default, and decoupled
start_node from make_master and make_*_standby so as it is possible to
add for example new parameters to their postgresql.conf and
recovery.conf files before starting them.

Thanks a lot for the feedback! Attached is an updated patch with all
the things mentioned above done. Are included as well the typo fixes
you sent upthread.
Regards,
--
Michael

Attachment Content-Type Size
20151002_recovery_regressions_v4.patch text/x-patch 28.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amir Rohan 2015-10-02 14:10:07 Re: In-core regression tests for replication, cascading, archiving, PITR, etc.
Previous Message Nikolay Shaplov 2015-10-02 11:19:41 Re: pageinspect patch, for showing tuple data