Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From: Amir Rohan <amir(dot)rohan(at)zoho(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Amir Rohan <amir(dot)rohan(at)mail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, gsmith(at)gregsmith(dot)com
Subject: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.
Date: 2015-10-03 12:50:28
Message-ID: 560FCF14.6000302@zoho.com
Lists: pgsql-hackers

On 10/03/2015 02:38 PM, Michael Paquier wrote:
> On Fri, Oct 2, 2015 at 11:10 PM, Amir Rohan wrote:
>> On 10/02/2015 03:33 PM, Michael Paquier wrote:
>>> Any server instances created during the tests should never use a
>>> user-defined port for portability. Hence using those ports as keys
>>> just made sense. We could have for example custom names, that have
>>> port values assigned to them, but that's actually an overkill and
>>> complicates the whole facility.
>>>
>>
>> Something like:
>>
>> global nPortsAssigned = 0;
>> AssignPort() -> return is_ok(nPortsAssigned++)
>>
>> was what I used.
>
> Why do you need that? Creating a node is, in the most basic way, a
> matter of calling make_master, make_*_standby where a port number, or
> identifier gets uniquely assigned to a node. The main point of the
> approach taken by this patch is to make port assignment transparent
> for the caller.
>

See next.

>> Granted, you have to try fairly hard to shoot yourself in the leg,
>> but since the solution is so simple, why not? If we never reuse ports
>> within a single test, this goes away.
>
> Well, you can reuse the same port number in a test. Simply teardown
> the existing node and then recreate a new one. I think that port
> number assignment to a node should be transparent to the caller, in
> our case the perl test script holding a scenario.
>

What part of "Never assign the same port twice during one test"
makes this "not transparent to the user"?

If you're thinking about parallel tests, I don't think you
need to worry. Availability checks take care of one part,
and the fact that the portnum-as-map-key is test-local takes
care of the other.

But, see next.

>
>>>> 4) Port assignment relies on liveness checks on running servers.
>>>> If a server is shut down and a new one instantiated, the port will get
>>>> reused, data will get trashed, and various confusing things can happen.
>>>
>>> Right. The safest way to do that is to check in get_free_port if a
>>> port number is used by a registered node, and continue to loop in if
>>> that's the case. So done.
>>>
>>
>> That eliminates the "sweet and gentle" variant of the scenario, but it's
>> susceptible to the "ABA problem":
>> https://en.wikipedia.org/wiki/ABA_problem
>> https://youtu.be/CmxkPChOcvw?t=786
>
> I learnt a new thing here. That's basically an existing problem even
> with the existing perl test infrastructure relying on TestLib.pm when
> tests are run in parallel. What we would need here is a global mapping
> file storing all the port numbers used by all the nodes currently in
> the tests.
>

Yeah, a poor man's way to ensure ports aren't reused (I wasn't very
clear at the top of the post) is something like:

global nPortsAssigned = 0;

AssignPort():
    port = BASE_PORT + nPortsAssigned  # BASE_PORT is the lowest port we use
    while (!available(port)):
        port++
    nPortsAssigned = port - BASE_PORT + 1
    return port

It has its glaring faults, but would probably work ok.
In any case, I'm sure you can do better.
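For concreteness, here's a minimal Python sketch of that monotonic scheme
(the names BASE_PORT, available(), assign_port() are mine, not anything
in the patch):

```python
import socket

BASE_PORT = 15432        # hypothetical lowest port the suite would use
_ports_assigned = 0      # monotonic counter: never decremented

def _available(port):
    # True if nothing is currently listening on this port locally
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex(("127.0.0.1", port)) != 0

def assign_port():
    # Skip busy ports, but never re-issue one we handed out before,
    # so a torn-down node's port can't be recycled within one test
    # run -- which is what rules out the ABA scenario.
    global _ports_assigned
    port = BASE_PORT + _ports_assigned
    while not _available(port):
        port += 1
    _ports_assigned = port - BASE_PORT + 1
    return port
```

Because the counter only ever grows, tearing a node down and creating a
new one can never hand the new node the old node's port.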

>>
>> Granted, you have to try fairly hard to shoot yourself in the leg,
>> but since the solution is so simple, why not? If we never reuse ports
>> within a single test, this goes away.
>
> Well, you can reuse the same port number in a test. Simply teardown
> the existing node and then recreate a new one. I think that port
> number assignment to a node should be transparent to the caller, in
> our case the perl test script holding a scenario.
>

I was arguing that you *never* want to reuse port numbers. That is,
as long as the lib ensures we never reuse ports within one test,
all kinds of corner cases are eliminated.

>>>> 5) Servers are shutdown with -m 'immediate', which can lead to races
>>>> in the script when archiving is turned on. That may be good for some
>>>> tests, but there's no control over it.
>>>
>>> I hesitated with fast here actually. So changed this way. We would
> want as well a teardown command to stop the node with immediate and
>>> unregister the node from the active list.
>>>
>>
>> In particular, I was shutting down an archiving node and the archiving
>> was truncated. I *think* smart doesn't do this. But again, it's really
>> that the test writer can't easily override, not that the default is wrong.
>
> Ah, OK. Then fast is just fine. It shuts down the node correctly.
> "smart" would wait for all the current connections to finish but I am
> wondering if it currently matters here: I don't see a clear use case
> yet where it would make sense to have a multi-threaded script... If
> somebody comes back with a clear idea here perhaps we could revisit
> that but it does not seem worth it now.
>

My mistake. Perhaps what would be useful, though, is a way
to force a "messy" shutdown: a kill -9, basically. It's a recovery
test suite, right?
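Something along these lines, say (just a sketch; kill_node is a name I
made up, though postmaster.pid really does carry the postmaster's pid
on its first line):

```python
import os
import signal

def kill_node(pgdata):
    # Simulate an unclean crash: SIGKILL the postmaster so that the
    # recovery paths actually get exercised, rather than a clean
    # shutdown. PostgreSQL writes its pid on the first line of
    # $PGDATA/postmaster.pid.
    with open(os.path.join(pgdata, "postmaster.pid")) as f:
        pid = int(f.readline().strip())
    os.kill(pid, signal.SIGKILL)
```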

>>>> Other issues:
>>>> 6. Directory structure, used one directory per thing but more logical
>>>> to place all things related to an instance under a single directory,
>>>> and name them according to role (57333_backup, and so on).
>>>
>>> Er, well. The first version of the patch did so, and then I switched
>>> to an approach closer to what the existing TAP facility is doing. But
>>> well let's simplify things a bit.
>>>
>>
>> I know, I know, but:
>> 1) an instance is a "thing" in your script, so having its associated
>> paraphernalia in one place makes more sense (maybe only to me).
>> 2) That's often how folks (well, how I) arrange things in deployment,
>> at least with archive/backup as symlinks to the nas.
>>
>> Alternatively, naming the dirs with a prefix (srv_foo_HASH,
>> backup_foo_backupname_HASH, etc') would work as well.
>
> The useful portion about tempdir is that it cleans up itself
> automatically should an error happen. It does not seem to me we want
> to lose that.
>

Ensuring cleanup and directory structure aren't inherently related.
TestLib makes cleanup easy if you're willing to accept its flat
structure, but writing something that does cleanup and lets you
control the directory structure is perfectly doable.

The question is only whether you agree that having per-server
directories could be convenient. Tying into the next point: if you
don't think anyone will ever need to look into these directories
(which I disagree with), then the dir structure indeed doesn't matter.
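To illustrate what I mean by per-server directories (the layout and the
node_dirs name are hypothetical, not what the patch does):

```python
import os

def node_dirs(root, name):
    # Everything belonging to one instance lives under root/name,
    # with subdirectories named by role, e.g.
    #   root/master_5432/{data,backup,archive}
    base = os.path.join(root, name)
    dirs = {role: os.path.join(base, role)
            for role in ("data", "backup", "archive")}
    for d in dirs.values():
        os.makedirs(d)
    return dirs
```

Cleanup then just removes root/name recursively; the flat-vs-nested
choice doesn't change that.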

>>>> 8. No canned way to output a pretty-printed overview of the running system
>>>> (paths, ports, for manual checking).
>>>
>>> Hm. Why not... Are you suggesting something like print_current_conf
>>> that goes through all the registered nodes and outputs that? How would
>>> you use it?
>>>
>>
>> For one thing, I could open a few terminals and `$(print_env_for_server
>> 5437)`, so psql just worked.
>> I wish PEG had that as well.
>
> Hm. Isn't that coupled with the case where a failure happens but
> tempdirs are then cleaned up?

But I've mentioned that's inconvenient as well. If you don't think
/that/ should be fixed, then yes, there's no point in adding this.
Still, perhaps doing both (plus the previous point) would make
writing tests easier.

At least for me, writing tests isn't simply a typing exercise.
I always need to inspect the system at stages during the test.
It's only after the test is ready that cleanup is useful, before
that it actually hampers work.
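As a sketch of what that print_env_for_server helper might emit (the
function name and output shape are just my guess at an interface):

```python
def env_for_server(port, host="127.0.0.1"):
    # Emit shell exports; `eval $(print_env_for_server 5437)` in a
    # terminal would then make a bare `psql` connect to that node.
    return "export PGPORT=%d PGHOST=%s" % (port, host)
```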

> I would expect hackers to run those runs
> until the end.

I agree -- when you're running them, but what about when you're
/writing/ them?

> If a failure happens, it would then be useful to get a
> dump of what happens. However, it seems to me that we can get the same
> information by logging all the information when creating a node in the
> log file. I have added a routine in this sense, which is called each
> time a node is initialized. It seems helpful either way.
>

>>>> 11. a canned "server is responding to queries" helper would be convenient.
>>>
>>> Do you mean a wrapper on pg_isready? Do you have use cases in mind for it?
>>>
>>
>> Block until recovery is finished, before testing. Eliminates races,
>> and avoids the stupid sleep(3) I used.
>
> TODO
>
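For the record, the shape I had in mind is a polling helper. The real
thing would wrap pg_isready (or poll `SELECT pg_is_in_recovery()`);
this sketch just polls the TCP port, and all names here are mine:

```python
import socket
import time

def wait_for_ready(port, host="127.0.0.1", timeout=30.0):
    # Poll until something accepts connections on (host, port), so a
    # test can block until the node is up (or recovery is finished)
    # instead of a fixed sleep(3).
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(1.0)
            if s.connect_ex((host, port)) == 0:
                return True
        time.sleep(0.2)
    return False
```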
>>>> 4b) server shutdown should perhaps be "smart" by default, or segmented
>>>> into calmly_bring_to_a_close(), pull_electric_plug() and
>>>> drop_down_the_stairs_into_swimming_pool().
>>>
>>> Nope, not agreeing here. "immediate" is rather violent to stop a node,
>>> hence I have switched it to use "fast" and there is now a
>>> teardown_node routine that uses immediate, that's more aimed at
>>> cleanup up existing things fiercely.
>>>
>>
>> Ok, not as the default, but possible to request a specific kind of
>> shutdown. I needed smart in my case. Plus, in a scenario, you might
>> expressly be testing behavior for a specific mode, it needs to be
>> controllable.
>
> If your test script is running with a single thread, "fast" or "smart"
> would not really make a difference, no?
>

It would if there's a bug in one of them and I'm trying to write
a regression test for it. Recall, this was part of the broader view
of "provide defaults, allow override" I was suggesting.

>>> I have as well moved RecoveryTest.pm to src/test/perl so as all the
>>> consumers of prove_check can use it by default, and decoupled
>>> start_node from make_master and make_*_standby so as it is possible to
>>> add for example new parameters to their postgresql.conf and
>>> recovery.conf files before starting them.
>>>
>>> Thanks a lot for the feedback! Attached is an updated patch with all
>>> the things mentioned above done. Are included as well the typo fixes
>>> you sent upthread.
>>> Regards,
>>>
>>
>> Great! I'll look at it and post again if there's more. If any of the
>> above extra explanations make it clearer why I suggested some changes
>> you didn't like...
>
> I am attaching a new version. I found a small bug in test case 001
> when checking if standby 2 has caught up. There is also this dump
> function that is helpful. The -i switch in cp command has been removed
> as well.
>

I'm sorry I didn't review the code, but honestly my perl is so rusty I'm
afraid I'll embarrass myself :)

Thanks!
Amir
