Re: Online enabling of checksums

From: Andres Freund <andres(at)anarazel(dot)de>
To: Daniel Gustafsson <daniel(at)yesql(dot)se>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Peter Geoghegan <pg(at)bowt(dot)ie>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Online enabling of checksums
Date: 2018-04-06 22:23:23
Message-ID: 20180406222323.4idsn37rye6zzvfj@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2018-04-06 02:28:17 +0200, Daniel Gustafsson wrote:
> Applying this makes the _cancel test pass, moving the failure instead to the
> following _enable test (which matches what coypu and mylodon are seeing).

FWIW, I'm somewhat annoyed that I'm now spending time debugging this to
get the buildfarm green again.

I'm fairly certain that the bug here is a simple race condition in the
test (not the main code!):

The flag informing whether the worker has started is cleared via an
on_shmem_exit() hook:

static void
launcher_exit(int code, Datum arg)
{
ChecksumHelperShmem->abort = false;
pg_atomic_clear_flag(&ChecksumHelperShmem->launcher_started);
}

but the the wait in the test is done via functions like:

CREATE OR REPLACE FUNCTION test_checksums_on() RETURNS boolean AS $$
DECLARE
enabled boolean;
BEGIN
LOOP
SELECT setting = 'on' INTO enabled FROM pg_catalog.pg_settings WHERE name = 'data_checksums';
IF enabled THEN
EXIT;
END IF;
PERFORM pg_sleep(1);
END LOOP;
RETURN enabled;
END;
$$ LANGUAGE plpgsql;

INSERT INTO t1 (b, c) VALUES (generate_series(1,10000), 'starting values');

CREATE OR REPLACE FUNCTION test_checksums_off() RETURNS boolean AS $$
DECLARE
enabled boolean;
BEGIN
PERFORM pg_sleep(1);
SELECT setting = 'off' INTO enabled FROM pg_catalog.pg_settings WHERE name = 'data_checksums';
RETURN enabled;
END;
$$ LANGUAGE plpgsql;

which just waits for setting checksums to have finished. It's
exceedingly unsurprising that a 'pg_sleep(1)' is not a reliable way to
make sure that a process has finished exiting. Then followup tests fail
because the process is still running

Also:
CREATE OR REPLACE FUNCTION reader_loop() RETURNS boolean AS $$
DECLARE
counter integer;
BEGIN
FOR counter IN 1..30 LOOP
PERFORM count(a) FROM t1;
PERFORM pg_sleep(0.2);
END LOOP;
RETURN True;
END;
$$ LANGUAGE plpgsql;
}

really? Let's just force the test take at least 6s purely from
sleeping?

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Steele 2018-04-06 22:34:39 Re: PATCH: Configurable file mode mask
Previous Message David Steele 2018-04-06 22:04:44 Re: PATCH: Configurable file mode mask