|From:||Daniel Gustafsson <daniel(at)yesql(dot)se>|
|To:||Heikki Linnakangas <hlinnaka(at)iki(dot)fi>|
|Cc:||Michael Banck <michael(dot)banck(at)credativ(dot)de>, Magnus Hagander <magnus(at)hagander(dot)net>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>|
|Subject:||Re: Online checksums patch - once again|
|Views:||Raw Message | Whole Thread | Download mbox | Resend email|
> On 27 Jan 2021, at 16:37, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> Revisiting an issue we discussed earlier:
> On 25/11/2020 15:20, Daniel Gustafsson wrote:
>> On 23 Nov 2020, at 18:36, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>>> On 17/11/2020 10:56, Daniel Gustafsson wrote:
>>>> I've reworked this in the attached such that the enable_ and
>>>> disable_ functions merely call into the launcher with the desired
>>>> outcome, and the launcher is responsible for figuring out the
>>>> rest. The datachecksumworker is now the sole place which
>>>> initiates a state transfer.
>>> Well, you still fill the DatachecksumsWorkerShmem->operations array
>>> in the backend process that launches the datacheckumworker, not in
>>> the worker process. I find that still a bit surprising, but I
>>> believe it works.
>> I'm open to changing it in case there are strong opinions, it just
>> seemed the most natural to me.
> This kept bothering me, so I spent a while hacking this to my liking. The attached patch moves the code to fill in 'operations' from the backend to the launcher, so that the pg_enable/disable_checksums() call now truly just stores the desired state, checksum on or off, in shared memory, and launches the launcher process. The launcher process figures out how to get to the desired state. This removes the couple of corner cases that previously emitted a NOTICE about the processing being concurrently disabled or aborted. What do you think?
I like it, it does avoid a few edgecases, while some are moved around (more on
that below) yielding easier to read code. This wasn't really how I thought you
wanted it when we talked about this, so I'm very glad to see it in code since I
now get what you were saying. Thanks!
> I haven't done much testing, so if you adopt this approach, please check if I broke something in the process.
I think you broke restarts: the enable_checksums variable wasn't populated so
the restart seemed to fail to set the right operations. Also, the false return
can now mean an actual failure as well as an abort case for a restart. I've
done a quick change to look for an updated state, but that doesn't cover the
case when processing fails *and* a new state has been set. Maybe the best
solution is to change return type for processing to int's such that the three
cases (failure, abort, success) can be returned?
Also, this fails on assertion for me when checking the current state since
interrupts aren't held off. I have a feeling it should've done that in my
version too as I look at it? The attached holds off interrutps to pass that
(we clearly don't want a new state while we decide on operations anyways).
The tests in src/test/checksum were reliably failing on these for me but pass
with the attached 0002 applied.
> This changes the way the abort works. If the pg_enable/disable_checksums() is called, while the launcher is already busy changing the state, pg_enable/disable_checksums() will just set the new desired state in shared memory anyway. The launcher process will notice that the target state changed some time later, and restart from scratch.
The "notice period" is the same as before though, unless I'm missing something.
> A couple of other issues came up while doing that:
> - AbortProcessing() has two callers, one in code that runs in the launcher process, and another one in code that runs in the worker process. Is it really safe to use from the worker process? Calling ProcessAllDatabases() in the worker seems sketchy. (This is moot in this new patch version, as I removed AbortProcessing() altogether)
I think it was safe, but I agree that this version is a lot cleaner so it is as
you say moot.
> - Is it possible that the worker keeps running after the launcher has already exited, e.g. because of an ERROR or SIGTERM? If you then quickly call pg_enable_checksums() again, can you end up with two workers running at the same time? Is that bad?
True, I think there is a window for when that could happen, but worst case
should be that a database is checksummed twice?
> On 26/01/2021 23:00, Daniel Gustafsson wrote:
>>> On 22 Jan 2021, at 12:55, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>>>> @@ -3567,6 +3571,27 @@ RelationBuildLocalRelation(const char *relname,
>>>> relkind == RELKIND_MATVIEW)
>>>> + /*
>>>> + * Set the data checksum state. Since the data checksum state can change at
>>>> + * any time, the fetched value might be out of date by the time the
>>>> + * relation is built. DataChecksumsNeedWrite returns true when data
>>>> + * checksums are: enabled; are in the process of being enabled (state:
>>>> + * "inprogress-on"); are in the process of being disabled (state:
>>>> + * "inprogress-off"). Since relhaschecksums is only used to track progress
>>>> + * when data checksums are being enabled, and going from disabled to
>>>> + * enabled will clear relhaschecksums before starting, it is safe to use
>>>> + * this value for a concurrent state transition to off.
>>>> + *
>>>> + * If DataChecksumsNeedWrite returns false, and is concurrently changed to
>>>> + * true then that implies that checksums are being enabled. Worst case,
>>>> + * this will lead to the relation being processed for checksums even though
>>>> + * each page written will have them already. Performing this last shortens
>>>> + * the window, but doesn't avoid it.
>>>> + */
>>>> + HOLD_INTERRUPTS();
>>>> + rel->rd_rel->relhaschecksums = DataChecksumsNeedWrite();
>>>> + RESUME_INTERRUPTS();
>>>> * Okay to insert into the relcache hash table.
>>> I grepped for relhashcheckums, and concluded that the value in the
>>> relcache isn't actually used for anything. Not so! In
>>> heap_create_with_catalog(), the actual pg_class row is constructed
>>> from the relcache entry, so the value set in
>>> RelationBuildLocalRelation() finds its way to pg_class. Perhaps it
>>> would be more clear to pass relhachecksums directly as an argument
>>> to AddNewRelationTuple(). That way, the value in the relcache would
>>> be truly never used.
>> I might be thick (or undercaffeinated) but I'm not sure I follow. AddNewRelationTuple calls InsertPgClassTuple which in turn avoids the
>> relcache entry.
> Ah, you're right, I misread AddNewRelationTuple. That means that the relhaschecksums field in the relcache is never used? That's a clear rule. The changes to formrdesc() and RelationBuildLocalRelation() seem unnecessary then, we can always initialize relhaschecksums to false in the relcache.
They probably are, but should they be kept as a just-in-case for future hackery
which may assume the relcache is at all points correct?
I've attached my changes as 0002 here on top of your patch, what do you think
about those? There are some commentfixups as well, some stemming from your
patch and some from much earlier versions that I just hadn't seen until now.
Daniel Gustafsson https://vmware.com/
|Next Message||Daniel Gustafsson||2021-01-28 23:20:21||Re: Support for NSS as a libpq TLS backend|
|Previous Message||Thomas Munro||2021-01-28 22:13:03||Re: [PATCH] remove pg_standby|