Re: [PATCH] initdb: do not exit after warn_on_mount_point

From: andrey(dot)arapov(at)nixaid(dot)com
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Julien Rouhaud" <rjuju123(at)gmail(dot)com>
Cc: "PostgreSQL Hackers" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] initdb: do not exit after warn_on_mount_point
Date: 2022-09-12 10:47:57
Message-ID: 21269e094e35a1b32c426408cc0a4408@nixaid.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

September 12, 2022 7:51 AM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Julien Rouhaud <rjuju123(at)gmail(dot)com> writes:
>
>> On Sun, Sep 11, 2022 at 06:22:21PM +0000, andrey(dot)arapov(at)nixaid(dot)com wrote:
>>> Everyone using containerized postgresql image cannot use /var/lib/postgresql
>>> as the mountpoint but has to use /var/lib/postgresql/data instead due to this
>>> issue [4] due to [5].
>>
>> initdb had this behavior for almost 10 years, and for good reasons: it prevents
>> actual problems that were seen in the field.
>
> The long and the short of this is that one person losing their data
> outweighs thousands of people being very mildly inconvenienced by
> having to create an extra level of directory. I understand that you
> don't think so, but you'd change your mind PDQ if it was *your* data
> that got irretrievably corrupted.
>
> We are not going to remove this check.
>
> If anything, the fault I'd find with the existing code is that it's not
> sufficiently thorough about detecting what is a mount point. AFAICS,
> neither the lost+found check nor the extra-dot-files check will trigger
> on modern filesystems such as XFS. I wonder if we could do something
> like comparing the stat(2) st_dev numbers for the proposed data directory
> vs. its parent directory.
>
> regards, tom lane

Thank you for the explanation Tom & Julien!

This is not an issue for the containerized PostgreSQL where the mountpoint is a required dependency without which the container won't even spawn nor start the scripts (such as initdb).
This also makes it difficult to comprehend the reason initdb exits when it sees PGDATA is a mountpoint it at the first hand.

That and the "Warn about initdb using mount-points" commit [4] made me thinking that this was a typo.

But I do understand that having it exiting on detecting PGDATA being a mountpoint have saved people from having their DB corrupted is a good enough reason for keeping thing as they are now.

== Summarizing the issue ==

The problem is that PostgreSQL will irrecoverably corrupt when the `PGDATA` mountpoint gets accidentally unmounted while the DB is up. [1]

The `PGDATA` can't just accidentally get unmounted due to the files being locked by a process there, unless there is a time window between the DB locking/unlocking the data in it of course.
Or unless someone forcibly unmounts it.

(I'd expect the DB to detect this & immediately terminate with a fatal error)

Or more likely is when `PGDATA` directory gets mounted back again _while_ PostgreSQL is running with a newly initialized cluster in the empty PGDATA directory (say, due to start-scripts running `initdb` or there was an old instance of a previously initialized cluster in the underlying PGDATA directory [with mountpoint being unmounted]), and then gets closed which in turn causes it writing the wrong data into the correct pg_control file.

== The solution (partial) ==

The solution to this was built into the `initdb` [4] - which makes sure it fails when it finds the `PGDATA` is a mountpoint.
This is currently achieved by `initdb` finding the `lost+found` directory in PGDATA, which is not a robust solution, since there are filesystems that do not place that file onto the mountpoint such as XFS, BTRFS, eCryptfs, ..., or one can erase the `lost+found` directory.
Hence, this is only a partial solution to the problem and looks to be more of a workaround nature IMHO.

== The recommendation ==

The recommendation is to mount an upper directory instead of PGDATA directly (one above the PGDATA, e.g. `$PGDATA/../.`) to make sure the DB or a certain* PostgreSQL start script will fail due to a missing `PGDATA` directory instead of running the `initdb` to re-init the cluster.

== The reason ==

This is because, as already mentioned before, running the `initdb` can lead to a situation where one can irrecoverably corrupt the DB by mounting the original PGDATA backing device back over again and stopping/re-starting the DB which will irrecoverably overwrite the good pg_control file.
(* - I assume that on some systems the postgresql start-script runs `initdb` when detects PGDATA directory is empty before starting the DB; like in the postgresql docker container [2] (by checking the presence of `$PGDATA/PG_VERSION`). And the default postgresql start script [3] runs the DB directly without issuing `initdb`.)

== Next steps? ==

I'm wondering whether that's the same issue for mariadb / mysql ...

And whether there could be a better way to handle this issue and protect the PostgreSQL from corrupting the DB in the event the PGDATA gets accidentally unmounted / re-mounted leading the the issue described above.
Solving this would allow people to actually mount the PGDATA directly, without the need to fix the `initdb`'s mountpoint detection for filesystems without `lost+found` directory.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1247477#c1
[2] https://github.com/docker-library/postgres/blob/74e51d102aede/docker-entrypoint.sh#L317
[3] https://github.com/postgres/postgres/blob/REL_14_5/contrib/start-scripts/linux#L94
[4] https://github.com/postgres/postgres/commit/17f15239325a88581bb4f9cf91d38005f1f52d69

Kind regards,
Andrey Arapov

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message kuroda.hayato@fujitsu.com 2022-09-12 10:57:42 RE: Perform streaming logical transactions by background workers and parallel apply
Previous Message Thomas Munro 2022-09-12 10:42:31 Re: Introduce wait_for_subscription_sync for TAP tests