Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: exclusion(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"
Date: 2021-06-22 09:30:38
Message-ID: ac119d1e-05d1-f050-b92a-0a524d68b848@iki.fi
Lists: pgsql-bugs

On 18/06/2021 18:00, PG Bug reporting form wrote:
> The following bug has been logged on the website:
>
> Bug reference: 17064
> Logged by: Alexander Lakhin
> Email address: exclusion(at)gmail(dot)com
> PostgreSQL version: 14beta1
> Operating system: Ubuntu 20.04
> Description:
>
> The following script:
> ===
> for i in `seq 100`; do
> createdb db$i
> done
>
> # Based on the contents of the regression test "vacuum"
> echo "
> CREATE TABLE pvactst (i INT);
> INSERT INTO pvactst SELECT i FROM generate_series(1,10000) i;
> DELETE FROM pvactst;
> VACUUM pvactst;
> DROP TABLE pvactst;
>
> VACUUM FULL pg_database;
> " >/tmp/vacuum.sql
>
> for n in `seq 10`; do
> echo "iteration $n"
> for i in `seq 100`; do
> ( { for f in `seq 100`; do cat /tmp/vacuum.sql; done } | psql -d db$i ) >psql-$i.log 2>&1 &
> done
> wait
> grep -C5 FATAL psql*.log && break;
> done
> ===
> detects sporadic FATAL errors:
> iteration 1
> psql-56.log-DROP TABLE
> psql-56.log-VACUUM
> psql-56.log-CREATE TABLE
> psql-56.log-INSERT 0 10000
> psql-56.log-DELETE 10000
> psql-56.log:FATAL: relation mapping file "global/pg_filenode.map" contains incorrect checksum
> psql-56.log-server closed the connection unexpectedly
> psql-56.log- This probably means the server terminated abnormally
> psql-56.log- before or while processing the request.
> psql-56.log-connection to server was lost

Hmm, the simplest explanation would be that the read() or write() on the
relmapper file is not atomic. We assume that it is, and don't use a lock
in load_relmap_file() because of that. Is there anything unusual about
the filesystem, mount options or the kernel you're using? I could not
reproduce this on my laptop. Does the attached patch fix it for you?
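
To make the suspected failure mode concrete, here is a minimal sketch of how a
non-atomic read could yield an "incorrect checksum" error. It is an
illustration only, not PostgreSQL code: the record layout, RECORD_SIZE, MAGIC
and the helper names are made up for this example, and zlib's plain CRC-32
stands in for whatever checksum the real file uses. A reader that sees the
first half of a new record and the second half of the old one ends up with
new contents but the old checksum.

```python
import struct
import zlib

RECORD_SIZE = 512      # illustrative size, assumed for this sketch
MAGIC = 0x12345678     # arbitrary marker, not the real file's magic number


def build_record(filenodes):
    """Pack integers into a fixed-size record that ends in a CRC-32."""
    body = struct.pack("<II", MAGIC, len(filenodes))
    body += b"".join(struct.pack("<I", n) for n in filenodes)
    body = body.ljust(RECORD_SIZE - 4, b"\0")      # zero padding
    return body + struct.pack("<I", zlib.crc32(body))


def record_is_valid(buf):
    """Recompute the CRC over the body and compare with the stored one."""
    if len(buf) != RECORD_SIZE:
        return False
    (stored,) = struct.unpack("<I", buf[-4:])
    return zlib.crc32(buf[:-4]) == stored


old = build_record([1000, 2000, 3000])
new = build_record([1001, 2000, 3000])   # one mapping changed

# Simulate a torn read: the first half comes from the new record,
# the second half (including the checksum) from the old one.
half = RECORD_SIZE // 2
torn = new[:half] + old[half:]

print("old valid: ", record_is_valid(old))
print("new valid: ", record_is_valid(new))
print("torn valid:", record_is_valid(torn))
```

Both complete records verify, while the mixed one does not, which is the
kind of symptom the report shows.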

If that's the cause, it is easy to fix by taking the RelationMappingLock
in load_relmap_file(), like in the attached patch. But if the write is
not atomic, you might have a bigger problem: we also rely on the
atomicity when writing the pg_control file. If that becomes corrupt
because of a partial write, the server won't start up. If it's just a
race condition between the read/write, or only the read() is not atomic,
maybe pg_control is OK, but I'd like to understand the issue better
before just adding a lock to load_relmap_file().
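
As a rough sketch of the locking idea (not the attached patch, which is not
reproduced here), the following example has a writer that deliberately
updates a checksummed record in two separate writes, and readers that take
the same lock before reading. Everything in it is an assumption made for
illustration: threading.Lock is a hypothetical stand-in for
RelationMappingLock, and the record layout, RECORD_SIZE and function names
are invented. It relies on the POSIX-only os.pread() and os.pwrite().

```python
import os
import struct
import tempfile
import threading
import zlib

RECORD_SIZE = 512  # illustrative size, assumed for this sketch


def build_record(value):
    """Fixed-size record: one integer, zero padding, trailing CRC-32."""
    body = struct.pack("<I", value).ljust(RECORD_SIZE - 4, b"\0")
    return body + struct.pack("<I", zlib.crc32(body))


def record_is_valid(buf):
    if len(buf) != RECORD_SIZE:
        return False
    (stored,) = struct.unpack("<I", buf[-4:])
    return zlib.crc32(buf[:-4]) == stored


def run_demo(iterations=200, readers=4):
    """Return how many reads saw a record with a bad checksum."""
    map_lock = threading.Lock()  # hypothetical stand-in for RelationMappingLock
    failures = []                # list.append is safe across threads
    fd, path = tempfile.mkstemp()
    try:
        os.pwrite(fd, build_record(0), 0)
        half = RECORD_SIZE // 2

        def writer():
            for value in range(1, iterations + 1):
                rec = build_record(value)
                with map_lock:
                    # Two separate writes: deliberately not atomic.
                    os.pwrite(fd, rec[:half], 0)
                    os.pwrite(fd, rec[half:], half)

        def reader():
            for _ in range(iterations):
                with map_lock:   # reader waits for any update in progress
                    buf = os.pread(fd, RECORD_SIZE, 0)
                if not record_is_valid(buf):
                    failures.append(buf)

        threads = [threading.Thread(target=writer)]
        threads += [threading.Thread(target=reader) for _ in range(readers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    finally:
        os.close(fd)
        os.unlink(path)
    return len(failures)


print("bad reads with the lock held:", run_demo())
```

Because readers and the writer serialize on the same lock, a reader can never
observe the record between the two writes. Note that this only addresses the
race between reader and writer; it says nothing about whether a single
write() can be torn by a crash, which is the pg_control concern above.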

- Heikki

Attachment: lock-load_relmap_file-1.patch (text/x-patch, 3.5 KB)
