pg_resetwal regression: could not upgrade after 1d863c2504

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: "'pgsql-hackers(at)lists(dot)postgresql(dot)org'" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: 'Peter Eisentraut' <peter(at)eisentraut(dot)org>
Subject: pg_resetwal regression: could not upgrade after 1d863c2504
Date: 2023-09-29 07:39:09
Message-ID: TYAPR01MB58664AD301F511B1EA5B72B4F5C0A@TYAPR01MB5866.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

Dear hackers,
(CC: Peter Eisentraut - committer of the problematic commit)

While developing a pg_upgrade patch, I found a possible regression in pg_resetwal.
It may have been introduced by 1d863c2504.

Is it really a regression, or am I missing something?

# Phenomenon

pg_resetwal cannot be executed with a relative path. It worked at 7273945,
but fails at 1d863.

At 1d863:

```
$ pg_resetwal -n data_N1/
pg_resetwal: error: could not read permissions of directory "data_N1/": No such file or directory
```

At 7273945:

```
$ pg_resetwal -n data_N1/
Current pg_control values:

pg_control version number: 1300
Catalog version number: 202309251
...
```

# Environment

The attached script was executed on RHEL 7.9 with gcc 8.3.1.
I used the meson build system with the following options:

meson setup -Dcassert=true -Ddebug=true -Dc_args="-ggdb -O0 -g3 -fno-omit-frame-pointer"

# My analysis

I found that the part below in GetDataDirectoryCreatePerm() returns false, and
that is the cause of the error.

```
	/*
	 * If an error occurs getting the mode then return false. The caller is
	 * responsible for generating an error, if appropriate, indicating that we
	 * were unable to access the data directory.
	 */
	if (stat(dataDir, &statBuf) == -1)
		return false;
```

Also, I found that DataDir in main() holds a relative path.
Given that, the subsequent stat() may be unable to find the given location
because the process has already changed into that directory.

```
(gdb) break chdir
Breakpoint 1 at 0x4016f0
(gdb) run -n data_N1

...
Breakpoint 1, 0x00007ffff78e1390 in chdir () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-326.el7_9.x86_64
(gdb) print DataDir
$1 = 0x7fffffffe25c "data_N1"
(gdb) frame 1
#1 0x00000000004028d7 in main (argc=3, argv=0x7fffffffdf58) at ../postgres/src/bin/pg_resetwal/pg_resetwal.c:348
348 if (chdir(DataDir) < 0)
(gdb) print DataDir
$2 = 0x7fffffffe25c "data_N1"
```
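
The suspected failure mode can be reproduced outside PostgreSQL. The following is a minimal sketch, assuming a POSIX system; `stat_after_chdir` is a hypothetical helper written for illustration and is not part of pg_resetwal.

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not PostgreSQL code): chdir() into "dir", then
 * stat() the same relative path again. Returns 0 if the second lookup
 * succeeds, the errno set by stat() if it fails, or -1 if chdir()
 * itself failed.
 */
int
stat_after_chdir(const char *dir)
{
	struct stat statBuf;

	if (chdir(dir) < 0)
		return -1;

	/* "dir" is now resolved relative to the new working directory. */
	if (stat(dir, &statBuf) == -1)
		return errno;

	return 0;
}
```

When `dir` is a relative path such as "data_N1" and no entry of that name exists inside the directory itself, the second lookup fails with ENOENT, which would match the "No such file or directory" message above.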

# How to fix

One possible approach is to call chdir() several times. Please see the attached patch.
(I'm not sure whether the commit should be reverted.)
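
The contents of fix.patch are not reproduced here, so the following is only a guess at what calling chdir() several times could look like, not the patch itself. It assumes the caller saved the original working directory (for example with getcwd()) before the first chdir(); `stat_from_original_cwd` and its parameters are hypothetical names.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not PostgreSQL code): the process is already
 * inside the data directory. Go back to the saved original working
 * directory, stat() the relative path there, then re-enter the data
 * directory. Returns true and stores the permission bits in *mode on
 * success.
 */
bool
stat_from_original_cwd(const char *origCwd, const char *dir, mode_t *mode)
{
	struct stat statBuf;
	bool		ok;

	/* Return to the directory where the relative path is valid. */
	if (chdir(origCwd) < 0)
		return false;

	ok = (stat(dir, &statBuf) == 0);
	if (ok)
		*mode = statBuf.st_mode & 0777;

	/* Re-enter the data directory regardless of the stat() result. */
	if (chdir(dir) < 0)
		return false;

	return ok;
}
```

The trade-off in this sketch is the extra pair of chdir() calls and the need to remember the original directory; whether that is preferable to changing the order of the existing calls is the open question raised above.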

# Appendix - How did I find?

Originally, I found the issue when running the attached script.
It creates two clusters and runs pg_upgrade, which failed with the following output.
(I also attached the whole output; please see result_*.out.)

```
Performing Consistency Checks
-----------------------------
Checking cluster versions ok
pg_resetwal: error: could not read permissions of directory "data_N1": No such file or directory
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachment Content-Type Size
test.sh application/octet-stream 322 bytes
result_7273945ca.out application/octet-stream 4.9 KB
result_1d863c2504.out application/octet-stream 2.6 KB
fix.patch application/octet-stream 690 bytes
