Re: The ability of postgres to determine loss of files of the main fork

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Michael Banck <mbanck(at)gmx(dot)net>
Cc: Aleksander Alekseev <aleksander(at)tigerdata(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Frits Hoogland <frits(dot)hoogland(at)gmail(dot)com>
Subject: Re: The ability of postgres to determine loss of files of the main fork
Date: 2025-10-06 09:19:41
Message-ID: CAKZiRmzGgOY5iQ=CFHJVE2df2Bz599bACqoWP3V_e-3Rw+XJhA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Oct 1, 2025 at 2:15 PM Michael Banck <mbanck(at)gmx(dot)net> wrote:

> > > That might be a lot of files to open.
> >
> > I was afraid of that, but let's say a modern high-end DB is 200TB;
> > that's roughly 200*1024 1GB files. Yet I'm getting the following
> > time(1) timings for 204k files on ext4:
> >
> > $ time ./createfiles                      # real 0m2.157s (open(O_CREAT)+close())
> > $ time ls -l many_files_dir/ > /dev/null  # real 0m0.734s
> > $ time ./openfiles                        # real 0m0.297s (already existing, hot)
> > $ time ./openfiles                        # real 0m1.456s (already existing, cold; echo 3 > drop_caches)
> >
> > Not bad in my book for a one-time activity. It could potentially
> > pose a problem with high-latency open() calls, maybe on NFS or
> > something remote, I guess.
>
> Yeah, did you try on SAN as well? I am doubtful that will be performant.

OK, I guess you wanted some latency plus a big fs size. I do not have
a real SAN here, but I've tried it on noisy Azure Cloud with 4x
Standard __HDD__ disks (probably SSD/flash underneath, just throttled
way down). Each LUN is 32TB (the max), giving me a 128TB striped fs
in total:
- lvcreate -L 128T -n lv -i 4 -I 1M vgbig + mkfs.xfs
- ioping reports the following latencies there for direct I/O:
min/avg/max/mdev = 3.25 ms / 18.0 ms / 193.8 ms / 39.5 ms (yay!)
- so, as per the above, it doesn't have low latency at all
- a simple fio run (fio --name=rand_read_iops
--filename=/xfs/many_files_dir/test1 --rw=randread --bs=4k --direct=1
--ioengine=io_uring --iodepth=64 --size=4G --runtime=60 --time_based
--group_reporting) reports just IOPS=959, BW=3837KiB/s

Yet, to create/open 204k empty files to simulate that 200TB cluster on
that 128TB fs:

$ time ./createfiles # real 0m2.140s
$ time ls -l /xfs/many_files_dir/ > /dev/null # real 0m0.697s
$ time ./openfiles # real 0m0.440s (hot)
$ time ./openfiles # real 0m29.872s (cold, after echo 3 > drop_caches)
$ time ./openfiles # real 0m0.443s (hot)
$ time ./openfiles # real 0m31.050s (cold, after echo 3 > drop_caches)

$ echo 3 > /proc/sys/vm/drop_caches
$ time ls -l /xfs/many_files_dir/ > /dev/null # real 1m2.345s
$ time ./openfiles # real 0m0.437s (hot)
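
(For reference, createfiles/openfiles are nothing more than the
open(O_CREAT)+close() / open()+close() loop described above; a rough
sketch, not the exact code I ran:)

/* sketch: create or open NFILES files in one directory and close them;
 * "create" mode does open(O_CREAT)+close(), otherwise a plain open() */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NFILES 204800               /* ~200TB worth of 1GB segments */

int main(int argc, char **argv)
{
    int     create = (argc > 1 && strcmp(argv[1], "create") == 0);
    char    path[256];

    for (int i = 0; i < NFILES; i++)
    {
        snprintf(path, sizeof(path), "many_files_dir/%d", i);
        int fd = create ? open(path, O_CREAT | O_WRONLY, 0600)
                        : open(path, O_RDONLY);
        if (fd < 0)
        {
            perror(path);
            exit(1);
        }
        close(fd);
    }
    return 0;
}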

Anyway, I do not know if opening all the files on startup (or just
during crash recovery?) is the proper way, but given the option I
would personally take +30s to open a 200TB database any day over the
risk of silent data loss and wrong results. I'm sure, however, that
ignoring missing files on access is certainly not how things should
look, and I would speculate that files may even go missing while the
cluster is up & running without you being notified in any way (due to
human error: imagine someone accidentally rsyncing in the wrong
[opposite] direction, or something like that).
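
And just to illustrate what I mean by "opening all the files": a
hedged sketch along these lines, where check_main_fork() is a
hypothetical helper (not PostgreSQL's real smgr API) and enumerating
the relations and their expected segment counts is hand-waved:

/* hedged sketch: verify every expected 1GB segment of a relation's
 * main fork (path, path.1, path.2, ...) can at least be open()ed */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int
check_main_fork(const char *relpath, int nsegments)
{
    char    seg[1024];

    for (int i = 0; i < nsegments; i++)
    {
        if (i == 0)
            snprintf(seg, sizeof(seg), "%s", relpath);
        else
            snprintf(seg, sizeof(seg), "%s.%d", relpath, i);

        int fd = open(seg, O_RDONLY);
        if (fd < 0)
        {
            fprintf(stderr, "missing or unreadable segment: %s\n", seg);
            return -1;          /* here startup would refuse to proceed */
        }
        close(fd);
    }
    return 0;
}

int main(void)
{
    /* e.g. a ~2.5GB relation: segments 16385, 16385.1, 16385.2 */
    return check_main_fork("base/16384/16385", 3) ? 1 : 0;
}

Whether that would run on every startup or only after crash recovery
is exactly the open question above.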

-J.
