Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Christoph Berg <myon(at)debian(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call
Date: 2023-04-24 15:58:55
Message-ID: 20230424155855.roatu3odubmue4i2@liskov
Lists: pgsql-committers pgsql-hackers

On Mon, Apr 24, 2023 at 10:53:35AM +0200, Christoph Berg wrote:
> Re: Andres Freund
> > Add smgrzeroextend(), FileZero(), FileFallocate()
>
> Hi,
>
> I'm often seeing PG16 builds erroring out in the pgbench tests:
>
> 00:33:12 make[2]: Entering directory '/<<PKGBUILDDIR>>/build/src/bin/pgbench'
> 00:33:12 echo "# +++ tap check in src/bin/pgbench +++" && rm -rf '/<<PKGBUILDDIR>>/build/src/bin/pgbench'/tmp_check && /bin/mkdir -p '/<<PKGBUILDDIR>>/build/src/bin/pgbench'/tmp_check && cd /<<PKGBUILDDIR>>/build/../src/bin/pgbench && TESTLOGDIR='/<<PKGBUILDDIR>>/build/src/bin/pgbench/tmp_check/log' TESTDATADIR='/<<PKGBUILDDIR>>/build/src/bin/pgbench/tmp_check' PATH="/<<PKGBUILDDIR>>/build/tmp_install/usr/lib/postgresql/16/bin:/<<PKGBUILDDIR>>/build/src/bin/pgbench:$PATH" LD_LIBRARY_PATH="/<<PKGBUILDDIR>>/build/tmp_install/usr/lib/aarch64-linux-gnu" PGPORT='65432' top_builddir='/<<PKGBUILDDIR>>/build/src/bin/pgbench/../../..' PG_REGRESS='/<<PKGBUILDDIR>>/build/src/bin/pgbench/../../../src/test/regress/pg_regress' /usr/bin/prove -I /<<PKGBUILDDIR>>/build/../src/test/perl/ -I /<<PKGBUILDDIR>>/build/../src/bin/pgbench --verbose t/*.pl
> 00:33:12 # +++ tap check in src/bin/pgbench +++
> 00:33:14 # Failed test 'concurrent OID generation status (got 2 vs expected 0)'
> 00:33:14 # at t/001_pgbench_with_server.pl line 31.
> 00:33:14 # Failed test 'concurrent OID generation stdout /(?^:processed: 125/125)/'
> 00:33:14 # at t/001_pgbench_with_server.pl line 31.
> 00:33:14 # 'pgbench (16devel (Debian 16~~devel-1.pgdg100+~20230423.1656.g8bbd0cc))
> 00:33:14 # transaction type: /<<PKGBUILDDIR>>/build/src/bin/pgbench/tmp_check/t_001_pgbench_with_server_main_data/001_pgbench_concurrent_insert
> 00:33:14 # scaling factor: 1
> 00:33:14 # query mode: prepared
> 00:33:14 # number of clients: 5
> 00:33:14 # number of threads: 1
> 00:33:14 # maximum number of tries: 1
> 00:33:14 # number of transactions per client: 25
> 00:33:14 # number of transactions actually processed: 118/125
> 00:33:14 # number of failed transactions: 0 (0.000%)
> 00:33:14 # latency average = 26.470 ms
> 00:33:14 # initial connection time = 66.583 ms
> 00:33:14 # tps = 188.889760 (without initial connection time)
> 00:33:14 # '
> 00:33:14 # doesn't match '(?^:processed: 125/125)'
> 00:33:14 # Failed test 'concurrent OID generation stderr /(?^:^$)/'
> 00:33:14 # at t/001_pgbench_with_server.pl line 31.
> 00:33:14 # 'pgbench: error: client 2 script 0 aborted in command 0 query 0: ERROR: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call
> 00:33:14 # HINT: Check free disk space.
> 00:33:14 # pgbench: error: Run was aborted; the above results are incomplete.
> 00:33:14 # '
> 00:33:14 # doesn't match '(?^:^$)'
> 00:33:26 # Looks like you failed 3 tests of 428.
> 00:33:26 t/001_pgbench_with_server.pl ..
> 00:33:26 not ok 1 - concurrent OID generation status (got 2 vs expected 0)
>
> I don't think the disk is full, since it always fails at the same
> spot, on some of the builds:
>
> https://pgdgbuild.dus.dg-i.net/job/postgresql-16-binaries-snapshot/833/
>
> This is overlayfs with tmpfs (upper) over ext4 (lower). Running that
> test manually works, though, and the FS seems to support posix_fallocate:
>
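> #include <fcntl.h>
> #include <stdio.h>
> #include <string.h>
> #include <unistd.h>
>
> int main(void)
> {
>     int f;
>     int err;
>
>     /* open() returns -1 on failure, not 0 */
>     if ((f = open("moo", O_CREAT | O_RDWR, 0666)) < 0)
>     {
>         perror("open");
>         return 1;
>     }
>
>     /* posix_fallocate() returns an error number instead of setting errno */
>     err = posix_fallocate(f, 0, 10);
>     printf("posix_fallocate: %s\n", strerror(err));
>
>     close(f);
>     return 0;
> }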
>
> $ ./a.out
> posix_fallocate: Success
>
> The problem has been there for some weeks. I didn't report it earlier
> because I was on vacation, spent my free time trying to bootstrap s390x
> support for apt.pg.o, and there was another direct I/O problem making
> all the builds fail for a while.

I noticed that dsm_impl_posix_resize() retries posix_fallocate() in a
do-while loop as long as rc == EINTR, and FileFallocate() doesn't. From
the comment in dsm_impl_posix_resize() and some cursory googling,
posix_fallocate() is not restarted automatically after a signal on most
systems, so a do { ... } while (rc == EINTR) loop is commonly used. Is
there a reason FileFallocate() doesn't do the same, I wonder?
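A minimal sketch of the retry pattern described above, assuming a POSIX system with posix_fallocate(). The names fallocate_retry_eintr() and reserve_file() are hypothetical and used only for illustration; this is not PostgreSQL's FileFallocate() or the actual dsm_impl_posix_resize() code, just the same kind of EINTR loop applied to a plain file descriptor.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not PostgreSQL's FileFallocate()): reserve the byte
 * range [offset, offset + len) in fd, retrying while the call is interrupted
 * by a signal. posix_fallocate() reports failure through its return value
 * and does not set errno, so the loop tests the return code.
 */
static int
fallocate_retry_eintr(int fd, off_t offset, off_t len)
{
    int rc;

    /* The Linux man page documents EINVAL for offset < 0 or len <= 0 */
    assert(offset >= 0 && len > 0);

    do
    {
        rc = posix_fallocate(fd, offset, len);
    } while (rc == EINTR);

    return rc;                  /* 0 on success, else an errno value such as ENOSPC */
}

/*
 * Hypothetical usage: create (or truncate) path, reserve len bytes, and
 * return the resulting file size, or -1 on any error.
 */
static off_t
reserve_file(const char *path, off_t len)
{
    struct stat st;
    int fd;
    int rc;

    fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);
    if (fd < 0)
        return -1;

    rc = fallocate_retry_eintr(fd, 0, len);
    if (rc != 0 || fstat(fd, &st) != 0)
    {
        close(fd);
        return -1;
    }

    close(fd);
    return st.st_size;
}
```

Only EINTR is retried; any other error, such as ENOSPC, is returned to the caller unchanged, so the "Check free disk space" hint would still apply to genuine out-of-space failures.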

- Melanie
