Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Maciek Sakrejda <m(dot)sakrejda(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Date: 2022-08-02 21:18:19
Message-ID: CA+TgmobZufK6PcCRpvjP6dJqOBO-woWZF0cqMHWGvzkbgeaBKw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Aug 2, 2022 at 1:50 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> Unfortunately, that isn't very consistent, and you have have to run it a bunch
> of times...

I was eventually able to reproduce this in part by using the
interactive psql method you describe. It didn't crash, but it did spit
out a bunch of funny error messages:

postgres=# SET statement_timeout=0; DROP DATABASE a; SET
statement_timeout='60ms'; CREATE DATABASE a TEMPLATE postgres STRATEGY
wal_log ; \c a \c postgres
SET
ERROR: database "a" does not exist
SET
ERROR: canceling statement due to statement timeout
WARNING: problem in alloc set PortalContext: req size > alloc size
for chunk 0x7f99508911f0 in block 0x7f9950890800
WARNING: problem in alloc set PortalContext: bad size 0 for chunk
0x7f99508911f0 in block 0x7f9950890800
WARNING: problem in alloc set PortalContext: bad single-chunk
0x7f9950891208 in block 0x7f9950890800
WARNING: problem in alloc set PortalContext: found inconsistent
memory block 0x7f9950890800
WARNING: problem in alloc set PortalContext: req size > alloc size
for chunk 0x7f99508911f0 in block 0x7f9950890800
WARNING: problem in alloc set PortalContext: bad size 0 for chunk
0x7f99508911f0 in block 0x7f9950890800
WARNING: problem in alloc set PortalContext: bad single-chunk
0x7f9950891208 in block 0x7f9950890800
WARNING: problem in alloc set PortalContext: found inconsistent
memory block 0x7f9950890800
connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL:
database "a" does not exist
Previous connection kept
postgres=# select * from pg_database;
oid | datname | datdba | encoding | datlocprovider | datistemplate
| datallowconn | datconnlimit | datfrozenxid | datminmxid |
dattablespace | datcollate | datctype | daticulocale |
datcollversion | datacl
-----+-----------+--------+----------+----------------+---------------+--------------+--------------+--------------+------------+---------------+-------------+-------------+--------------+----------------+----------------------------
5 | postgres | 10 | 6 | c | f
| t | -1 | 718 | 1 |
1663 | en_US.UTF-8 | en_US.UTF-8 | | |
1 | template1 | 10 | 6 | c | t
| t | -1 | 718 | 1 |
1663 | en_US.UTF-8 | en_US.UTF-8 | | |
{=c/rhaas,rhaas=CTc/rhaas}
4 | template0 | 10 | 6 | c | t
| f | -1 | 718 | 1 |
1663 | en_US.UTF-8 | en_US.UTF-8 | | |
{=c/rhaas,rhaas=CTc/rhaas}
(3 rows)

I then set backtrace_functions='AllocSetCheck' and reproduced it
again, which led to stack traces like this:

2022-08-02 16:50:32.490 EDT [98814] WARNING: problem in alloc set
PortalContext: bad single-chunk 0x7f9950886608 in block 0x7f9950885c00
2022-08-02 16:50:32.490 EDT [98814] BACKTRACE:
2 postgres 0x000000010cd37ef5 AllocSetCheck + 549
3 postgres 0x000000010cd37730 AllocSetReset + 48
4 postgres 0x000000010cd3f6f1
MemoryContextResetOnly + 81
5 postgres 0x000000010cd378b9 AllocSetDelete + 73
6 postgres 0x000000010cd41e09 PortalDrop + 425
7 postgres 0x000000010cd427bb
AtCleanup_Portals + 203
8 postgres 0x000000010c86476d
CleanupTransaction + 29
9 postgres 0x000000010c865d4f
AbortCurrentTransaction + 63
10 postgres 0x000000010cba1395 PostgresMain + 885
11 postgres 0x000000010caf5472 PostmasterMain + 7586
12 postgres 0x000000010ca31e3d main + 2205
13 libdyld.dylib 0x00007fff699afcc9 start + 1
14 ??? 0x0000000000000001 0x0 + 1

I recompiled with -O0 and hacked the code that emits the BACKTRACE:
bit to go into an infinite loop if it's hit, which enabled me to hook
up a debugger at the point of the failure. The debugger says:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
frame #0: 0x000000010e98a157
postgres`send_message_to_server_log(edata=0x000000010ec0f658) at
elog.c:2916:4
frame #1: 0x000000010e9866d6 postgres`EmitErrorReport at elog.c:1537:3
frame #2: 0x000000010e986016 postgres`errfinish(filename="aset.c",
lineno=1470, funcname="AllocSetCheck") at elog.c:592:2
frame #3: 0x000000010e9c8465
postgres`AllocSetCheck(context=0x00007ff77c80d200) at aset.c:1469:5
frame #4: 0x000000010e9c7c05
postgres`AllocSetDelete(context=0x00007ff77c80d200) at aset.c:638:2
frame #5: 0x000000010e9d368b
postgres`MemoryContextDelete(context=0x00007ff77c80d200) at
mcxt.c:252:2
* frame #6: 0x000000010e9d705b
postgres`PortalDrop(portal=0x00007ff77e028920, isTopCommit=false) at
portalmem.c:596:2
frame #7: 0x000000010e9d7e0e postgres`AtCleanup_Portals at portalmem.c:907:3
frame #8: 0x000000010e22030d postgres`CleanupTransaction at xact.c:2890:2
frame #9: 0x000000010e2219da postgres`AbortCurrentTransaction at
xact.c:3328:4
frame #10: 0x000000010e763237
postgres`PostgresMain(dbname="postgres", username="rhaas") at
postgres.c:4232:3
frame #11: 0x000000010e6625aa
postgres`BackendRun(port=0x00007ff77c1042c0) at postmaster.c:4490:2
frame #12: 0x000000010e661b18
postgres`BackendStartup(port=0x00007ff77c1042c0) at
postmaster.c:4218:3
frame #13: 0x000000010e66088a postgres`ServerLoop at postmaster.c:1808:7
frame #14: 0x000000010e65def2 postgres`PostmasterMain(argc=1,
argv=0x00007ff77ae05cf0) at postmaster.c:1480:11
frame #15: 0x000000010e50521f postgres`main(argc=1,
argv=0x00007ff77ae05cf0) at main.c:197:3
frame #16: 0x00007fff699afcc9 libdyld.dylib`start + 1
(lldb) fr sel 6
frame #6: 0x000000010e9d705b
postgres`PortalDrop(portal=0x00007ff77e028920, isTopCommit=false) at
portalmem.c:596:2
593 MemoryContextDelete(portal->holdContext);
594
595 /* release subsidiary storage */
-> 596 MemoryContextDelete(portal->portalContext);
597
598 /* release portal struct (it's in TopPortalContext) */
599 pfree(portal);
(lldb) fr sel 3
frame #3: 0x000000010e9c8465
postgres`AllocSetCheck(context=0x00007ff77c80d200) at aset.c:1469:5
1466 * Check chunk size
1467 */
1468 if (dsize > chsize)
-> 1469 elog(WARNING, "problem in alloc set %s: req size > alloc size
for chunk %p in block %p",
1470 name, chunk, block);
1471 if (chsize < (1 << ALLOC_MINBITS))
1472 elog(WARNING, "problem in alloc set %s: bad size %zu for chunk
%p in block %p",
(lldb) p dsize
(Size) $3 = 20
(lldb) p chsize
(Size) $4 = 0

It seems like CreateDatabaseUsingWalLog() must be doing something that
corrupts PortalContext, but at the moment I'm not sure what that thing
could be.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jacob Champion 2022-08-02 21:26:28 Re: [PATCH] minor reloption regression tests improvement
Previous Message Jacob Champion 2022-08-02 21:14:08 Re: jsonpath syntax extensions