elog/ereport VS misleading backtrace_function function address

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: elog/ereport VS misleading backtrace_function function address
Date: 2024-03-28 12:01:28
Message-ID: CAKZiRmwEojZ1awF2TD_URwHgOJd5Bx4NKaiTSiQM32NDNqODSQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi -hackers,

While chasing some other bug I've learned that backtrace_functions
might be misleading with top elog/ereport() address.

Reproducer:

# using Tom's reproducer on master:
wget https://www.postgresql.org/message-id/attachment/112394/ri-collation-bug-example.sql
echo '' >> ri-collation-bug-example.sql
echo '\errverbose' >> ri-collation-bug-example.sql
-- based on grepping the source code locations many others could go in here:
psql -p 5437 -c "alter system set backtrace_functions =
'RI_FKey_cascade_del,get_collation_isdeterministic';"
psql -p 5437 -c "select pg_reload_conf();"
dropdb -p 5437 test1 ; createdb -p 5437 test1 ; psql -p 5437 -d test1
-f ri-collation-bug-example.sql

gives (note "get_collation_isdeterministic"):
psql:ri-collation-bug-example.sql:42: ERROR: cache lookup failed
for collation 0
psql:ri-collation-bug-example.sql:44: error: ERROR: XX000: cache
lookup failed for collation 0
LOCATION: get_collation_isdeterministic, lsyscache.c:1088

and in log note the 0x14c6bb:
2024-03-27 14:39:13.065 CET [12899] postgres(at)test1 ERROR: cache
lookup failed for collation 0
2024-03-27 14:39:13.065 CET [12899] postgres(at)test1 BACKTRACE:
postgres: 16/main: postgres test1 [local] DELETE(+0x14c6bb)
[0x5565c5a9c6bb]
postgres: 16/main: postgres test1 [local]
DELETE(RI_FKey_cascade_del+0x323) [0x5565c5ec0973]
postgres: 16/main: postgres test1 [local] DELETE(+0x2d401f)
[0x5565c5c2401f]
postgres: 16/main: postgres test1 [local] DELETE(+0x2d5a04)
[0x5565c5c25a04]
postgres: 16/main: postgres test1 [local]
DELETE(AfterTriggerEndQuery+0x78) [0x5565c5c2a918]
[..]
2024-03-27 14:39:13.065 CET [12899] postgres(at)test1 STATEMENT: delete
from revisions where id='5c617ce7-688d-4bea-9d66-c0f0ebc635da';

based on \errverbose it is OK (error matches the code, thanks to
Alvaro for this hint):

9 bool
8 get_collation_isdeterministic(Oid colloid)
7 {
6 │ HeapTuple> tp;
5 │ Form_pg_collation colltup;
4 │ bool> > result;
3 │
2 │ tp = SearchSysCache1(COLLOID, ObjectIdGetDatum(colloid));
1 │ if (!HeapTupleIsValid(tp))
1088 │ │ elog(ERROR, "cache lookup failed for collation %u", colloid);
[..]

but based on backtrace address 0x14c6bb (!) and it resolves
differently which seems to be highly misleading (note the
"get_language_name.cold"):

$ addr2line -e /usr/lib/postgresql/16/bin/postgres -a -f 0x14c6bb
0x000000000014c6bb
get_language_name.cold
./build/src/backend/utils/cache/./build/../src/backend/utils/cache/lsyscache.c:1181

When i disassemble the get_collation_isdeterministic() (and I know the
name from \errverbose):

Dump of assembler code for function get_collation_isdeterministic:
Address range 0x5c7680 to 0x5c76c1:
0x00000000005c7680 <+0>: push %rbp
[..]
0x00000000005c7694 <+20>: call 0x5d63b0 <SearchSysCache1>
0x00000000005c7699 <+25>: test %rax,%rax
0x00000000005c769c <+28>: je 0x14c686
<get_collation_isdeterministic.cold>
0x00000000005c76a2 <+34>: mov %rax,%rdi
[..]
0x00000000005c76bf <+63>: leave
0x00000000005c76c0 <+64>: ret
Address range 0x14c686 to 0x14c6bb:
0x000000000014c686 <-4698106>: xor %esi,%esi
0x000000000014c688 <-4698104>: mov $0x15,%edi
0x000000000014c68d <-4698099>: call 0x14ec86 <errstart_cold>
0x000000000014c692 <-4698094>: mov %r12d,%esi
0x000000000014c695 <-4698091>: lea 0x5028dc(%rip),%rdi
# 0x64ef78
0x000000000014c69c <-4698084>: xor %eax,%eax
0x000000000014c69e <-4698082>: call 0x5df320 <errmsg_internal>
0x000000000014c6a3 <-4698077>: lea 0x6311a6(%rip),%rdx
# 0x77d850 <__func__.34>
0x000000000014c6aa <-4698070>: mov $0x440,%esi
0x000000000014c6af <-4698065>: lea 0x630c8a(%rip),%rdi
# 0x77d340
0x000000000014c6b6 <-4698058>: call 0x5df100 <errfinish>
<< NOTE the exact 0x14c6bb is even missing here(!)

notice the interesting thing here: according to GDB
get_collation_isdeterministic() is @ 0x5c7680 + jump to 0x14c686
<get_collation_isdeterministic.cold> till 0x14c6bb (but without it)
without any stack setup for that .cold. But backtrace() just captured
the elog/ereport (cold) at the end of 0x14c6bb instead, so if I take
that exact address from backtrace_functions output as it is
("DELETE(+0x14c6bb)"), it also shows WRONG function (just like
addr2line):

(gdb) disassemble 0x14c6bb
Dump of assembler code for function get_language_name:
Address range 0x5c7780 to 0x5c77ee:
[..]
0x00000000005c77ed <+109>: ret
Address range 0x14c6bb to 0x14c6f0:
0x000000000014c6bb <-4698309>: xor %esi,%esi
[..]
0x000000000014c6eb <-4698261>: call 0x5df100 <errfinish>
End of assembler dump.

so this is wrong (as are failing in "get_collation_isdeterministic"
not in "get_language_name").

I was very confused at the beginning with the main question being: why
are we compiling elog/ereport() so that it is incompatible with
backtrace? When looking for it I've found two threads [1] by David
that were actually helpful in understanding that this was done for
performance (unlikley(), cold attribute and similiar type of
discussions). Now my thought is that for >= ERROR in ereport_domain we
could add something more (?) before pg_unreachable() that would help
the backtrace to resolve the symbol correctly

If I try non-portable PoC with x86 nop instruction:

--- a/src/include/utils/elog.h
+++ b/src/include/utils/elog.h
@@ -145,8 +145,10 @@ struct Node;
errstart_cold(elevel, domain) : \
errstart(elevel, domain)) \
__VA_ARGS__, errfinish(__FILE__, __LINE__, __func__); \
- if (__builtin_constant_p(elevel) && (elevel) >= ERROR) \
+ if (__builtin_constant_p(elevel) && (elevel) >= ERROR) { \
+ __asm__ __volatile__("nop"); \
pg_unreachable(); \
+ } \

it partially works and the address can be now properly resolved!

Somewhat backtrace() still is unable to lookup the to do so in log itself:
postgres: postgres test1 [local] DELETE(+0x15f84c) [0x55703bf5284c]
postgres: postgres test1 [local]
DELETE(RI_FKey_cascade_del+0x2b0) [0x55703c34c850]

but at least:

postgres(at)hive:~$ addr2line -e /usr/pgsql17/bin/postgres -a -f 0x15f84c
0x000000000015f84c
get_collation_isdeterministic
/git/postgres/src/backend/utils/cache/lsyscache.c:1062 (discriminator 4)
postgres(at)hive:~$

in gdb it will point right to our new nop:
0x000000000015f840 <-4490944>: lea 0x608306(%rip),%rdi
# 0x767b4d
0x000000000015f847 <-4490937>: call 0x5be1d0 <errfinish>
0x000000000015f84c <-4490932>: nop

Thoughts? Does it make sense to post a patch (for pg18?)? How to do it
in $arch-independent way? I've tried also to try to find a way with
e.g. -rdynamic to show that real function name, but it looks like
without some more serious unwind library it seems unrealistic (?) to
resolve that get_collation_isdeterministic.cold

-J.

[1] - https://www.postgresql.org/message-id/flat/CAApHDvoWoxtbWiqZxrhO%2Bi9NoG56AWHDzuDDD%2B1OEf4PxzFhig%40mail.gmail.com#611566bd6fd06f27e51dbc3148a673d3
[2] - https://www.postgresql.org/message-id/flat/CAKJS1f8yqRW3qx2CO9r4bqqvA2Vx68%3D3awbh8CJWTP9zXeoHMw%40mail.gmail.com

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2024-03-28 12:03:15 Re: Support logical replication of DDLs
Previous Message Andrey M. Borodin 2024-03-28 12:00:46 Re: Support logical replication of DDLs