Re: elog/ereport VS misleading backtrace_function function address

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: elog/ereport VS misleading backtrace_function function address
Date: 2024-05-07 07:43:36
Message-ID: CAKZiRmyn6Zrr3fAdOeEnSwNA7cqT8hRSFFwQZ9vDx1boZmukkQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Tom and -hackers!

On Thu, Mar 28, 2024 at 7:36 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com> writes:
> > While chasing some other bug I've learned that backtrace_functions
> > might be misleading with top elog/ereport() address.
>
> That was understood from the beginning: this type of backtrace is
> inherently pretty imprecise, and I doubt there is much that can
> be done to make it better. IIRC the fundamental problem is it only
> looks at global symbols, so static functions inherently defeat it.
> It was argued that this is better than nothing, which is true, but
> you have to take the info with a mountain of salt.

OK, point taken, thanks for describing the limitations, still I find
backtrace_functions often the best thing we have primarily due its
simplicity and ease of use (for customers). I found out simplest
portable way to generate NOP (or any other instruction that makes the
problem go away):

with the reproducer, psql returns:

psql:ri-collation-bug-example.sql:48: error: ERROR: XX000: cache
lookup failed for collation 0
LOCATION: get_collation_isdeterministic, lsyscache.c:1062

logfile without patch:

2024-05-07 09:05:43.043 CEST [44720] ERROR: cache lookup failed for collation 0
2024-05-07 09:05:43.043 CEST [44720] BACKTRACE:
postgres: postgres postgres [local] DELETE(+0x155883) [0x55ce5a86a883]
postgres: postgres postgres [local]
DELETE(RI_FKey_cascade_del+0x2b0) [0x55ce5ac6dfd0]
postgres: postgres postgres [local] DELETE(+0x2d488b) [0x55ce5a9e988b]
[..]

$ addr2line -e /usr/pgsql18/bin/postgres -a -f 0x155883
0x0000000000155883
get_constraint_type.cold <<<<<< so it's wrong as the real function
should be get_collation_isdeterministic

logfile with attached patch:

2024-05-07 09:11:06.596 CEST [51185] ERROR: cache lookup failed for collation 0
2024-05-07 09:11:06.596 CEST [51185] BACKTRACE:
postgres: postgres postgres [local] DELETE(+0x168bf0) [0x560e1cdfabf0]
postgres: postgres postgres [local]
DELETE(RI_FKey_cascade_del+0x2b0) [0x560e1d200c00]
postgres: postgres postgres [local] DELETE(+0x2e90fb) [0x560e1cf7b0fb]
[..]

$ addr2line -e /usr/pgsql18/bin/postgres -a -f 0x168bf0
0x0000000000168bf0
get_collation_isdeterministic.cold <<< It's ok with the patch

NOTE: in case one will be testing this: one cannot ./configure with
--enable-debug as it prevents the compiler optimizations that actually
end up with the ".cold" branch optimizations that cause backtrace() to
return the wrong address.

> I recall speculating about whether we could somehow invoke gdb
> to get a more comprehensive and accurate backtrace, but I don't
> really have a concrete idea how that could be made to work.

Oh no, I'm more about micro-fix rather than doing some bigger
revolution. The patch simply adds this one instruction in macro, so
that now backtrace_functions behaves better:

0x0000000000773d28 <+105>: call 0x79243f <errfinish>
0x0000000000773d2d <+110>: movzbl -0x12(%rbp),%eax << this ends
up being added by the patch
0x0000000000773d31 <+114>: call 0xdc1a0 <abort(at)plt>

I'll put that as for PG18 in CF, but maybe it could be backpatched too
- no hard opinion on that (I don't see how that ERROR/FATAL path could
cause any performance regressions)

-J.

Attachment Content-Type Size
v1-0001-Tweak-ereport-so-that-it-generates-proper-address.patch application/octet-stream 1.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro Horiguchi 2024-05-07 08:17:24 Re: cataloguing NOT NULL constraints
Previous Message Anton A. Melnikov 2024-05-07 07:37:27 Re: psql: fix variable existence tab completion