| From: | Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> |
|---|---|
| To: | Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz> |
| Subject: | Re: Use-after-free issue in postgres_fdw |
| Date: | 2026-03-20 08:27:21 |
| Message-ID: | A6EA4C4C-37ED-42F2-9FF5-01E815D8083F@gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
> On Mar 19, 2026, at 22:56, Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com> wrote:
>
> Hi,
>
> I got an offline report from my colleague Zhibai Song that
> close_cursor() is called for a freed PGconn, leading to a server
> crash. Here is a reproducer (the original reproducer he provided is a
> bit complex, so I simplified it):
>
> create server loopback
> foreign data wrapper postgres_fdw
> options (dbname 'postgres');
> create user mapping for current_user server loopback;
> create table t1 (id int, data text);
> create foreign table ft1 (id int, data text)
> server loopback options (table_name 't1');
> insert into ft1 values (1, 'foo');
> start transaction;
> -- This caches the remote connection's PGconn in PgFdwScanState
> declare c1 cursor for select * from ft1;
> fetch c1;
> id | data
> ----+------
> 1 | foo
> (1 row)
>
> savepoint s1;
> select * from ft1;
> id | data
> ----+------
> 1 | foo
> (1 row)
>
> select pid from pg_stat_activity
> where datname = 'postgres'
> and application_name = 'postgres_fdw';
> pid
> -------
> 91853
> (1 row)
>
> -- This terminates the remote session
> select pg_terminate_backend(91853);
> pg_terminate_backend
> ----------------------
> t
> (1 row)
>
> -- This leaves the remote connection's changing_xact_state as true
> rollback to s1;
> savepoint s1;
> -- This calls pgfdw_reject_incomplete_xact_state_change(), freeing
> -- the remote connection's PGconn as changing_xact_state is true
> select * from ft1;
> ERROR: connection to server "loopback" was lost
> rollback to s1;
> -- This calls close_cursor() on the PGconn cached in PgFdwScanState,
> -- which was freed above, leading to a server crash
> close c1;
>
> I think the root cause is that it is too early to free the PGconn in
> pgfdw_reject_incomplete_xact_state_change() even if the connection is
> in a state where we cannot use it any further; I think we should delay
> that until abort cleanup (ie, pgfdw_xact_callback()). Attached is a
> patch for that.
>
> Best regards,
> Etsuro Fujita
> <fix-connection-handling-in-postgres-fdw.patch>
Hi Etsuro-san,
I can reproduce the server crash following your procedure, and I traced the problem.
The issue is that, during select * from ft1, pgfdw_reject_incomplete_xact_state_change() calls disconnect_pg_server(), which destroys conn and sets ConnCacheEntry->conn = NULL, but does not update PgFdwScanState->conn. As a result, when "close c1" is executed later, PgFdwScanState->conn points to stale memory with random contents.
I am not sure we should still allow further commands to run after select * from ft1, given that it has already raised: "ERROR: connection to server "loopback" was lost”. Maybe we should not keep going as if the connection were still there.
I am not very familiar with the FDW code, so I am not ready to suggest a concrete fix. But it seems wrong to let later paths keep using PgFdwScanState->conn after select * from ft1 has already failed with connection loss. My guess is that we either need to invalidate all dependent state when disconnect_pg_server() runs, or otherwise prevent later cleanup paths from touching the cached PGconn *.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Ashutosh Sharma | 2026-03-20 08:38:18 | Re: synchronized_standby_slots behavior inconsistent with quorum-based synchronous replication |
| Previous Message | Amit Langote | 2026-03-20 08:20:04 | Re: Eliminating SPI / SQL from some RI triggers - take 3 |