Re: Use-after-free issue in postgres_fdw

From: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
To: Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>
Subject: Re: Use-after-free issue in postgres_fdw
Date: 2026-03-20 08:27:21
Message-ID: A6EA4C4C-37ED-42F2-9FF5-01E815D8083F@gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Mar 19, 2026, at 22:56, Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com> wrote:
>
> Hi,
>
> I got an offline report from my colleague Zhibai Song that
> close_cursor() is called for a freed PGconn, leading to a server
> crash. Here is a reproducer (the original reproducer he provided is a
> bit complex, so I simplified it):
>
> create server loopback
> foreign data wrapper postgres_fdw
> options (dbname 'postgres');
> create user mapping for current_user server loopback;
> create table t1 (id int, data text);
> create foreign table ft1 (id int, data text)
> server loopback options (table_name 't1');
> insert into ft1 values (1, 'foo');
> start transaction;
> -- This caches the remote connection's PGconn in PgFdwScanState
> declare c1 cursor for select * from ft1;
> fetch c1;
> id | data
> ----+------
> 1 | foo
> (1 row)
>
> savepoint s1;
> select * from ft1;
> id | data
> ----+------
> 1 | foo
> (1 row)
>
> select pid from pg_stat_activity
> where datname = 'postgres'
> and application_name = 'postgres_fdw';
> pid
> -------
> 91853
> (1 row)
>
> -- This terminates the remote session
> select pg_terminate_backend(91853);
> pg_terminate_backend
> ----------------------
> t
> (1 row)
>
> -- This leaves the remote connection's changing_xact_state as true
> rollback to s1;
> savepoint s1;
> -- This calls pgfdw_reject_incomplete_xact_state_change(), freeing
> -- the remote connection's PGconn as changing_xact_state is true
> select * from ft1;
> ERROR: connection to server "loopback" was lost
> rollback to s1;
> -- This calls close_cursor() on the PGconn cached in PgFdwScanState,
> -- which was freed above, leading to a server crash
> close c1;
>
> I think the root cause is that it is too early to free the PGconn in
> pgfdw_reject_incomplete_xact_state_change() even if the connection is
> in a state where we cannot use it any further; I think we should delay
> that until abort cleanup (ie, pgfdw_xact_callback()). Attached is a
> patch for that.
>
> Best regards,
> Etsuro Fujita
> <fix-connection-handling-in-postgres-fdw.patch>

Hi Etsuro-san,

I can reproduce the server crash following your procedure, and I traced the problem.

The issue is that, during select * from ft1, pgfdw_reject_incomplete_xact_state_change() calls disconnect_pg_server(), which destroys conn and sets ConnCacheEntry->conn = NULL, but does not update PgFdwScanState->conn. As a result, when "close c1" is executed later, PgFdwScanState->conn points to stale memory with random contents.

I am not sure we should still allow further commands to run after select * from ft1, given that it has already raised: "ERROR: connection to server "loopback" was lost”. Maybe we should not keep going as if the connection were still there.

I am not very familiar with the FDW code, so I am not ready to suggest a concrete fix. But it seems wrong to let later paths keep using PgFdwScanState->conn after select * from ft1 has already failed with connection loss. My guess is that we either need to invalidate all dependent state when disconnect_pg_server() runs, or otherwise prevent later cleanup paths from touching the cached PGconn *.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Sharma 2026-03-20 08:38:18 Re: synchronized_standby_slots behavior inconsistent with quorum-based synchronous replication
Previous Message Amit Langote 2026-03-20 08:20:04 Re: Eliminating SPI / SQL from some RI triggers - take 3