Re: Bypassing cursors in postgres_fdw to enable parallel plans

From: Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andy Fan <zhihuifan1213(at)163(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Bypassing cursors in postgres_fdw to enable parallel plans
Date: 2025-11-14 16:06:08
Message-ID: CA+FpmFfKtaRHcZ=tVQPRt0qt7xMGCotgbWZ9H0fpLoKA8=ZJjA@mail.gmail.com
Lists: pgsql-hackers

On Fri, 10 Oct 2025 at 22:02, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Mon, Sep 29, 2025 at 10:51 AM Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com> wrote:
> > I am back at this work with a rebased and revised patch. The new version is rebased and has a change in approach.
> > Whenever we are using non-cursor mode, for the first cursor we are always saving the tuples
> > in the tuplestore, this is because we do not have any means to know beforehand how many cursors are required for the query.
>
> This might have the advantage of being simpler, but it's definitely
> worse. If we're only fetching one result set, which will be common,
> we'll buffer the whole thing in a tuplestore where that could
> otherwise be avoided. Maybe it's still best to go with this approach;
> not sure.
>
> > And when we switch to the next query then we do not have a way to fetch the tuples for the previous query.
> > So, the tuples retrieved earlier for the first query were lost if not saved.
> > I would highly appreciate your time and feedback for this.
>
> My suggestions are to work on the following areas:
>
> 1. Automated testing. The patch has no regression tests, and won't get
> committed without those.
>
> 2. Manual testing. How does the performance with this new option
> compare to the existing method? The answer might be different for
> small result sets that fit in memory and large ones that spill to
> disk, and will certainly also depend on how productive parallel query
> can be on the remote side; but I think you want to do and post on this
> list some measurements showing the best and worst case for the patch.
>
> 3. Patch clean-up. There are plenty of typos and whitespace-only
> changes in the patch. It's best to clean those up. Running pgindent is
> a good idea too. Some places could use comments.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com

Thank you, Robert, for reviewing this patch. On going through the patch
further, I realised it was not equipped to handle cases with more than two
active cursors. To accommodate such cases, I have now changed the new
struct that saves the previous query into a list of such structs.
It also turns out that we need not save the tuples for an active cursor,
so we populate the associated tuplestore only when we have to create a new
cursor while the old cursor is not yet completely done.
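To make that control flow concrete, here is a small Python model of the buffering rule (the names `Connection`, `ResultScan`, and the drain step are illustrative, not the patch's actual C structures): a connection can stream only one result set at a time, so starting a second scan drains the first scan's remaining rows into a local buffer, which stands in for the tuplestore.

```python
# Model of the buffering rule: on one connection, only a single query can
# stream results at a time. When a second scan starts while an earlier one
# is still active, the earlier scan's remaining rows are drained into a
# local buffer (standing in for the tuplestore); otherwise nothing is buffered.

class Connection:
    def __init__(self, result_sets):
        self.result_sets = result_sets   # query -> rows held on the "remote" side
        self.active = None               # at most one scan streams at a time

class ResultScan:
    def __init__(self, conn, query):
        self.conn = conn
        self.pending = list(conn.result_sets[query])  # rows not yet fetched
        self.buffer = []                 # rows saved when we lose the connection
        self.buffered = False
        # Starting a new scan while another is streaming forces a drain.
        if conn.active is not None:
            conn.active._drain()
        conn.active = self

    def _drain(self):
        # Save all remaining rows locally so the connection can be reused.
        self.buffer.extend(self.pending)
        self.pending.clear()
        self.buffered = True
        self.conn.active = None

    def fetch(self):
        if self.buffer:
            return self.buffer.pop(0)    # read back from the local buffer
        if self.pending:
            return self.pending.pop(0)   # stream directly from the connection
        return None                      # result set exhausted

conn = Connection({"q1": [1, 2, 3], "q2": ["a", "b"]})
s1 = ResultScan(conn, "q1")
first = s1.fetch()                       # streams directly from the connection
s2 = ResultScan(conn, "q2")              # q1 still active -> q1 drained to its buffer
rest = [s1.fetch(), s1.fetch()]          # remaining q1 rows served from the buffer
```

Note that in the common single-result-set case nothing is drained, which matches the revised approach of buffering only when a second cursor is actually needed.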

In this version I have added regression tests as well. I wanted to exercise
this patch across all the postgres_fdw cases, and the only way I could find
to do that was to run the SELECT statements with the new GUC set.

I also did some performance tests. I used the contrib_regression database
and populated the table "S 1"."T 1" with more tuples to understand the
impact of the patch at a larger scale. I also used auto_explain to capture
the foreign plans.

contrib_regression=# select count(*) from ft1;
2025-11-14 14:40:35.825 CET [44338] LOG:  duration: 61.336 ms  plan:
    Query Text: select count(*) from ft1;
    Foreign Scan  (cost=102.50..123.72 rows=1 width=8)
      Relations: Aggregate on (ft1)
2025-11-14 14:40:35.825 CET [44862] LOG:  duration: 60.575 ms  plan:
    Query Text: DECLARE c1 CURSOR FOR
    SELECT count(*) FROM "S 1"."T 1"
    Aggregate  (cost=21888.22..21888.23 rows=1 width=8)
      ->  Seq Scan on "T 1"  (cost=0.00..19956.98 rows=772498 width=0)
 count
--------
 990821
(1 row)

Time: 62.728 ms
contrib_regression=# SET postgres_fdw.use_cursor = false;
SET
Time: 1.515 ms
contrib_regression=# select count(*) from ft1;
2025-11-14 14:40:46.260 CET [44862] LOG:  duration: 21.875 ms  plan:
    Query Text: SELECT count(*) FROM "S 1"."T 1"
    Finalize Aggregate  (cost=17255.64..17255.65 rows=1 width=8)
      ->  Gather  (cost=17255.43..17255.64 rows=2 width=8)
            Workers Planned: 2
            ->  Partial Aggregate  (cost=16255.43..16255.44 rows=1 width=8)
                  ->  Parallel Seq Scan on "T 1"  (cost=0.00..15450.74 rows=321874 width=0)
2025-11-14 14:40:46.260 CET [44338] LOG:  duration: 22.623 ms  plan:
    Query Text: select count(*) from ft1;
    Foreign Scan  (cost=102.50..123.72 rows=1 width=8)
      Relations: Aggregate on (ft1)
 count
--------
 990821
(1 row)

Time: 24.862 ms

So for this query, the advantage comes from parallel query on the remote
side, which is otherwise not possible in this scenario.
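As a rough sanity check on those numbers, the remote-side durations reported by auto_explain above imply a speedup close to the ideal for two parallel workers plus the leader:

```python
# Remote execution times reported by auto_explain above (ms).
cursor_ms = 60.575       # DECLARE c1 CURSOR FOR SELECT count(*) ... (serial plan)
no_cursor_ms = 21.875    # plain SELECT with Gather + 2 planned workers

speedup = cursor_ms / no_cursor_ms
print(f"remote speedup: {speedup:.2f}x")   # prints: remote speedup: 2.77x
```

That ~2.77x against a 3-process ceiling suggests the gain really is the parallel plan, not some side effect of skipping the cursor protocol.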
To study the performance impact of this patch further, I found another
interesting query; here are the results:

contrib_regression=# SET postgres_fdw.use_cursor = true;
SET
contrib_regression=# explain (analyse, buffers) SELECT t1."C 1" FROM "S 1"."T 1" t1 left join ft1 t2 join ft2 t3 on (t2.c1 = t3.c1) on (t3.c1 = t1."C 1");
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=109013.50..131877.35 rows=772498 width=4) (actual time=112311.578..112804.516 rows=990821.00 loops=1)
   Hash Cond: (t1."C 1" = t3.c1)
   Buffers: shared hit=12232, temp read=12754 written=12754
   ->  Seq Scan on "T 1" t1  (cost=0.00..19956.98 rows=772498 width=4) (actual time=0.039..48.808 rows=990821.00 loops=1)
         Buffers: shared hit=12232
   ->  Hash  (cost=109001.00..109001.00 rows=1000 width=4) (actual time=112310.386..112310.387 rows=990821.00 loops=1)
         Buckets: 262144 (originally 1024)  Batches: 8 (originally 1)  Memory Usage: 6408kB
         Buffers: temp written=2537
         ->  Nested Loop  (cost=200.43..109001.00 rows=1000 width=4) (actual time=0.728..112030.241 rows=990821.00 loops=1)
               ->  Foreign Scan on ft1 t2  (cost=100.00..331.00 rows=1000 width=4) (actual time=0.398..710.505 rows=990821.00 loops=1)
               ->  Foreign Scan on ft2 t3  (cost=100.43..108.66 rows=1 width=4) (actual time=0.082..0.082 rows=1.00 loops=990821)
 Planning:
   Buffers: shared hit=5
 Planning Time: 2.211 ms
 Execution Time: 112825.428 ms
(15 rows)

contrib_regression=# SET postgres_fdw.use_cursor = false;
SET
contrib_regression=# explain (analyse, buffers) SELECT t1."C 1" FROM "S 1"."T 1" t1 left join ft1 t2 join ft2 t3 on (t2.c1 = t3.c1) on (t3.c1 = t1."C 1");
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=109013.50..131877.35 rows=772498 width=4) (actual time=261.416..354.520 rows=990821.00 loops=1)
   Hash Cond: (t1."C 1" = t3.c1)
   Buffers: shared hit=12232, temp written=2660
   ->  Seq Scan on "T 1" t1  (cost=0.00..19956.98 rows=772498 width=4) (actual time=0.021..35.531 rows=990821.00 loops=1)
         Buffers: shared hit=12232
   ->  Hash  (cost=109001.00..109001.00 rows=1000 width=4) (actual time=261.381..261.383 rows=100.00 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 12kB
         Buffers: temp written=2660
         ->  Nested Loop  (cost=200.43..109001.00 rows=1000 width=4) (actual time=255.563..261.356 rows=100.00 loops=1)
               Buffers: temp written=2660
               ->  Foreign Scan on ft1 t2  (cost=100.00..331.00 rows=1000 width=4) (actual time=0.433..0.443 rows=100.00 loops=1)
               ->  Foreign Scan on ft2 t3  (cost=100.43..108.66 rows=1 width=4) (actual time=2.609..2.609 rows=1.00 loops=100)
                     Buffers: temp written=2660
 Planning:
   Buffers: shared hit=5
 Planning Time: 2.284 ms
 Execution Time: 389.358 ms
(17 rows)

So even in the case without a parallel plan, it performs significantly
better. I investigated a bit more to find out why the query was so slow
with cursors, and came to understand that it repeatedly abandons and
recreates the cursor via the postgresReScanForeignScan code path.
So it looks like cursor management itself accounts for that much extra time.
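A back-of-the-envelope calculation from the posted cursor-mode plan supports that: the inner foreign scan on ft2 is rescanned once per outer row, so its per-loop cost alone dominates the total runtime.

```python
# Cursor-mode plan above: inner Foreign Scan on ft2 reports
# actual time=0.082 ms per loop with loops=990821 (one rescan per outer row,
# each going through postgresReScanForeignScan).
per_rescan_ms = 0.082
loops = 990821

inner_total_ms = per_rescan_ms * loops
print(f"inner scan total: {inner_total_ms / 1000:.1f} s")  # prints: inner scan total: 81.2 s
```

That is roughly 81 s of the ~112 s execution time spent just re-driving the inner cursor, which is consistent with cursor setup and teardown being the dominant cost rather than the data transfer itself.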

To understand the impact of the patch when tuples spill to disk, I tried
the following example:
contrib_regression=# SET postgres_fdw.use_cursor = true;
SET
contrib_regression=# set enable_hashjoin = off;
SET
contrib_regression=# explain (analyse, buffers) select * from ft1 a, ft1 b where a.c1 = b.c1 order by a.c2;
                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=831.49..833.99 rows=1000 width=94) (actual time=4537.437..4598.483 rows=990821.00 loops=1)
   Sort Key: a.c2
   Sort Method: external merge  Disk: 137768kB
   Buffers: temp read=83156 written=83253
   ->  Merge Join  (cost=761.66..781.66 rows=1000 width=94) (actual time=3748.488..4090.547 rows=990821.00 loops=1)
         Merge Cond: (a.c1 = b.c1)
         Buffers: temp read=48725 written=48779
         ->  Sort  (cost=380.83..383.33 rows=1000 width=47) (actual time=1818.521..1865.792 rows=990821.00 loops=1)
               Sort Key: a.c1
               Sort Method: external merge  Disk: 75664kB
               Buffers: temp read=18910 written=18937
               ->  Foreign Scan on ft1 a  (cost=100.00..331.00 rows=1000 width=47) (actual time=1.302..1568.640 rows=990821.00 loops=1)
         ->  Sort  (cost=380.83..383.33 rows=1000 width=47) (actual time=1929.955..1981.104 rows=990821.00 loops=1)
               Sort Key: b.c1
               Sort Method: external sort  Disk: 79520kB
               Buffers: temp read=29815 written=29842
               ->  Foreign Scan on ft1 b  (cost=100.00..331.00 rows=1000 width=47) (actual time=0.528..1553.249 rows=990821.00 loops=1)
 Planning Time: 0.479 ms
 Execution Time: 4661.872 ms
(19 rows)

contrib_regression=# SET postgres_fdw.use_cursor = false;
SET
contrib_regression=# explain (analyse, buffers) select * from ft1 a, ft1 b where a.c1 = b.c1 order by a.c2;
                                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=831.49..833.99 rows=1000 width=94) (actual time=3376.385..3435.406 rows=990821.00 loops=1)
   Sort Key: a.c2
   Sort Method: external merge  Disk: 137768kB
   Buffers: temp read=83156 written=83253
   ->  Merge Join  (cost=761.66..781.66 rows=1000 width=94) (actual time=2565.517..2916.814 rows=990821.00 loops=1)
         Merge Cond: (a.c1 = b.c1)
         Buffers: temp read=48725 written=48779
         ->  Sort  (cost=380.83..383.33 rows=1000 width=47) (actual time=1249.517..1300.132 rows=990821.00 loops=1)
               Sort Key: a.c1
               Sort Method: external merge  Disk: 75664kB
               Buffers: temp read=18910 written=18937
               ->  Foreign Scan on ft1 a  (cost=100.00..331.00 rows=1000 width=47) (actual time=1.651..980.740 rows=990821.00 loops=1)
         ->  Sort  (cost=380.83..383.33 rows=1000 width=47) (actual time=1315.990..1369.576 rows=990821.00 loops=1)
               Sort Key: b.c1
               Sort Method: external sort  Disk: 79520kB
               Buffers: temp read=29815 written=29842
               ->  Foreign Scan on ft1 b  (cost=100.00..331.00 rows=1000 width=47) (actual time=0.426..970.728 rows=990821.00 loops=1)
 Planning Time: 0.491 ms
 Execution Time: 3527.457 ms
(19 rows)

So it doesn't hurt in this case either, but that is because what spills to
disk here is the sorts, not the tuplestore added by this patch.
I am working on finding a case with two or more active cursors in which
the tuples stored for one of the cursors spill to disk. That should give a
picture of the worst-case scenario for this patch.

Also, thanks to Kenan, a fellow hacker who finds this work interesting and
has offered to do some performance analysis of this patch; perhaps he can
post more results here as well.

--
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH

Attachment Content-Type Size
v3-0001-Fetch-without-cursors.patch application/octet-stream 405.6 KB
