| From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
|---|---|
| To: | Matheus Alcantara <matheusssilv97(at)gmail(dot)com> |
| Cc: | Andrew Dunstan <andrew(at)dunslane(dot)net>, jian he <jian(dot)universality(at)gmail(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: postgres_fdw: Use COPY to speed up batch inserts |
| Date: | 2025-11-18 02:03:26 |
| Message-ID: | CAD21AoDFisd4HYcDspqkzoM9sMofba8aBMgB0RkG8uCqt-ZQUg@mail.gmail.com |
| Lists: | pgsql-hackers |
On Thu, Nov 6, 2025 at 3:49 PM Matheus Alcantara
<matheusssilv97(at)gmail(dot)com> wrote:
>
> On Fri Oct 31, 2025 at 4:02 PM -03, I wrote:
> > It's proving a bit complicated to decide at runtime whether we should
> > use COPY or INSERT for batch inserts into a foreign table. Perhaps we
> > could add a new option on CREATE FOREIGN TABLE to enable this usage or
> > not? We could document the performance improvements and the limitations
> > so the user can decide whether to enable it.
> >
> Here is v5 that implements this idea.
>
> In this version I've introduced a foreign table and foreign server
> option "use_copy_for_insert" (I'm open to a better name) that enables
> the use of COPY as the remote command to execute an INSERT into a
> foreign table. COPY can be used if the user enables this option on
> the foreign table or the foreign server and the original INSERT
> statement doesn't have a RETURNING clause.
>
> See the benchmark results:
>
> pgbench -n -c 10 -j 10 -t 100 -f bench.sql postgres
>
> Master (batch_size = 1 with a single row to insert):
> tps = 16000.768037
>
> Master (batch_size = 1 with 1000 rows to insert):
> tps = 133.451518
>
> Master (batch_size = 100 with 1000 rows to insert):
> tps = 1274.096347
>
> -----------------
>
> Patch (batch_size = 1, use_copy_for_insert = false with a single row to
> insert):
> tps = 15734.155705
>
> Patch (batch_size = 1, use_copy_for_insert = false with 1000 rows to
> insert):
> tps = 132.644801
>
> Patch (batch_size = 100, use_copy_for_insert = false with 1000 rows to
> insert):
> tps = 1245.514591
>
> -----------------
>
> Patch (batch_size = 1, use_copy_for_insert = true with a single row to
> insert):
> tps = 17604.394057
>
> Patch (batch_size = 1, use_copy_for_insert = true with 1000 rows to
> insert):
> tps = 88.998804
>
> Patch (batch_size = 100, use_copy_for_insert = true with 1000 rows to
> insert):
> tps = 2406.009249
>
> -----------------
>
> We can see that when batch inserting with batch_size configured
> properly we get a very significant performance improvement, and when
> the "use_copy_for_insert" option is disabled the performance is close
> to master.
>
> The problem is when "batch_size" is 1 (the default) and
> "use_copy_for_insert" is enabled. In this scenario we send multiple
> COPY commands, each with a single row, to the foreign server.
>
> One way to fix this would be to decide at runtime (in
> execute_foreign_modify()) whether COPY can be used based on the number
> of rows being inserted. I don't think I like this option because it
> would make the EXPLAIN output differ when the ANALYZE option is used:
> at planning time we don't know the number of rows being inserted, so
> plain EXPLAIN (VERBOSE) would show INSERT as the remote SQL, while
> EXPLAIN (ANALYZE) with enough rows to enable COPY would show the COPY
> command.
>
> Since the new "use_copy_for_insert" option is disabled by default, I
> think we could document this limitation and mention the performance
> improvements when it is used correctly with the batch_size option.
>
> Another option would be to use the COPY command only if
> "use_copy_for_insert" is true and "batch_size" is > 1. We would still
> have the performance issue if the user inserts a single row, but it
> would be confined to fewer scenarios. The attached 0002 implements
> this idea.
>
> Thoughts?
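For readers following along, enabling the proposed option would presumably look something like the following. This is only a sketch based on the description above: "loopback" is a placeholder server name, "use_copy_for_insert" is the option introduced by v5 (not an existing postgres_fdw option), and batch_size is the existing postgres_fdw option it interacts with.

```sql
-- Sketch; "loopback" is a hypothetical server name.
ALTER SERVER loopback
    OPTIONS (ADD use_copy_for_insert 'true');

-- Or per table, together with a batch size large enough to benefit:
ALTER FOREIGN TABLE t
    OPTIONS (ADD use_copy_for_insert 'true', ADD batch_size '100');
```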
IIUC the performance regression occurs when users insert many rows
into a foreign table with batch_size = 1 and use_copy_for_insert =
true (tps: 133.451518 vs. 132.644801 vs. 88.998804). Since batch_size
defaults to 1, users might experience performance issues if they
enable use_copy_for_insert without adjusting the batch_size. I'm
worried that users could easily find themselves in this situation.
One possible solution would be to introduce a threshold, like
copy_min_row, which would specify the minimum number of rows needed
before switching to the COPY command. However, this would require
coordination with batch_size since having copy_min_row lower than
batch_size wouldn't make sense. Alternatively, when users are using
batch insertion (batch_size > 0), we could use the COPY command only
for full batches and fall back to INSERT for partial ones.
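The last alternative (COPY only for full batches) can be sketched as a small decision helper. This is a hypothetical illustration, not code from the patch; should_use_copy() and its signature are made up for the example:

```c
#include <stdbool.h>

/*
 * Hypothetical sketch (not from the patch): decide whether a flush of
 * "nrows" pending rows should be sent with COPY.  Under the suggestion
 * above, COPY is used only when the option is enabled, batch insertion
 * is in effect, and the batch is full; a partial batch (e.g. the
 * trailing rows of a large INSERT) falls back to a parameterized
 * INSERT, so single-row flushes never pay the per-COPY overhead.
 */
bool
should_use_copy(bool use_copy_for_insert, int nrows, int batch_size)
{
    return use_copy_for_insert && batch_size > 1 && nrows == batch_size;
}
```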
BTW I noticed that the use_copy_for_insert option doesn't work with the
COPY FROM command. I got the following error with use_copy_for_insert =
true and batch_size = 3:
postgres(1:2546195)=# copy t from '/tmp/a.csv'; -- table 't' is a foreign table.
ERROR: there is no parameter $1
CONTEXT: remote SQL command: INSERT INTO public.t(c) VALUES ($1)
COPY t
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com