| From: | Alexander Pyhalov <a(dot)pyhalov(at)postgrespro(dot)ru> |
|---|---|
| To: | Tomas Vondra <tomas(at)vondra(dot)me> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Limit memory usage by postgres_fdw batches |
| Date: | 2025-12-30 09:02:00 |
| Message-ID: | acac546238ae6ea0ffd4036a48283262@postgrespro.ru |
| Lists: | pgsql-hackers |
Tomas Vondra wrote on 2025-12-27 16:18:
> Hi Alexander,
>
> On 12/26/25 11:54, Alexander Pyhalov wrote:
>> Hi.
>>
>> We had some real cases when a client set a rather big batch_size at
>> the server level, but for some foreign tables containing large
>> documents it was inadequate and led to OOM killer intervention. You
>> can argue that batch_size can be set at the foreign table level, but
>> that can still be not flexible enough when tuple size varies.
>
> I agree this can be an issue with large tuples. It'd be good to
> consider the size, not just the number of tuples.
>
> I think you modified the right places, but I think there are two or
> three issues we should improve:
>
> 1) It calls estimate_batch_length() for every ExecInsert() call, i.e.
> for every tuple. And it walks all tuples up to that point, which makes
> it O(N^2). I haven't measured how significant it is, but AFAICS we
> could track the current size of the batch fairly easily, and use that.
>
> 2) work_mem is in kilobytes, while batch_len is in bytes, so the
> comparison (batch_len > work_mem) is not quite right. It'll probably
> fire every time, preventing any batching.
>
> 3) Shouldn't this also consider the size of the new tuple in batch_len?
> Imagine the tuples are 99% of the work_mem limit. We add the first one.
> When adding the next one we check that the current batch is below
> work_mem, and so we proceed to add the second tuple. Now the batch is
> 198% of the limit.
>
> I think it should work like this:
>
> 1) batch_len + tup_len < work_mem => add tuple to batch
> 2) tup_len < work_mem => flush batch, add tuple to batch
> 3) tup_len >= work_mem => flush batch, insert tuple directly
>
> What bothers me a little bit is that this is per relation. AFAICS when
> inserting into a partitioned table with multiple foreign partitions,
> each partition will have a separate limit. I wonder if we could do
> better and have some sort of "global" limit for the whole insert.
>
> But that's not the fault of this patch, of course.
Hi. Thank you for the feedback.

I've fixed the first two issues in the first patch.

I've tried to implement switching from a batch to a plain foreign insert
(it's in the second patch). However, I don't like it. It changes the
resultRelInfo->ri_NumSlots interface: it can now be 0 when a BatchInsert
was followed by a plain foreign insert, and ExecPendingInserts() has to
handle this. Also, we can no longer keep the fact that BatchInsert() was
performed private to ExecInsert(), as we have to determine whether the
estate->es_insert_pending_result_relations and
estate->es_insert_pending_modifytables lists were modified between
ExecInsert() calls. In theory we could look at
list_member_ptr(estate->es_insert_pending_result_relations,
resultRelInfo), but I don't know if iterating over this list for each
tuple is a good idea.
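
To make your three rules concrete, the per-tuple decision boils down to
something like the sketch below. This is an illustration only, not the
patch code: buffer_tuple(), flush_batch() and insert_directly() are
made-up stand-ins for the real ExecInsert()/batch machinery, and
batch_len is tracked incrementally in bytes (which also avoids the
O(N^2) walk from point 1); work_mem is in kilobytes, hence the
multiplication.

#include "postgres.h"
#include "miscadmin.h"          /* work_mem GUC, in kilobytes */

/* Hypothetical stubs standing in for the real postgres_fdw machinery. */
static void buffer_tuple(void)    { /* append the slot to the pending batch */ }
static void flush_batch(void)     { /* send the pending batch to the remote side */ }
static void insert_directly(void) { /* single-row foreign insert, no batching */ }

static Size batch_len = 0;      /* running size of the pending batch, in bytes */

static void
route_tuple(Size tup_len)
{
    Size        limit = (Size) work_mem * 1024; /* work_mem is in kB */

    if (batch_len + tup_len < limit)
    {
        /* Rule 1: the new tuple still fits under the limit. */
        buffer_tuple();
        batch_len += tup_len;
    }
    else if (tup_len < limit)
    {
        /* Rule 2: flush the current batch, start a new one with this tuple. */
        flush_batch();
        buffer_tuple();
        batch_len = tup_len;
    }
    else
    {
        /* Rule 3: the tuple alone exceeds the limit, so don't batch it. */
        flush_batch();
        batch_len = 0;
        insert_directly();
    }
}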
>
>> I suppose this also applies to fetch_size. The issue here is that we
>> can't limit the size of the data (versus the number of rows) while
>> fetching from the cursor. But we can use a tuplestore to hold the
>> fetched results, so that they spill to disk.
>>
>
> Perhaps. Seems like a separate issue. I haven't looked very closely,
> but do we want to use the tuplestore always, or just when the tuples
> get too large? It might even be decided batch by batch, I guess. With
> fetch_size=1 it's hardly useful, right? Is the tuplestore management
> overhead measurable?
I haven't noticed a significant impact in simple pgbench tests. I will
check different cases with fetch_size = 1. For now I'm not sure whether
we could trigger OOM, for example, with an async foreign scan of a
partitioned table with many foreign partitions - currently we'd have to
keep each tuple we get in memory, so perhaps saving it to the tuplestore
until it's needed by Append is not such a bad idea.
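
To give an idea of the tuplestore part of the third patch, a minimal
sketch could look like the following, assuming a hypothetical per-scan
store and the made-up names store_fetched_row()/next_fetched_row() (the
real changes sit, presumably, around PgFdwScanState and
fetch_more_data()). The store is capped at work_mem, so anything beyond
that spills to disk; ctids still have to be kept on the side, as noted
below.

#include "postgres.h"
#include "miscadmin.h"              /* work_mem GUC, in kilobytes */
#include "executor/tuptable.h"      /* TupleTableSlot */
#include "utils/tuplestore.h"

static Tuplestorestate *fetched_rows = NULL;    /* hypothetical per-scan store */

/* Stash one fetched row; the tuplestore spills to disk past work_mem kB. */
static void
store_fetched_row(TupleTableSlot *slot)
{
    if (fetched_rows == NULL)
        fetched_rows = tuplestore_begin_heap(false, false, work_mem);
    tuplestore_puttupleslot(fetched_rows, slot);
}

/* Hand rows back one at a time; returns false once the store is drained. */
static bool
next_fetched_row(TupleTableSlot *slot)
{
    return tuplestore_gettupleslot(fetched_rows, true, false, slot);
}

/* tuplestore_clear() could reset the store before fetching the next batch. */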
>
>> I'm attaching two patches which try to fix issues with possibly huge
>> memory usage by postgres_fdw batches.
>> With fetched tuples we still can't use only a tuplestore, as ctids are
>> not preserved, so we have to store them separately.
>>
>> The reproducer for insert is simple.
>>
>> create extension postgres_fdw;
>> create server loopback foreign data wrapper postgres_fdw options
>> (dbname 'postgres', port '5432', batch_size '100', fetch_size '100');
>> create table base_table(i int, s bytea);
>> create foreign table foreign_table (i int, s bytea) server loopback
>> options(table_name 'base_table');
>> create user mapping for public server loopback;
>>
>> insert into foreign_table select i,
>> pg_read_binary_file('/some/big/file') from generate_series(1,1000) i;
>>
>> will easily grow backend RSS to several gigabytes.
>> The first patch fixes this problem.
>>
>> The second patch alleviates the second issue - SELECT * queries can
>> also grow backend memory to several GBs. Memory usage can still peak
>> (in my toy examples) at up to 3-4 GB, but at least that seems 1-2 GB
>> less than the unpatched version.
>>
>
> How large are the tuples? How much higher was this RSS than the
> theoretical minimum?
>
Sorry, I've lost the exact reproducer. Memory consumption of course
depends on tuple size and fetch size. I've redone some simple tests.
select * from pg_foreign_server;
  oid  | srvname  | srvowner | srvfdw | srvtype | srvversion | srvacl |                         srvoptions
-------+----------+----------+--------+---------+------------+--------+------------------------------------------------------------
 16403 | loopback |       10 |  16392 |         |            |        | {dbname=postgres,port=5432,batch_size=100,fetch_size=100}
(1 row)

select count(*) from foreign_table;
 count
-------
  1100
(1 row)
\d+ base_table
                                         Table "public.base_table"
 Column |  Type   | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
--------+---------+-----------+----------+---------+----------+-------------+--------------+-------------
 i      | integer |           |          |         | plain    |             |              |
 s      | bytea   |           |          |         | extended |             |              |
Access method: heap
select pg_size_pretty(pg_total_relation_size('base_table'));
 pg_size_pretty
----------------
 5495 MB
(1 row)

select pg_size_pretty(pg_total_relation_size('base_table')/1100);
 pg_size_pretty
----------------
 5116 kB
(1 row)
A backend executing the query
"select i, length(s) from foreign_table"
takes up to 1.4 GB RSS on master and up to 1 GB on the patched version.
P.S. Going to be on vacation for the first half of January.
--
Best regards,
Alexander Pyhalov,
Postgres Professional
| Attachment | Content-Type | Size |
|---|---|---|
| v2-0001-Limit-batch_size-for-foreign-insert-with-work_mem.patch | text/x-diff | 2.7 KB |
| v2-0002-Fall-back-to-foreign-insert-when-tuple-is-too-big.patch | text/x-diff | 10.2 KB |
| v2-0003-Use-tuplestore-in-PgFdwScanState-scan-state-to-limit.patch | text/x-diff | 5.8 KB |