Bulk DML performance

From: <bill(dot)poole(at)ymail(dot)com>
To: <pgsql-performance(at)postgresql(dot)org>
Subject: Bulk DML performance
Date: 2025-03-13 04:05:34
Message-ID: 008701db93cd$2eb404d0$8c1c0e70$@ymail.com
Lists: pgsql-performance

Hello! I'm building a system that needs to insert/update batches of millions
of rows (using INSERT .. ON CONFLICT (..) DO UPDATE) in a single database
transaction, where each row is about 1.5 kB. The system produces about 3
million rows (about 4.5 GB) of data in about 5 seconds, but PostgreSQL takes
about 35 seconds to insert that data and about 55 seconds to update it. This
happens both on my local dev machine and on a large AWS Aurora PostgreSQL
instance (db.r8g.16xlarge with 64 vCPUs, 512 GB RAM and 30 Gbps of network
bandwidth).
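For reference, the workload has roughly the following shape (the table and
column names here are placeholders, not my real schema; repeat('x', 1500)
stands in for the ~1.5 kB of data per row):

```sql
-- Hypothetical illustration of the workload described above:
-- a batch upsert of wide (~1.5 kB) rows in a single transaction.
CREATE TABLE big_rows (
    id      bigint PRIMARY KEY,
    payload text               -- stands in for ~1.5 kB of row data
);

BEGIN;
INSERT INTO big_rows (id, payload)
SELECT s, repeat('x', 1500)    -- ~1.5 kB per row
FROM generate_series(1, 3000000) AS s
ON CONFLICT (id) DO UPDATE
SET payload = EXCLUDED.payload;
COMMIT;
```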

The following INSERT .. ON CONFLICT (..) DO UPDATE statement inserts/updates
3 million rows of only 9 bytes each (27 MB in total). It takes about 8
seconds on the first run (inserting the rows) and about 14 seconds on
subsequent runs (updating the rows). Yet after the first run, SELECT
pg_size_pretty(pg_total_relation_size('test')) reports the table size as
191 MB, and after the second run it reports 382 MB (another 191 MB added).

CREATE TABLE test (
    id bigint PRIMARY KEY,
    text1 text
);

INSERT INTO test (id, text1)
SELECT generate_series, 'x'
FROM generate_series(1, 3000000)
ON CONFLICT (id) DO UPDATE
SET text1 = 'x';
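The extra 191 MB per run would be consistent with each update writing a new
row version and leaving the old one behind as a dead tuple (my assumption,
based on how I understand PostgreSQL's MVCC to work). If so, the dead tuples
should be visible in the standard statistics views, e.g.:

```sql
-- After the second run, old row versions should show up as dead tuples
-- until VACUUM reclaims the space (assumption: MVCC row versioning is
-- what accounts for the extra 191 MB per run).
SELECT n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'test';

VACUUM VERBOSE test;  -- reclaims dead tuples for reuse; disk space is
                      -- only returned to the OS in limited cases
```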

If PostgreSQL is writing 191 MB on the first run and 382 MB on each
subsequent run, then it is writing only about 28 MB/s. Yet PostgreSQL can
also write about 4.5 GB in about 35 seconds (as stated above), which is
about 128 MB/s, so the performance constraint seems to depend more on the
number of rows inserted than on the size of each row.

Furthermore, deleting the rows takes about 18 seconds to perform (about 4
seconds longer than the time taken to update the rows):

DELETE FROM test
WHERE id IN (
    SELECT * FROM generate_series(1, 3000000)
);
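To see where the time goes, I've been measuring with psql's \timing and
EXPLAIN (a sketch of what I run, against the same test table as above):

```sql
-- In psql: \timing on
-- EXPLAIN (ANALYZE, BUFFERS) executes the statement and reports
-- per-node timing and buffer I/O, which shows where the time is spent.
EXPLAIN (ANALYZE, BUFFERS)
DELETE FROM test
WHERE id IN (
    SELECT * FROM generate_series(1, 3000000)
);
```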

It seems like it should be possible to do better than this on modern
hardware, but I don't have enough knowledge of the inner workings of
PostgreSQL to know whether my instinct is correct on this, so I thought I'd
raise the question with the experts.

Thanks!

Bill
