| From: | Shinya Kato <shinya11(dot)kato(at)gmail(dot)com> |
|---|---|
| To: | Florents Tselai <florents(dot)tselai(at)gmail(dot)com>, "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Add LIMIT option to COPY FROM |
| Date: | 2026-02-04 01:06:28 |
| Message-ID: | CAOzEurRhdikO-pFnR+eQHT6Tdi05KHbSYMstPunTrfnrDLedew@mail.gmail.com |
| Lists: | pgsql-hackers |
Thank you for the comments.
On Tue, Feb 3, 2026 at 11:12 PM David G. Johnston
<david(dot)g(dot)johnston(at)gmail(dot)com> wrote:
>> Syntax example:
>> - COPY t FROM STDIN (LIMIT 100);
>>
>> This feature is useful for:
>> - Loading only the first N rows from a huge CSV file to verify data or
>> table definitions before a full import
>
>
> Would want it paired with offset for this use case.
COPY FROM already accepts HEADER <integer> to skip N leading lines, so
combining it with the proposed LIMIT covers the OFFSET + LIMIT use case:
=# COPY t FROM 'file.csv' (HEADER 100, LIMIT 50);
>> Design:
>> - The LIMIT count applies after WHERE filtering and ON_ERROR skipping,
>> so it represents the actual number of rows inserted.
>
>
> Not sure about that choice. I’d go with pre-eval or implement both and default to pre-eval.
Counting rows after filtering is consistent with SQL semantics:
SELECT ... WHERE ... LIMIT N counts rows that pass the filter, not
rows scanned. Pre-eval behavior is already achievable externally
(head -n), while post-eval can only be done server-side, which makes
post-eval the more valuable choice for a built-in option.
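For example, under the proposed semantics the following would insert
exactly 10 rows that satisfy the WHERE clause, no matter how many
input lines have to be read or skipped to get them (the column name
and the assumption that the file has at least 10 qualifying rows are
made up):
=# COPY t FROM 'file.csv' (FORMAT csv, ON_ERROR ignore, LIMIT 10) WHERE val > 0;
COPY 10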
On Tue, Feb 3, 2026 at 11:41 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Shinya Kato <shinya11(dot)kato(at)gmail(dot)com> writes:
> > I'd like to propose adding a LIMIT option to COPY FROM, which limits
> > the number of rows to load.
>
> Do we really need this? Each random feature we load onto COPY
> slows it down for everybody.
When LIMIT is not specified (the default), the only overhead is a
single if (limit > 0) branch per row on a struct field already in L1
cache — the same class of cost as the existing ON_ERROR and WHERE
checks.
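For the record, the shape of the check I have in mind is roughly the
following (field and variable names are placeholders here, not the
actual patch):
    /* in the per-row loop of CopyFrom(); limit == 0 means no LIMIT given */
    if (cstate->opts.limit > 0 && processed >= cstate->opts.limit)
        break;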
This is not a novel feature either. Oracle SQL*Loader provides LOAD,
and SQL Server BULK INSERT provides LASTROW. The absence of LIMIT in
COPY FROM is arguably a gap relative to other RDBMSes.
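Roughly, from memory (so the exact spelling may be off):
    -- SQL Server
    BULK INSERT t FROM 'file.csv' WITH (LASTROW = 100);
    -- Oracle SQL*Loader, in the control file or on the command line
    OPTIONS (LOAD=100)        -- or: sqlldr ... load=100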
On Wed, Feb 4, 2026 at 12:07 AM Florents Tselai
<florents(dot)tselai(at)gmail(dot)com> wrote:
> I work with such scenarios a lot and I can't see why COPY should worry itself about such filtering.
> IRL most of what goes into COPY'S STDIN has already been filtered extensively,
> like ... | head -n 100 | COPY t FROM STDIN
head -n works for the STDIN-pipe workflow, but not for all COPY FROM scenarios:
- Server-side files via remote psql: COPY t FROM
'/server/path/file.csv' is read directly by the server process. A
client connected over the network has no way to interpose a pipe on
that I/O path.
- Interaction with WHERE / ON_ERROR: head -n 100 limits input lines,
but cannot guarantee a specific number of inserted rows when some rows
are filtered by WHERE or skipped by ON_ERROR; that control is only
possible server-side (see the example below).
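To make the second point concrete (the counts below are made up, and
the LIMIT behavior is of course the proposed one):
    $ head -n 100 file.csv | psql -c "COPY t FROM STDIN (FORMAT csv, ON_ERROR ignore)"
    COPY 97     <- say three lines were malformed, so not exactly 100 rows
    =# COPY t FROM 'file.csv' (FORMAT csv, ON_ERROR ignore, LIMIT 100);
    COPY 100    <- stops only once 100 rows have actually been inserted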
The same "do it outside" argument could be made against the WHERE
clause ("just use grep"), yet WHERE was accepted because server-side
filtering provides value that external tools cannot fully replicate.
LIMIT fills the same kind of gap.
--
Best regards,
Shinya Kato
NTT OSS Center