Re: Add LIMIT option to COPY FROM

From: Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>
To: Florents Tselai <florents(dot)tselai(at)gmail(dot)com>, "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Add LIMIT option to COPY FROM
Date: 2026-02-04 01:06:28
Message-ID: CAOzEurRhdikO-pFnR+eQHT6Tdi05KHbSYMstPunTrfnrDLedew@mail.gmail.com
Lists: pgsql-hackers

Thank you for the comments.

On Tue, Feb 3, 2026 at 11:12 PM David G. Johnston
<david(dot)g(dot)johnston(at)gmail(dot)com> wrote:
>> Syntax example:
>> - COPY t FROM STDIN (LIMIT 100);
>>
>> This feature is useful for:
>> - Loading only the first N rows from a huge CSV file to verify data or
>> table definitions before a full import
>
>
> Would want it paired with offset for this use case.

COPY FROM already accepts HEADER <integer> to skip N leading lines, so
combining it with the proposed LIMIT option gives OFFSET + LIMIT
semantics without needing a separate OFFSET option:
=# COPY t FROM 'file.csv' (HEADER 100, LIMIT 50);

>> Design:
>> - The LIMIT count applies after WHERE filtering and ON_ERROR skipping,
>> so it represents the actual number of rows inserted.
>
>
> Not sure about that choice. I’d go with pre-eval or implement both and default to pre-eval.

It is consistent with SQL semantics — SELECT ... WHERE ... LIMIT N
counts rows that pass the filter, not rows scanned. Pre-eval behavior
is already achievable externally (head -n), while post-eval can only
be done server-side, which makes it the more valuable choice for a
built-in option.
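
For illustration, a sketch of the intended post-eval behavior under
this proposal (the table, column, and file name are made up):

=# COPY t FROM 'data.csv' (FORMAT csv, LIMIT 100) WHERE id > 10;
COPY 100

COPY stops once 100 rows have passed the WHERE clause and been
inserted, however many input lines had to be read to reach that count.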

On Tue, Feb 3, 2026 at 11:41 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Shinya Kato <shinya11(dot)kato(at)gmail(dot)com> writes:
> > I'd like to propose adding a LIMIT option to COPY FROM, which limits
> > the number of rows to load.
>
> Do we really need this? Each random feature we load onto COPY
> slows it down for everybody.

When LIMIT is not specified (the default), the only overhead is a
single if (limit > 0) branch per row on a struct field already in L1
cache — the same class of cost as the existing ON_ERROR and WHERE
checks.

This is not a novel feature either: Oracle SQL*Loader provides the
LOAD parameter, and SQL Server's BULK INSERT provides the LASTROW
option. The absence of an equivalent in COPY FROM is arguably a gap
relative to other RDBMSes.
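
For comparison, a rough sketch of the closest equivalents (file names
are illustrative; note that LASTROW counts input rows rather than
inserted rows):

-- SQL Server
BULK INSERT t FROM 'data.csv' WITH (LASTROW = 100);

-- proposed syntax here
COPY t FROM 'data.csv' (FORMAT csv, LIMIT 100);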

On Wed, Feb 4, 2026 at 12:07 AM Florents Tselai
<florents(dot)tselai(at)gmail(dot)com> wrote:
> I work with such scenarios a lot and I can't see why COPY should worry itself about such filtering.
> IRL most of what goes into COPY'S STDIN has already been filtered extensively,
> like ... | head -n 100 | COPY t FROM STDIN

head -n works for the STDIN-pipe workflow, but not for all COPY FROM scenarios:
- Server-side files via remote psql: COPY t FROM
'/server/path/file.csv' is read directly by the server process, so a
client connected over the network has no way to interpose a pipe on
that I/O path (see the sketch after this list).
- Interaction with WHERE / ON_ERROR: head -n 100 limits input lines,
but it cannot guarantee a specific number of inserted rows when some
rows are filtered by WHERE or skipped by ON_ERROR. That control is
only possible server-side.
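
A sketch of the server-side case, assuming the proposed LIMIT option
and the illustrative path above; the file is opened by the backend, so
no client-side pipe can truncate it:

$ psql -h db.example.com -d mydb
=# COPY t FROM '/server/path/file.csv' (FORMAT csv, ON_ERROR ignore, LIMIT 100);

With ON_ERROR ignore in the mix, LIMIT still caps the number of
inserted rows at 100 even if some input lines are skipped.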

The same "do it outside" argument could be made against the WHERE
clause ("just use grep"), yet WHERE was accepted because server-side
filtering provides value that external tools cannot fully replicate.
LIMIT fills the same kind of gap.

--
Best regards,
Shinya Kato
NTT OSS Center
