Re: Emitting JSON to file using COPY TO

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: jian he <jian(dot)universality(at)gmail(dot)com>, Junwang Zhao <zhjwpku(at)gmail(dot)com>
Cc: Florents Tselai <florents(dot)tselai(at)gmail(dot)com>, Joe Conway <mail(at)joeconway(dot)com>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Daniel Verite <daniel(at)manitou-mail(dot)org>, Davin Shearer <davin(at)apache(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Emitting JSON to file using COPY TO
Date: 2026-03-04 15:51:14
Message-ID: 74d3b4e2-d201-489d-9b9a-1a6d0eb492a6@dunslane.net
Lists: pgsql-general pgsql-hackers


On 2026-02-08 Su 10:48 PM, jian he wrote:
> On Sat, Feb 7, 2026 at 9:27 PM Junwang Zhao<zhjwpku(at)gmail(dot)com> wrote:
>> Here are some comments on v23:
>>
>> 0001: The refactor looks straightforward to me. Introducing a format
>> field should make future extensions easier. One suggestion is that we
>> could add some helper macros around format, for example:
>>
>> #define IS_FORMAT_CSV(format) (format == COPY_FORMAT_CSV)
>> #define IS_FORMAT_TEXT_LIKE(format) \
>> (format == COPY_FORMAT_TEXT || format == COPY_FORMAT_CSV)
>>
>> I think this would improve readability.
> Personally, I don't like macros....
>
>> 0002: Since you have moved the `CopyFormat enum` into 0001, the
>> following commit msg should be rephrased.
>>
>> The CopyFormat enum was originally contributed by Joel Jacobson
>> joel(at)compiler(dot)org, later refactored by Jian He to address various issues, and
>> further adapted by Junwang Zhao to support the newly introduced CopyToRoutine
>> struct (commit 2e4127b6d2).
>>
>> -    if (opts_out->format == COPY_FORMAT_BINARY && opts_out->delim)
>> -        ereport(ERROR,
>> -                (errcode(ERRCODE_SYNTAX_ERROR),
>> -        /*- translator: %s is the name of a COPY option, e.g. ON_ERROR */
>> -                 errmsg("cannot specify %s in BINARY mode", "DELIMITER")));
>> +    if (opts_out->delim)
>> +    {
>> +        if (opts_out->format == COPY_FORMAT_BINARY)
>> +            ereport(ERROR,
>> +                    errcode(ERRCODE_SYNTAX_ERROR),
>> +            /*- translator: %s is the name of a COPY option, e.g. ON_ERROR */
>> +                    errmsg("cannot specify %s in BINARY mode", "DELIMITER"));
>> +        else if (opts_out->format == COPY_FORMAT_JSON)
>> +            ereport(ERROR,
>> +                    errcode(ERRCODE_SYNTAX_ERROR),
>> +                    errmsg("cannot specify %s in JSON mode", "DELIMITER"));
>> +    }
>>
>> Can we add a function that converts CopyFormat to a string? Treating
>> CopyFormat as %s in error messages would make the code shorter.
>> However, I'm not sure whether this aligns with translation
>> conventions, correct me if I'm wrong.
>>
> I don’t think this is worth the added complexity.
> That said, I tried to simplify the code and changed it to:
>
> if (opts_out->delim &&
>     (opts_out->format == COPY_FORMAT_BINARY ||
>      opts_out->format == COPY_FORMAT_JSON))
>     ereport(ERROR,
>             errcode(ERRCODE_SYNTAX_ERROR),
>             opts_out->format == COPY_FORMAT_BINARY
>             ? errmsg("cannot specify %s in BINARY mode", "DELIMITER")
>             : errmsg("cannot specify %s in JSON mode", "DELIMITER"));
>
>> - * CSV and text formats share the same TextLike routines except for the
>> + * CSV and text, json formats share the same TextLike routines except for the
>>
>> I'd suggest rewording to `CSV, text and json ...`. The same applies to
>> other parts of this patch.
>>
> sure.
>
>> 0003: The commit message includes some changes (adapting the newly
>> introduced CopyToRoutine) that actually belong to 0002; it would be
>> better to remove them from this commit.
>>
> 0002 commit message:
> """
> This introduces the JSON format option for the COPY TO command, allowing users
> to export query results or table data directly as a single JSON object or a
> stream of JSON objects.
>
> The JSON format is currently supported only for COPY TO operations; it
> is not available for COPY FROM.
>
> JSON format is incompatible with some standard text/CSV parsing or
> formatting options, including:
> - HEADER
> - DEFAULT
> - NULL
> - DELIMITER
> - FORCE QUOTE / FORCE NOT NULL
>
> Regression tests covering valid JSON exports and error handling for
> incompatible options have been added to src/test/regress/sql/copy.sql.
> """
>
> 0003 commit message:
> """
> Add option force_array for COPY JSON FORMAT
> This adds the force_array option, which is available exclusively
> when using COPY TO with the JSON format.
>
> When enabled, this option wraps the output in a top-level JSON array
> (enclosed in square brackets with comma-separated elements), making the
> entire result a valid single JSON value. Without this option, the default
> behavior is to output a stream of independent JSON objects.
>
> Attempting to use this option with COPY FROM or with formats other than
> JSON will raise an error.
> """
>
>> +    if (cstate->json_row_delim_needed && cstate->opts.force_array)
>> +        CopySendChar(cstate, ',');
>> +    else if (cstate->opts.force_array)
>> +    {
>> +        /* first row needs no delimiter */
>> +        CopySendChar(cstate, ' ');
>> +        cstate->json_row_delim_needed = true;
>> +    }
>>
>> can we do this:
>>
>> if (cstate->opts.force_array)
>> {
>>     if (cstate->json_row_delim_needed)
>>         CopySendChar(cstate, ',');
>>     else
>>     {
>>         /* first row needs no delimiter */
>>         CopySendChar(cstate, ' ');
>>         cstate->json_row_delim_needed = true;
>>     }
>> }
>>
> good suggestion.
>
> If more people think WRAP_ARRAY is better than FORCE_ARRAY, we can
> switch to it accordingly.
> The change itself is quite straightforward.

I have reworked these. First I cleaned up a number of things in patches
2 and 3 (thanks, Claude, for the summary):

Patch 2: json format for COPY TO

  copy.c:
  - "json" → "JSON" in the COPY FROM rejection error message.

  copyto.c:
  1. TupleDesc setup runs once, not every row — Added
json_tupledesc_ready flag; the
memcpy/populate_compact_attribute/BlessTupleDesc block is now guarded by
if (!cstate->json_tupledesc_ready).
  2. Comment rewritten — Old: "the slot's TupleDesc may change during
query execution". New: explains BlessTupleDesc registers the RECORDOID
descriptor so lookup_rowtype_tupdesc inside composite_to_json can
  find it.
  3. Eliminated per-row makeStringInfo() — Added StringInfoData
json_buf to the struct, initialized once in CopyToTextLikeStart in
copycontext. Each row does resetStringInfo instead of allocating a new
  StringInfo.
  4. Column list rejection added — Error: "column selection is not
supported in JSON mode" when attnamelist != NIL.
  5. Improved SendCopyBegin comment — Old: "JSON format is always one
non-binary column". New: explains each CopyData message contains one
complete JSON object.

  Tests:
  6. Added copy copytest (style) to stdout (format json) to the
error-case block.
  7. Added copyjsontype table test with json, jsonb columns — verifies
values are embedded directly, not double-encoded. Covers json objects,
scalars, arrays, nested objects, and nulls.

Patch 3: Add option force_array

  copy.c:
  1. "json" → "JSON" in COPY FROM error (carried from patch 2).
  2. "can only used" → "can only be used" — grammar fix in FORCE_ARRAY
error.

  copyto.c:
  3. Struct fields reorganized — JSON fields grouped under /* JSON
format state */ comment with inline descriptions, instead of a
standalone json_row_delim_needed with a vague comment.
  4. Block comment updated — Was "CSV, text and json formats share the
same TextLike routines except for the one-row callback". Now correctly
notes JSON has its own one-row and end callbacks.
  5. CopyToTextLikeEnd comment fixed — Was "text, CSV, and json", now
"text and CSV" (JSON uses CopyToJsonEnd).
  6. Cleaner start callback — FORCE_ARRAY bracket emission moved inside
the if (format == JSON) block after json_buf init, instead of a separate
top-level conditional.
  7. Inherits all patch 2 fixes — TupleDesc guard, reusable json_buf,
improved comments, column list rejection.

  Tests:
  8. --Error → -- should fail: force_array requires json format; --ok →
-- force_array variants.
  9. copyjsontype test carried through from patch 2.

Then I reworked the way this operates. In order to support column lists
with JSON output, we need to deal with individual columns instead of
whole records. That required quite a number of changes, as can be seen
in patch 4, including exporting a small new function from json.c.

The result is a lot cleaner, I believe, and in my benchmarking is faster
by a factor of almost 2.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
