Re: Emitting JSON to file using COPY TO

From: Joe Conway <mail(at)joeconway(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>, jian he <jian(dot)universality(at)gmail(dot)com>, Junwang Zhao <zhjwpku(at)gmail(dot)com>
Cc: Florents Tselai <florents(dot)tselai(at)gmail(dot)com>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Daniel Verite <daniel(at)manitou-mail(dot)org>, Davin Shearer <davin(at)apache(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Emitting JSON to file using COPY TO
Date: 2026-03-05 18:06:37
Message-ID: 6cbbf7f3-da4a-4201-98a6-4347a3c35db0@joeconway.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

On 3/4/26 10:51, Andrew Dunstan wrote:
>
> On 2026-02-08 Su 10:48 PM, jian he wrote:
>> On Sat, Feb 7, 2026 at 9:27 PM Junwang Zhao<zhjwpku(at)gmail(dot)com> wrote:
>>> Here are some comments on v23:
>>>
>>> 0001: The refactor looks straightforward to me. Introducing a format
>>> field should make future extensions easier. One suggestion is that we
>>> could add some helper macros around format, for example:
>>>
>>> #define IS_FORMAT_CSV(format) (format == COPY_FORMAT_CSV)
>>> #define IS_FORMAT_TEXT_LIKE(format) \
>>> (format == COPY_FORMAT_TEXT || format == COPY_FORMAT_CSV)
>>>
>>> I think this would improve readability.
>> Personally, I don't like marcos....
>>
>>> 0002: Since you have moved the `CopyFormat enum` into 0001, the
>>> following commit msg should be rephrased.
>>>
>>> The CopyFormat enum was originally contributed by Joel Jacobson
>>> joel(at)compiler(dot)org, later refactored by Jian He to address various issues, and
>>> further adapted by Junwang Zhao to support the newly introduced CopyToRoutine
>>> struct (commit 2e4127b6d2).
>>>
>>> - if (opts_out->format == COPY_FORMAT_BINARY && opts_out->delim)
>>> - ereport(ERROR,
>>> - (errcode(ERRCODE_SYNTAX_ERROR),
>>> - /*- translator: %s is the name of a COPY option, e.g. ON_ERROR */
>>> - errmsg("cannot specify %s in BINARY mode", "DELIMITER")));
>>> + if (opts_out->delim)
>>> + {
>>> + if (opts_out->format == COPY_FORMAT_BINARY)
>>> + ereport(ERROR,
>>> + errcode(ERRCODE_SYNTAX_ERROR),
>>> + /*- translator: %s is the name of a COPY option, e.g. ON_ERROR */
>>> + errmsg("cannot specify %s in BINARY mode", "DELIMITER"));
>>> + else if (opts_out->format == COPY_FORMAT_JSON)
>>> + ereport(ERROR,
>>> + errcode(ERRCODE_SYNTAX_ERROR),
>>> + errmsg("cannot specify %s in JSON mode", "DELIMITER"));
>>> + }
>>>
>>> Can we add a function that converts CopyFormat to a string? Treating
>>> CopyFormat as %s in error messages would make the code shorter.
>>> However, I'm not sure whether this aligns with translation
>>> conventions, correct me if I'm wrong.
>>>
>> I don’t think this is worth the added complexity.
>> That said, I tried to simplify the code and changed it to:
>>
>> if (opts_out->delim &&
>> (opts_out->format == COPY_FORMAT_BINARY ||
>> opts_out->format == COPY_FORMAT_JSON))
>> ereport(ERROR,
>> errcode(ERRCODE_SYNTAX_ERROR),
>> opts_out->format == COPY_FORMAT_BINARY
>> ? errmsg("cannot specify %s in BINARY mode", "DELIMITER")
>> : errmsg("cannot specify %s in JSON mode", "DELIMITER"));
>>
>>> - * CSV and text formats share the same TextLike routines except for the
>>> + * CSV and text, json formats share the same TextLike routines except for the
>>>
>>> I'd suggest rewording to `CSV, text and json ...`. The same applied to
>>> other parts in this patch.
>>>
>> sure.
>>
>>> 0003: The commit message includes some changes(adapt the newly
>>> introduced CopyToRoutine) that actually belong to 0002; it would be
>>> better to remove them from this commit.
>>>
>> 0002 commit message:
>> """
>> This introduces the JSON format option for the COPY TO command, allowing users
>> to export query results or table data directly as a single JSON object or a
>> stream of JSON objects.
>>
>> The JSON format is currently supported only for COPY TO operations; it
>> is not available for COPY FROM.
>>
>> JSON format is incompatible with some standard text/CSV parsing or
>> formatting options,
>> including:
>> - HEADER
>> - DEFAULT
>> - NULL
>> - DELIMITER
>> - FORCE QUOTE / FORCE NOT NULL
>>
>> Regression tests covering valid JSON exports and error handling for
>> incompatible options have been added to src/test/regress/sql/copy.sql.
>> """
>>
>> 0003 commit message:
>> """
>> Add option force_array for COPY JSON FORMAT
>> This adds the force_array option, which is available exclusively
>> when using COPY TO with the JSON format.
>>
>> When enabled, this option wraps the output in a top-level JSON array
>> (enclosed in square brackets with comma-separated elements), making the
>> entire result a valid single JSON value. Without this option, the default
>> behavior is to output a stream of independent JSON objects.
>>
>> Attempting to use this option with COPY FROM or with formats other than
>> JSON will raise an error.
>> """
>>
>>> + if (cstate->json_row_delim_needed && cstate->opts.force_array)
>>> + CopySendChar(cstate, ',');
>>> + else if (cstate->opts.force_array)
>>> + {
>>> + /* first row needs no delimiter */
>>> + CopySendChar(cstate, ' ');
>>> + cstate->json_row_delim_needed = true;
>>> + }
>>>
>>> can we do this:
>>>
>>> if (cstate->opts.force_array)
>>> {
>>> if (cstate->json_row_delim_needed)
>>> CopySendChar(cstate, ',');
>>> else
>>> {
>>> /* first row needs no delimiter */
>>> CopySendChar(cstate, ' ');
>>> cstate->json_row_delim_needed = true;
>>> }
>>> }
>>>
>> good suggestion.
>>
>> If more people think WRAP_ARRAY is better than FORCE_ARRAY, we can
>> switch to it accordingly.
>> The change itself is quite straightforward.
>
>
> I have reworked these. First I cleaned up a number of things in patches
> 2 and 3 (Thanks, Claude for the summary):
>
> Patch 2: json format for COPY TO
>
> copy.c:
>   - "json" → "JSON" in the COPY FROM rejection error message.
>
>   copyto.c:
>   1. TupleDesc setup runs once, not every row — Added
> json_tupledesc_ready flag; the memcpy/populate_compact_attribute/
> BlessTupleDesc block is now guarded by if (!cstate->json_tupledesc_ready).
>   2. Comment rewritten — Old: "the slot's TupleDesc may change during
> query execution". New: explains BlessTupleDesc registers the RECORDOID
> descriptor so lookup_rowtype_tupdesc inside composite_to_json can
>   find it.
>   3. Eliminated per-row makeStringInfo() — Added StringInfoData
> json_buf to the struct, initialized once in CopyToTextLikeStart in
> copycontext. Each row does resetStringInfo instead of allocating a new
>   StringInfo.
>   4. Column list rejection added — Error: "column selection is not
> supported in JSON mode" when attnamelist != NIL.
>   5. Improved SendCopyBegin comment — Old: "JSON format is always one
> non-binary column". New: explains each CopyData message contains one
> complete JSON object.
>
>   Tests:
>   6. Added copy copytest (style) to stdout (format json) to the error-
> case block.
>   7. Added copyjsontype table test with json, jsonb columns — verifies
> values are embedded directly, not double-encoded. Covers json objects,
> scalars, arrays, nested objects, and nulls.
>
>   Patch 3: Add option force_array
>
>   copy.c:
>   1. "json" → "JSON" in COPY FROM error (carried from patch 2).
>   2. "can only used" → "can only be used" — grammar fix in FORCE_ARRAY
> error.
>
>   copyto.c:
>   3. Struct fields reorganized — JSON fields grouped under /* JSON
> format state */ comment with inline descriptions, instead of a
> standalone json_row_delim_needed with a vague comment.
>   4. Block comment updated — Was "CSV, text and json formats share the
> same TextLike routines except for the one-row callback". Now correctly
> notes JSON has its own one-row and end callbacks.
>   5. CopyToTextLikeEnd comment fixed — Was "text, CSV, and json", now
> "text and CSV" (JSON uses CopyToJsonEnd).
>   6. Cleaner start callback — FORCE_ARRAY bracket emission moved inside
> the if (format == JSON) block after json_buf init, instead of a separate
> top-level conditional.
>   7. Inherits all patch 2 fixes — TupleDesc guard, reusable json_buf,
> improved comments, column list rejection.
>
>   Tests:
>   8. --Error → -- should fail: force_array requires json format; --ok →
> -- force_array variants.
>   9. copyjsontype test carried through from patch 2.
>
> Then I reworked the way this works. In order to support column lists
> with JSON output, we need to deal with individual columns instead of
> whole records. This involved quite a number of changes, as can be seen
> in patch 4. This involved exporting a new small function from json.c.
>
> The result is a lot cleaner, I believe, and in my benchmarking is faster
> by a factor of almost 2.

Andrew,

I don't see the actual patches. Did I miss it somewhere?

--
Joe Conway
PostgreSQL Contributors Team
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Andrew Dunstan 2026-03-05 20:49:55 Re: Emitting JSON to file using COPY TO
Previous Message Adrian Klaver 2026-03-05 16:04:36 Re: Index (primary key) corrupt?

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2026-03-05 18:17:07 Re: Consider low startup cost in add_partial_path
Previous Message Andrey Borodin 2026-03-05 18:02:45 Re: [PATCH] Simplify SortSupport implementation for macaddr