Re: Make COPY format extendable: Extract COPY TO format implementations

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Sutou Kouhei <kou(at)clear-code(dot)com>
Cc: zhjwpku(at)gmail(dot)com, andrew(at)dunslane(dot)net, michael(at)paquier(dot)xyz, nathandbossart(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Make COPY format extendable: Extract COPY TO format implementations
Date: 2024-01-30 02:11:59
Message-ID: CAD21AoBmNiWwrspuedgAPgbAqsn7e7NoZYF6gNnYBf+gXEk9Mg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jan 29, 2024 at 6:45 PM Sutou Kouhei <kou(at)clear-code(dot)com> wrote:
>
> Hi,
>
> In <CAEG8a3Jnmbjw82OiSvRK3v9XN2zSshsB8ew1mZCQDAkKq6r9YQ(at)mail(dot)gmail(dot)com>
> "Re: Make COPY format extendable: Extract COPY TO format implementations" on Mon, 29 Jan 2024 11:37:07 +0800,
> Junwang Zhao <zhjwpku(at)gmail(dot)com> wrote:
>
> >> > > Does it make sense to pass only non-builtin options to the custom
> >> > > format callback after parsing and evaluating the builtin options? That
> >> > > is, we parse and evaluate only the builtin options and populate
> >> > > opts_out first, then pass each rest option to the custom format
> >> > > handler callback. The callback can refer to the builtin option values.
> >>
> >> What I imagined is that while parsing the all specified options, we
> >> evaluate builtin options and we add non-builtin options to another
> >> list. Then when parsing a non-builtin option, we check if this option
> >> already exists in the list. If there is, we raise the "option %s not
> >> recognized" error.". Once we complete checking all options, we pass
> >> each option in the list to the callback.
>
> I implemented this idea and the following ideas:
>
> 1. Add init callback for initialization
> 2. Change GetFormat() to FillCopyXXXResponse()
> because JSON format always use 1 column
> 3. FROM only: Eliminate more cstate->opts.csv_mode branches
> (This is for performance.)
>
> See the attached v9 patch set for details. Changes since v7:
>
> 0001:
>
> * Move CopyToProcessOption() calls to the end of
> ProcessCopyOptions() for easy to option validation
> * Add CopyToState::CopyToInit() and call it in
> ProcessCopyOptionFormatTo()
> * Change CopyToState::CopyToGetFormat() to
> CopyToState::CopyToFillCopyOutResponse() and use it in
> SendCopyBegin()

Thank you for updating the patch! Here are comments on 0001 patch:

---
+ if (!format_specified)
+ /* Set the default format. */
+ ProcessCopyOptionFormatTo(pstate, opts_out, cstate, NULL);
+

I think we can pass "text" in this case instead of NULL. That way,
ProcessCopyOptionFormatTo doesn't need to handle NULL case.

We need curly brackets for this "if branch" as follows:

if (!format_specifed)
{
/* Set the default format. */
ProcessCopyOptionFormatTo(pstate, opts_out, cstate, NULL);
}

---
+ /* Process not built-in options. */
+ foreach(option, unknown_options)
+ {
+ DefElem *defel = lfirst_node(DefElem, option);
+ bool processed = false;
+
+ if (!is_from)
+ processed =
opts_out->to_routine->CopyToProcessOption(cstate, defel);
+ if (!processed)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("option \"%s\" not recognized",
+ defel->defname),
+ parser_errposition(pstate,
defel->location)));
+ }
+ list_free(unknown_options);

I think we can check the duplicated options in the core as we discussed.

---
+static void
+CopyToTextBasedInit(CopyToState cstate)
+{
+}

and

+static void
+CopyToBinaryInit(CopyToState cstate)
+{
+}

Do we really need separate callbacks for the same behavior? I think we
can have a common init function say CopyToBuitinInit() that does
nothing. Or we can make the init callback optional.

The same is true for process-option callback.

---
List *convert_select; /* list of column names (can be NIL) */
+ const CopyToRoutine *to_routine; /* callback
routines for COPY TO */
} CopyFormatOptions;

I think CopyToStateData is a better place to have CopyToRoutine.
copy_data_dest_cb is also there.

---
- if (strcmp(fmt, "text") == 0)
- /* default format */ ;
- else if (strcmp(fmt, "csv") == 0)
- opts_out->csv_mode = true;
- else if (strcmp(fmt, "binary") == 0)
- opts_out->binary = true;
+
+ if (is_from)
+ {
+ char *fmt = defGetString(defel);
+
+ if (strcmp(fmt, "text") == 0)
+ /* default format */ ;
+ else if (strcmp(fmt, "csv") == 0)
+ {
+ opts_out->csv_mode = true;
+ }
+ else if (strcmp(fmt, "binary") == 0)
+ {
+ opts_out->binary = true;
+ }
else
- ereport(ERROR,
-
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
- errmsg("COPY format
\"%s\" not recognized", fmt\),
-
parser_errposition(pstate, defel->location)));
+ ProcessCopyOptionFormatTo(pstate,
opts_out, cstate, defel);

The 0002 patch replaces the options checks with
ProcessCopyOptionFormatFrom(). However, both
ProcessCopyOptionFormatTo() and ProcessCOpyOptionFormatFrom() would
set format-related options such as opts_out->csv_mode etc, which seems
not elegant. IIUC the reason why we process only the "format" option
first is to set the callback functions and call the init callback. So
I think we don't necessarily need to do both setting callbacks and
setting format-related options together. Probably we can do only the
callback stuff first and then set format-related options in the
original place we used to do?

---
+static void
+CopyToTextBasedFillCopyOutResponse(CopyToState cstate, StringInfoData *buf)
+{
+ int16 format = 0;
+ int natts = list_length(cstate->attnumlist);
+ int i;
+
+ pq_sendbyte(buf, format); /* overall format */
+ pq_sendint16(buf, natts);
+ for (i = 0; i < natts; i++)
+ pq_sendint16(buf, format); /* per-column formats */
+}

This function and CopyToBinaryFillCopyOutResponse() fill three things:
overall format, the number of columns, and per-column formats. While
this approach is flexible, extensions will have to understand the
format of CopyOutResponse message. An alternative is to have one or
more callbacks that return these three things.

---
+ /* Get info about the columns we need to process. */
+ cstate->out_functions = (FmgrInfo *) palloc(num_phys_attrs *
sizeof(Fmgr\Info));
+ foreach(cur, cstate->attnumlist)
+ {
+ int attnum = lfirst_int(cur);
+ Oid out_func_oid;
+ bool isvarlena;
+ Form_pg_attribute attr = TupleDescAttr(tupDesc, attnum - 1);
+
+ getTypeOutputInfo(attr->atttypid, &out_func_oid, &isvarlena);
+ fmgr_info(out_func_oid, &cstate->out_functions[attnum - 1]);
+ }

Is preparing the out functions an extension's responsibility? I
thought the core could prepare them based on the overall format
specified by extensions, as long as the overall format matches the
actual data format to send. What do you think?

---
+ /*
+ * Called when COPY TO via the PostgreSQL protocol is
started. This must
+ * fill buf as a valid CopyOutResponse message:
+ *
+ */
+ /*--
+ * +--------+--------+--------+--------+--------+ +--------+--------+
+ * | Format | N attributes | Attr1's format |...| AttrN's format |
+ * +--------+--------+--------+--------+--------+ +--------+--------+
+ * 0: text 0: text 0: text
+ * 1: binary 1: binary 1: binary
+ */

I think this kind of diagram could be missed from being updated when
we update the CopyOutResponse format. It's better to refer to the
documentation instead.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Richard Guo 2024-01-30 02:33:19 Re: Support run-time partition pruning for hash join
Previous Message Euler Taveira 2024-01-30 02:01:38 Re: speed up a logical replica setup