Re: Improvements and additions to COPY progress reporting

From: Josef Šimánek <josef(dot)simanek(at)gmail(dot)com>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Subject: Re: Improvements and additions to COPY progress reporting
Date: 2021-02-09 07:02:55
Message-ID: CAFp7QwrB8qBdPsqD79mpiU=vGSvcJB2jW1Eg1WYEyjGEkH8-kQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

po 8. 2. 2021 v 19:35 odesílatel Matthias van de Meent
<boekewurm+postgres(at)gmail(dot)com> napsal:
>
> Hi,
>
> With [0] we got COPY progress reporting. Before the column names of
> this newly added view are effectively set in stone with the release of
> pg14, I propose the following set of relatively small patches. These
> are v2, because it is a patchset that is based on a set of patches
> that I previously posted in [0].

Hello. I had this in my backlog to revisit this feature as well before
the release. Thanks for picking this up.

> 0001 Adds a column to pg_stat_progress_copy which details the amount
> of tuples that were excluded from insertion by the WHERE clause of the
> COPY FROM command.
>
> 0002 alters pg_stat_progress_copy to use 'tuple'-terminology instead
> of 'line'-terminology. 'Line' doesn't make sense in the binary copy
> case, and only for the 'text' copy format there can be a guarantee
> that the source / output file actually contains the reported amount of
> lines, whereas the amount of data tuples (which is also what it's
> called internally) is guaranteed to equal for all data types.
>
> There was some discussion about this in [0] where the author thought
> 'line' is more consistent with the CSV documentation, and where I
> argued that 'tuple' is both more consistent with the rest of the
> progress reporting tables and more consistent with the actual counted
> items: these are the tuples serialized / inserted (as noted in the CSV
> docs; "Thus the files are not strictly one line per table row like
> text-format files.").

As an mentioned author I have no preference over line or tuple
terminology here. For some cases "line" terminology fits better, for
some "tuple" one. Docs can be improved later if needed to make it
clear at some cases (for example in most common case probably - CSV
import/export) tuple equals one line in CSV.

> Patch 0003 adds backlinks to the progress reporting docs from the docs
> of the commands that have progress reporting (re/index, cluster,
> vacuum, etc.) such that progress reporting is better discoverable from
> the relevant commands, and removes the datname column from the
> progress_copy view (that column was never committed). This too should
> be fairly trivial and uncontroversial.
>
> 0004 adds the 'command' column to the progress_copy view; which
> distinguishes between COPY FROM and COPY TO. The two commands are (in
> my opinion) significantly different enough to warrant this column;
> similar to the difference between CREATE INDEX/REINDEX [CONCURRENTLY]
> which also report that information. I believe that this change is
> appropriate; as the semantics of the columns change depending on the
> command being executed.

This was part of my initial patch as well, but I decided to strip it
out to make the final patch as small as possible to make it quickly
mergeable without need of further discussion. From my side this is
useful to have directly in the progress report as well.

> Lastly, 0005 adds 'io_target' to the reported information, that is,
> FILE, PROGRAM, STDIO or CALLBACK. Although this can relatively easily
> be determined based on the commands in pg_stat_activity, it is
> reasonably something that a user would want to query on, as the
> origin/target of COPY has security and performance implications,
> whereas other options (e.g. format) are less interesting for clients
> that are not executing that specific COPY command.

Similar (simplified, not supporting CALLBACK) info was also part of
the initial patch and stripped out later. I'm also +1 on this info
being useful to have directly in the progress report.

> Of special interest in 0005 is that it reports the io_target for the
> logical replications' initial tablesyncs' internal COPY. This would
> otherwise be measured, but no knowledge about the type of copy (or its
> origin) would be available on the worker's side. I'm not married to
> this patch 0005, but I believe it could be useful, and therefore
> included it in the patchset.

All patches seem good to me. I was able to apply them to current clean
master and "make check" has succeeded without problems.

>
> With regards,
>
> Matthias van de Meent.
>
>
> [0] https://www.postgresql.org/message-id/flat/CAFp7Qwr6_FmRM6pCO0x_a0mymOfX_Gg%2BFEKet4XaTGSW%3DLitKQ%40mail.gmail.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2021-02-09 07:12:32 Re: Improvements and additions to COPY progress reporting
Previous Message Bharath Rupireddy 2021-02-09 06:57:21 Re: Is Recovery actually paused?