Re: [PATCH] Initial progress reporting for COPY command

From: Josef Šimánek <josef(dot)simanek(at)gmail(dot)com>
To: vignesh C <vignesh21(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: [PATCH] Initial progress reporting for COPY command
Date: 2020-06-22 10:57:53
Message-ID: CAFp7Qwq7OUkEUr=187fJSvKbkwigq-_AJXuB6ch7r__0Quy11g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

po 22. 6. 2020 v 9:15 odesílatel vignesh C <vignesh21(at)gmail(dot)com> napsal:

> On Sun, Jun 21, 2020 at 5:11 PM Josef Šimánek <josef(dot)simanek(at)gmail(dot)com>
> wrote:
> >
> > Thanks for all comments. I have updated code to support more options
> (including STDIN/STDOUT) and added some documentation.
> >
> > Patch is attached and can be found also at
> https://github.com/simi/postgres/pull/5.
> >
> > Diff version: https://github.com/simi/postgres/pull/5.diff
> > Patch version: https://github.com/simi/postgres/pull/5.patch
> >
> > I'm also attaching screenshot of HTML documentation and html
> documentation file.
> >
> > I'll do my best to get this to commitfest now.
> >
> > ne 14. 6. 2020 v 14:32 odesílatel Josef Šimánek <josef(dot)simanek(at)gmail(dot)com>
> napsal:
> >>
> >> Hello, as proposed by Pavel Stěhule and discussed on local czech
> PostgreSQL maillist (
> https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer),
> I have prepared an initial patch for COPY command progress reporting.
> >>
> >> Few examples first:
> >>
> >> "COPY (SELECT * FROM test) TO '/tmp/ids';"
> >>
> >> yr=# SELECT * from pg_stat_progress_copy;
> >> pid | datid | datname | relid | direction | file | program |
> lines_processed | file_bytes_processed
> >>
> ---------+-------+---------+-------+-----------+------+---------+-----------------+----------------------
> >> 3347126 | 16384 | yr | 0 | TO | t | f |
> 3529943 | 24906226
> >> (1 row)
> >>
> >> "COPY test FROM '/tmp/ids';
> >>
> >> yr=# SELECT * from pg_stat_progress_copy;
> >> pid | datid | datname | relid | direction | file | program |
> lines_processed | file_bytes_processed
> >>
> ---------+-------+---------+-------+-----------+------+---------+-----------------+----------------------
> >> 3347126 | 16384 | yr | 16385 | FROM | t | f |
> 121591999 | 957218816
> >> (1 row)
> >>
> >> Columns are inspired by CREATE INDEX progress report system view.
> >>
> >> pid - integer - PID of backend
> >> datid - oid - OID of related database
> >> datname - name - name of related database (this seems redundant, since
> oid should be enough, but it is the same in CREATE INDEX)
> >> relid - oid - oid of table related to COPY command, when not known (for
> example when copying to file, it is 0)
> >> direction - text - one of "FROM" or "TO" depends on command used
> >> file - bool - is file is used?
> >> program - bool - is program used?
> >> lines_processed - bigint - amount of processed lines, works for both
> directions (FROM/TO)
> >> file_bytes_processed - amount of bytes processed when file is used
> (otherwise 0), works for both direction (
> >> FROM/TO) when file is used (file = t)
> >>
> >> Patch is attached and can be found also at
> https://github.com/simi/postgres/pull/5.
> >>
>
> Few comments:
> @@ -713,6 +714,8 @@ CopyGetData(CopyState cstate, void *databuf, int
> minread, int maxread)
> break;
> }
>
> + CopyUpdateBytesProgress(cstate, bytesread);
> +
> return bytesread;
> }
>
> This is actually the read data, actual processing will happen later
> like in CopyReadLineText, it would be better if
> CopyUpdateBytesProgress is done later, if not it will give the same
> value even though it does multiple inserts on the table.
> lines_processed will keep getting updated but file_bytes_processed
> will not be updated.
>

First I would like to explain what's reported (or at least I'm trying to
get reported) at bytes_processed column.

When exporting to file it should start at 0 and end up at the actual final
file size.
When importing from file, it should do the same. You can check file size
before you start COPY FROM and get actual progress looking at
bytes_processed.

This column is just a counter of bytes read from input on COPY FROM or
amount of bytes going through COPY TO.

Thanks for the hint regarding "CopyReadLineText". I'll take a look.

For now I have tested those cases:

CREATE TABLE test(id int);
INSERT INTO test SELECT 1 FROM generate_series(1, 1000000);
COPY (SELECT * FROM test) TO '/tmp/ids';
COPY test FROM '/tmp/ids';

psql -h /tmp yr -c 'COPY (SELECT 1 from generate_series(1,100000000)) TO
STDOUT;' > /tmp/ryba.txt
echo /tmp/ryba.txt | psql -h /tmp yr -c 'COPY test FROM STDIN'

It is easy to check lines count and bytes count are in sync (since 1 line
is 2 bytes here - "1" and newline character).
I'll try to check more complex COPY commands to ensure everything is in
sync.

If you have any ideas for testing queries, feel free to suggest.

+pg_stat_progress_copy| SELECT s.pid,
> + s.datid,
> + d.datname,
> + s.relid,
> + CASE s.param1
> + WHEN 0 THEN 'TO'::text
> + WHEN 1 THEN 'FROM'::text
> + ELSE NULL::text
> + END AS direction,
> + ((s.param2)::integer)::boolean AS file,
> + ((s.param3)::integer)::boolean AS program,
> + s.param4 AS lines_processed,
> + s.param5 AS file_bytes_processed
>
> You could include pg_size_pretty for s.param5 like
> pg_size_pretty(S.param5) AS bytes_processed, it will be easier for
> users to understand bytes_processed when the data size increases.

I was looking at the rest of reporting views and for me those seem to be
just basic ones providing just raw data to be used later in custom nice
friendly human-readable views built on the client side.
For example "pg_stat_progress_basebackup" also reports "backup_streamed" in
raw form.

Anyway if you would like to make this view more user-friendly, I can add
that. Just ping me.

>
>
Regards,
> Vignesh
> EnterpriseDB: http://www.enterprisedb.com
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2020-06-22 11:00:41 Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Previous Message Amit Kapila 2020-06-22 10:56:44 Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions