Re: BUG #16300: Text line order corruption with COPY command

From: Francisco Olarte <folarte(at)peoplecall(dot)com>
To: Hans Buschmann <buschmann(at)nidsa(dot)net>
Cc: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #16300: Text line order corruption with COPY command
Date: 2020-03-13 09:26:16
Message-ID: CA+bJJbyVpnDmUovVo4+1cVx_Fg6v1giuxL+_nbRar=+JhdW0_g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hans:

On Fri, Mar 13, 2020 at 10:05 AM Hans Buschmann <buschmann(at)nidsa(dot)net> wrote:
...
> When looking into the documentation I find under SQL COPY command:
> "If a column list is specified, COPY TO copies only the data in the specified columns to the file.
> For COPY FROM, each field in the file is inserted, in order, into the specified column.
> Table columns not specified in the COPY FROM column list will receive their default values. "
> So I expected the insertion in order as said above.

And the column order ( which is what you have quoted here ) is
preserved ( easily, since you only have one).

> In my opinion it is essential to preserve the order of textfile input in COPY FROM. This also holds true when copying from another source like a program, where often it is not practicable to add an orderable column on input.

You are talking about ROW order. This have been discussed countless
time for all sort of sql databases. SQL TABLES are SETS of TUPLES
(rows), without order.

> Not preserving (and rendering on a select) the order makes the COPY FROM mostly unusable for cases where the order must be preserved. (Think of XML, JSON, Source code, Log files, Disassembly etc.).

I think you are not using SQL correctly. When you put "files" into the
database ( XML, JSON, Source Code, Disassembly above), the correct
behaviour is normally to put each file in a single field in a single
row. For Log files the thing is different. I normally do not want to
preserve import order on these, I send an explicit order, typically by
the timestamp for similar field, when reading, so that even if I
import the weekend files in sunday-saturday order I get them right,
and I can mix logs from several sources and see them ordered ( that's
why I import them when I (rarely) do ).

If you want to model TEXT files as a table of lines you need to use a
line number of similar thing, and then you have to deal with several
problems ( in your XML example, I can delete some clines in the midle
of your xml, you cannot detect it. If I preserve insertion order and
want to INSERT a line in the middle of the document, how do O do it?
). Having a text file is generally bad idea.

> The size of such kind of files can it make also impractible to take the mentioned circumventing methods you provided.

Putting big text files in the database is not generally a good idea.
The only thing you can do with those is read the lines sequentally,
the plain file excels at this.

> I have not inspected the resulting tuple orders on the physical file.
> I will try to nail down the first occurence of the disordering during the import by comparing import and export.

You can do these, but it will be useless. Tuples can move, actually
they only do as a result of some operations, but nothing guarantees
they are not going to do it on their own for some reason in the
future. AFAIK, except for updates, they do not move, and I think
vacuum full preserves ordering, but nothing guarantees you postgres 42
is not going to gain a "repack" function which moves tuples to
increase the fill factor in the first pages of the files destroying
your order.

You do not have a command to ask an sql database for the tuples in
insertion order, orderless select just mean any order. Even if they
were ordered in the file you could have something like this:
- Connection 1 ask for all the lines of a fairly huge table ( much
bigger than caches), the engine is free to return them in any order so
it peeks the easier, file order, and starts returning them.
- When c1 is 20% into the job, c2 ask for the same rows. Engine
notices c1 is already doing the same and starts sending the rows from
20-100% of the file to both connections.
- c1 gets all of them, engine rereads rows in the 0-20% range and
sends them to c2. Huge win, only 120% of data read, instead of 200% (
remember the "much bigger than cache" note ) ( engine could have
stopped c1 and reread 0-20% to let them sync, but we will have a very
unhappy c1 them ).

I'm not saying postgres does this, but one of the reason sql language
does not impose any unrequested order is to let it do it.

> My concern also goes to pg_dump and friends, which could change such not natural orderable tables through pg_dump/pg_restore.

There is not such thing as a natural orderable table. Number them or
you are in for a lot of pain. Or use some kind of database built for
text line processing, not a relational one.

Francisco Olarte.

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message David G. Johnston 2020-03-13 15:19:49 Re: BUG #16300: Text line order corruption with COPY command
Previous Message Hans Buschmann 2020-03-13 08:49:50 AW: BUG #16300: Text line order corruption with COPY command