Re: Proposal to improve the content in subsection 8.16.6. "Composite Type Input and Output Syntax"

From: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
To: Bryn Llewellyn <bryn(at)yugabyte(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-docs(at)lists(dot)postgresql(dot)org
Subject: Re: Proposal to improve the content in subsection 8.16.6. "Composite Type Input and Output Syntax"
Date: 2020-04-06 15:21:01
Message-ID: CAKFQuwZ9BZNUp4YjpszEZRJaVjDXZcHnRe+xPHLf=zSVMt__+g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-docs

On Mon, Apr 6, 2020 at 12:46 AM Bryn Llewellyn <bryn(at)yugabyte(dot)com> wrote:

> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> ...
>
>
> It's already an example, so I don't see this as an improvement.
>
> ...
>
>
> The proposed additional text is flat-out wrong.
>
> What actually happens here is that text between quotes is considered
> quoted (so that, for example, commas within it are not field separators),
> but that does not exclude there being other unquoted text within the
> same field value.
>
> regards, tom lane
>
> Thanks, Tom. Your reply was very helpful. It prompted me to read, re-read,
> and then re-re-read 8.16.6 several times. I did notice that neither of
> “varchar” nor “char” are found in the whole of section 8.16
>

> But I suppose that it’s obvious to the reader that “text” is to be taken
> as a term of art, denoting “character data types” and not as the name of
> one of these data types.
>

The word "text" in Tom's paragraph isn't talking about a data type at all.
Its talking about symbols.

>
> Second: for this specified target:
> > hello "you" <
>
> I found that each of these six literals produced the same result:
> ( hello \"you\” )
> ("" hello \"you\" "")
> (" hello \"you\" ")
> (" hello ""you"" ")
> ( "hello \"you\"" )
> ( "hello ""you""" )
>
> The result was canonically rendered in psql thus:
> (" hello ""you"" ")
>
> This observation tells me that this:
> ""
> serves as the syntax to escape a single double quote within a surrounding
> double quote pair, as an alternative to this:
> \"
>
> I searched in the whole of section 8.16 for this:
> ""
> but I found its meaning defined only as the empty string when it occurs
> outside of a surrounding double quote pair. I assume that it’s obvious to
> most readers that it has a second meaning in the other context.
>

You need to expand your search to more than just the literal symbol written
"".....

Quoting from the 8.16.6 docs.....

"To put a double quote or backslash in a quoted composite field value,
precede it with a backslash. (Also, a [emp]pair of double quotes within a
double-quoted field value is taken to represent a double quote
character[emp], analogously to the rules for single quotes in SQL literal
strings.)"

"Double quotes and backslashes embedded in field values will be doubled."

I found that, with is extra one rule, I could explain all my observations.
>
> I observed, over all my tests, that there always exists a canonical form
> like this:
> ("...")
> where the ellipsis indicates a well-formed sequence of characters that
> produces my specified target.
>
> I’m going to elevate this to a hypothesis.
>
> And unless/until it’s disproved, I’m going to adopt a rule of practice
> (for myself) always to use this form.
>

The " " just means first treat the underlying content as being textual (but
untyped). Since every type has an input function that accepts textual data
and converts it to the relevant type this works.

However, you may wish to explore other data types comprising your composite
and consider boundary cases (empty string or missing/null) variations
before solidifying your hypothesis.

For example at least consider why the following fails.

create type rtint as (a int);
select '("")'::rtint;

>
> *literal-for-array-of-records.sql*
> *--------------------------------*
> This was designed to help me meet my ultimate goal: I need to write a
> client program (say, in python) that will process data from a file whose
> semantic content is sufficient to let me populate the “rt[]” field in one
> row of a table, given by “type rt as (n numeric, s text, t timestamp, b
> booleaan)”.
>
> I’m afraid I couldn’t find the rules in the Version 11 PG doc. This
> probably reflects my poor searching skills.
>

The immediately preceding chapter in the documentation discusses arrays at
the same level of detail and definitional authority as the chapter on
composite types. Its left as an exercise to the reader to deal with the
nesting dynamic - in particular the escaping of special characters
especially those common to both syntaxes.

https://www.postgresql.org/docs/12/arrays.html

n | s | t | b
> -------+---------------------+---------------------+---
> 19.13 | "Hello", she\he\r +| 2019-07-21 16:47:00 | f
> | (said | is saying). | |
>
> The “\r”, along with the “+” that I’m used to, seem to be artifacts of how
> psql displays what I created with chr(10). But only in this special
> context. (This "select 'a'||chr(10)||’b'” produces only the “+” but not
> the “\r”. Strange.)
>

Welcome to the fun world of line break (newline) character sequences. Just
for fun:

On a typewriter if you want to begin a new line your machine has to
physically do two things:
1 - move the piece of paper up one row (line feed) - this command code was
assigned chr(10) and is abbreviated \n
2 - move the strike bar back to the starting position (carriage return) -
code chr(13) and is abbreviated \r

Windows faithfully reproduced the typewriter action and defines the two
character sequence chr(13) + chr(10) as the newline operation.

Linux (its predecessors) took a more practical approach and decided since a
computer isn't a typewriter it doesn't need both characters to represent a
new line operation, one is sufficient and "line feed" makes the most sense
so chr(10) is chosen.

Now onto psql - when it encounters a linefeed in the value data instead of
printing out the escape it actually breaks the content being printed into
lines. It also prints a plus sign (+) and the end of the line.

This seems to be under-documented. I'm also unsure whether the behavior
has been normalized for different platforms.

>
> I discovered that the output is produced by this:
>
> {
>
> "(42.17,dog,\"2020-02-17 20:55:00\",t)",
> "(19.13,\"\"\"Hello\"\", she\\\\he
> (said | is saying).\",\"2019-07-21 16:47:00\",f)"
>
> }
>
> I am able to formulate the rules to produce this programmatically (in
> Python) starting with my target values expressed as Python variables.
> Here’s how I did it. Notice, that for simplicity here, “rt” has just a
> single text field. The proper code has to put the surrounding double
> quootes and parentheses around the comma-separated list of escaped field
> representations.
>
>
If you accept that only PostgreSQL is going to read the resulting data you
might want to consider taking the recommendation at the end of the big note
in 8.16.6 and output dollar quotes instead of double quotes - at least in
the array construction case.

David J.

In response to

Browse pgsql-docs by date

  From Date Subject
Next Message Jeff Davis 2020-04-07 18:20:46 Default setting for enable_hashagg_disk
Previous Message Bryn Llewellyn 2020-04-06 07:46:21 Re: Proposal to improve the content in subsection 8.16.6. "Composite Type Input and Output Syntax"