Re: Proposal to improve the content in subsection 8.16.6. "Composite Type Input and Output Syntax"

From: Bryn Llewellyn <bryn(at)yugabyte(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-docs(at)lists(dot)postgresql(dot)org
Subject: Re: Proposal to improve the content in subsection 8.16.6. "Composite Type Input and Output Syntax"
Date: 2020-04-06 07:46:21
Message-ID: 07F46765-C942-4926-94D1-D1FBE156C877@yugabyte.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-docs

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> ...

It's already an example, so I don't see this as an improvement.

> ...

The proposed additional text is flat-out wrong.

What actually happens here is that text between quotes is considered
quoted (so that, for example, commas within it are not field separators),
but that does not exclude there being other unquoted text within the
same field value.

regards, tom lane

Thanks, Tom. Your reply was very helpful. It prompted me to read, re-read, and then re-re-read 8.16.6 several times. I did notice that neither of “varchar” nor “char” are found in the whole of section 8.16. But I suppose that it’s obvious to the reader that “text” is to be taken as a term of art, denoting “character data types” and not as the name of one of these data types.

I ran some fairly exhaustive empirical tests. Not to be mysterious, I created two “.sql” files. The zip is attached.

record_literal.sql
------------------
This uses "type rt as (v text)” and "table t(k serial primary key, literal text, r rt)”.
It inserts a series of text values into column “literal” as is and into column “r” typecast to “rt”.
I aimed to end up with specified target values shown by “select '>'||(r).v||’<‘ from t”.

Here are two example findings,

First: for this specified target:
>hello<

I found that each of these five literals produced the same result:
(hello)
("hello")
(h\ello)
("h\ello")
(""hello"")

The result was canonically rendered in psql (by its implicit typecast to “text:) thus:
(hello)

Second: for this specified target:
> hello "you" <

I found that each of these six literals produced the same result:
( hello \"you\” )
("" hello \"you\" "")
(" hello \"you\" ")
(" hello ""you"" ")
( "hello \"you\"" )
( "hello ""you""" )

The result was canonically rendered in psql thus:
(" hello ""you"" ")

This observation tells me that this:
""
serves as the syntax to escape a single double quote within a surrounding double quote pair, as an alternative to this:
\"

I searched in the whole of section 8.16 for this:
""
but I found its meaning defined only as the empty string when it occurs outside of a surrounding double quote pair. I assume that it’s obvious to most readers that it has a second meaning in the other context.

I found that, with is extra one rule, I could explain all my observations.

I observed, over all my tests, that there always exists a canonical form like this:
("...")
where the ellipsis indicates a well-formed sequence of characters that produces my specified target.

I’m going to elevate this to a hypothesis.

And unless/until it’s disproved, I’m going to adopt a rule of practice (for myself) always to use this form.

literal-for-array-of-records.sql
--------------------------------
This was designed to help me meet my ultimate goal: I need to write a client program (say, in python) that will process data from a file whose semantic content is sufficient to let me populate the “rt[]” field in one row of a table, given by “type rt as (n numeric, s text, t timestamp, b booleaan)”.

I’m afraid I couldn’t find the rules in the Version 11 PG doc. This probably reflects my poor searching skills. However, I did find a clue here:

stackoverflow.com/questions/7363516/how-to-formulate-an-array-literal-of-a-composite-type-containing-arrays

I used the approach that it showed me to construct an example value for the “rt[]” field bottom-up using a “DO” block. And then I printed the canonical form that represets the value. Here’s an example:

For an array with these two values (as shown in psql):

n | s | t | b
-------+-----+---------------------+---
42.17 | dog | 2020-02-17 20:55:00 | t

n | s | t | b
-------+---------------------+---------------------+---
19.13 | "Hello", she\he\r +| 2019-07-21 16:47:00 | f
| (said | is saying). | |

The “\r”, along with the “+” that I’m used to, seem to be artifacts of how psql displays what I created with chr(10). But only in this special context. (This "select 'a'||chr(10)||’b'” produces only the “+” but not the “\r”. Strange.)

I discovered that the output is produced by this:

{

"(42.17,dog,\"2020-02-17 20:55:00\",t)",
"(19.13,\"\"\"Hello\"\", she\\\\he
(said | is saying).\",\"2019-07-21 16:47:00\",f)"

}

I am able to formulate the rules to produce this programmatically (in Python) starting with my target values expressed as Python variables. Here’s how I did it. Notice, that for simplicity here, “rt” has just a single text field. The proper code has to put the surrounding double quootes and parentheses around the comma-separated list of escaped field representations.

def fix_1(s):
# Do the escaping needed for a stringy value.
s = s.replace(backslash, two_backslashes)
s = s.replace(quote, two_quotes)
return s

def fix_2(s):
# Do the escaping to fix the bare record representation for use as a array element.
s = s.replace(backslash, two_backslashes)
s = s.replace(quote, backslash_quote)
return s

def format(s):
s = '("' + fix_1(s) + '")'
s = fix_2(s)
return s

I found out, along the way, that while it isn’t necessary to surround, say, a numeric value with double quotes when it takes its place in the literal for a “rt” value, neither is it harmful to do this. In the same way, the code that escapes special characters will have no effect, but will be harmless. I therefore decided not to implement extra logic to prepare such items differently from text items. I appreciate that my program is therefore slower than it might be.

Long story short, my goal is met. Thanks again.

In response to

Responses

Browse pgsql-docs by date

  From Date Subject
Next Message David G. Johnston 2020-04-06 15:21:01 Re: Proposal to improve the content in subsection 8.16.6. "Composite Type Input and Output Syntax"
Previous Message Alvaro Herrera 2020-04-05 22:07:15 Re: Add A Glossary