Re: Using a single standalone-backend run in initdb (was Re: Bootstrap DATA is a pita)

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Caleb Welton <cwelton(at)pivotal(dot)io>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Subject: Re: Using a single standalone-backend run in initdb (was Re: Bootstrap DATA is a pita)
Date: 2015-12-13 11:39:40
Message-ID: CAMsr+YEPN0R91U=bBeWHKT2Xi2aji6PXwFkVgvM88oLi5bOmAA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 13 December 2015 at 06:31, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> I'm not particularly wedded to this rule. In principle we could go so
> far as to import psql's code that parses commands and figures out which
> semicolons are command terminators --- but that is a pretty large chunk
> of code, and I think it'd really be overkill considering that initdb
> deals only with fixed input scripts.

Shouldn't that be a bison/flex job anyway, rather than hand-coded? Or a
simple(ish) state machine?

Dollar-quoted strings are probably the only quite ugly bit due to their
arbitrary delimiters. So I thought I'd sketch out how it'd look as a state
machine. At which point I remembered that we allow $ in identifiers too. So
the state machine would have to bother with unquoted identifiers. Of course
$ marks parameters, so it has to keep track of if it's reading a parameter.
At which point you have half an SQL parser.

This strikes me as a really good reason for making it re-usable, because
it's horrid to write code that handles statement splitting in the
PostgreSQL dialect.

Optional handling of psql \commands would be required, but that'd make it
easier for PgAdmin to support psql backslash commands, so there'd be a win
there too.

I figured I'd sketch it out for kicks. Comment: yuck.

States would be at least:

SQL_TEXT
SQL_UNQUOTED_IDENTIFIER_OR_KEYWORD
NUMBER
QUOTED_IDENTIFIER
QUOTED_IDENTIFIER_QUOTE
SQL_TEXT_DOLLAR
DOLLAR_STRING_START_DELIM
DOLLAR_STRING
DOLLAR_STRING_DOLLAR
DOLLAR_STRING_END_DELIM
STANDARD_STRING
STANDARD_STRING_QUOTE
SQL_TEXT_E
ESCAPE_STRING
ESCAPE_STRING_ESCAPE

Transitions

SQL_TEXT => { SQL_TEXT, SQL_UNQUOTED_IDENTIFIER_OR_KEYWORD, NUMBER,
QUOTED_IDENTIFIER, SQL_TEXT_DOLLAR, STANDARD_STRING, SQL_TEXT_E,
ESCAPE_STRING }

SQL_TEXT_E => { SQL_TEXT, ESCAPE_STRING, SQL_UNQUOTED_IDENTIFIER_OR_KEYWORD
}

SQL_TEXT_DOLLAR => { SQL_TEXT, NUMBER, DOLLAR_STRING_START_DELIM }

QUOTED_IDENTIFIER => { QUOTED_IDENTIFIER, QUOTED_IDENTIFIER_QUOTE }

QUOTED_IDENTIFIER_QUOTE => { SQL_TEXT, QUOTED_IDENTIFIER }

DOLLAR_STRING_START_DELIM => { DOLLAR_STRING_START_DELIM, DOLLAR_STRING }

DOLLAR_STRING => { DOLLAR_STRING, DOLLAR_STRING_DOLLAR }

DOLLAR_STRING_END_DELIM => { DOLLAR_STRING_END_DELIM, SQL_TEXT,
DOLLAR_STRING }

STANDARD_STRING => { STANDARD_STRING, STANDARD_STRING_QUOTE }

STANDARD_STRING_QUOTE => { SQL_TEXT, STANDARD_STRING }

ESCAPE_STRING => { ESCAPE_STRING, ESCAPE_STRING_ESCAPE }

ESCAPE_STRING_ESCAPE => { SQL_TEXT, ESCAPE_STRING }

NUMBER consumes sequential digits and period chars and returns to SQL_TEXT
at any non-digit. (That way it can handle Pg's lazy parser quirks like
SELECT 123"f" being legal, and equivalent to SELECT 123 AS "f").

SQL_UNQUOTED_IDENTIFIER_OR_KEYWORD is needed because a $ within an
identifier is part of the identifier so it can't just be consumed as
SQL_TEXT .

For dollar strings, when a $ is found when reading SQL text (not an
identifier/keyword), enter SQL_TEXT_DOLLAR. What comes next must be a
parameter or the start of a dollar string. If the next char is a digit then
it's a parameter so switch to NUMBER, since dollar-quoted string delims
follow identifier rules and unquoted identifiers can't start with a number.
Otherwise switch to DOLLAR_STRING_START_DELIM and consume until a $ is
found or something illegal in an identifier is found. Or of course it could
be lone $ which is bogus syntax but as far as we're concerned still just
SQL_TEXT. Really, this is just looking for a dollar-quote start and doesn't
care what it finds if it isn't a valid dollar-quote start.

If a valid dollar-quote delim is found switch to DOLLAR_STRING and read
until we find the matching delim using a similar process, entering
DOLLAR_STRING_DOLLAR, looking for param vs end delim, etc. When a full
delimiter is read compare to the start delimiter and switch back to
SQL_TEXT mode if it matches, otherwise remain in DOLLAR_STRING.

If an invalid dollar string delim was found switch back to SQL_TEXT (since
it wasn't a valid beginning of a dollar string) and continue.

For QUOTED_IDENTIFIER_QUOTE and STANDARD_STRING_QUOTE, it found a " or '
while reading a quoted identifier or standard string and transitioned into
that state. If the next char doubles the quote it'll return to reading the
string; otherwise it'll return to the SQL_TEXT state since the identifier
or literal has ended.

Similarly with ESCAPE_STRING_ESCAPE. Having found an escape, consume the
next char even if it's a quote and return to the string parsing.

All this ignores psql backslash commands.

Have I missed anything really obvious? Does it seem useful to have more
re-usable statement splitting code? Is there any sane justification for
doing anything but extracting what psql does verbatim while carefully
changing nothing?

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2015-12-13 11:53:30 Re: Logical replication and multimaster
Previous Message Simon Riggs 2015-12-13 11:39:32 Re: Logical replication and multimaster