Using a single standalone-backend run in initdb (was Re: Bootstrap DATA is a pita)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Mark Dilger <hornschnorter(at)gmail(dot)com>, Caleb Welton <cwelton(at)pivotal(dot)io>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Subject: Using a single standalone-backend run in initdb (was Re: Bootstrap DATA is a pita)
Date: 2015-12-12 22:31:49
Message-ID: 2098.1449959509@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
> BTW, there's another thing I'd like to see improved in this area, which is
> a problem already but will get a lot worse if we push more work into the
> post-bootstrap phase of initdb. That is that the post-bootstrap phase is
> both inefficient and impossible to debug. If you've ever had a failure
> there, you'll have seen that the backend spits out an entire SQL script
> and says there's an error in it somewhere; that's because it gets the
> whole per-stage script as one submission. (Try introducing a syntax error
> somewhere in information_schema.sql, and you'll see what I mean.)
> Breaking the stage scripts down further would help, but that is
> unattractive because each one requires a fresh backend startup/shutdown,
> including a full checkpoint. I'd like to see things rejiggered so that
> there's only one post-bootstrap standalone backend session that performs
> all the steps, but initdb feeds it just one SQL command at a time so that
> errors are better localized. That should both speed up initdb noticeably
> and make debugging easier.

I thought this sounded like a nice lazy-Saturday project, so I started
poking at it, and attached is a WIP patch. The core issue that has to
be dealt with is that standalone-backend mode currently has just two
rules for deciding when to stop collecting input and execute the command
buffer, and they both suck:

1. By default, execute after every newline. (Actually, you can quote
a newline with a backslash, but we don't use that ability anywhere.)

2. With -j, slurp the entire input until EOF, and execute it as one
giant multicommand string.

We're doing #2 to handle information_schema.sql and the other large
SQL scripts that initdb runs, which is why the response to an error in
those scripts is so yucky.

After some experimentation, I came up with the idea of executing any
time that a semicolon followed by two newlines is seen. This nicely
breaks up input like information_schema.sql. There are probably some
residual places where more than one command is executed in a single
string, but we could fix that with some more newlines. Obviously,
this rule is capable of being fooled if you have a newline followed by
a blank line in a comment or quoted literal --- but it turns out that
no such case exists anywhere in initdb's data.

I'm not particularly wedded to this rule. In principle we could go so
far as to import psql's code that parses commands and figures out which
semicolons are command terminators --- but that is a pretty large chunk
of code, and I think it'd really be overkill considering that initdb
deals only with fixed input scripts. But if anyone has another simple
rule for breaking SQL into commands, we can certainly discuss
alternatives.

Anyway, the attached patch tweaks postgres.c to follow that rule instead
of slurp-to-EOF when -j is given. I doubt that being non-backwards-
compatible is a problem here; in fact, I'm tempted to rip out the -j
switch altogether and just have standalone mode always parse input the
same way. Does anyone know of people using standalone mode other than
for initdb?

The other part of the patch modifies initdb to do all its post-bootstrap
steps using a single standalone backend session. I had to remove the
code that currently prints out progress markers for individual phases
of that processing, so that now you get output that looks like

creating directory /home/postgres/testversion/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
creating template1 database in /home/postgres/testversion/data/base/1 ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

Since initdb is just printing to the backend in an open-loop fashion,
it doesn't really know whether each command succeeds; in fact it is
usually pretty far ahead of the backend, until it does pclose() which
waits for the subprocess to exit. So we can't readily keep the old
progress markers. I don't think we'll miss them though. The whole
"post-bootstrap initialization" step only takes a second or so on my
main dev machine, so breaking down the progress more finely isn't that
useful anymore.

I had to change ;\n to ;\n\n in a few places in initdb's internal
scripts, to ensure that VACUUM commands were executed by themselves
(otherwise you get "VACUUM can't run in a transaction block" type
failures). I wasn't very thorough about that though, pending a
decision on exactly what the new command-boundary rule will be.

The upshot of these changes is that initdb runs about 10% faster overall
(more in -N mode), which is a useful savings. Also, the response to a
syntax error in information_schema.sql now looks like this:

creating template1 database in /home/postgres/testversion/data/base/1 ... ok
performing post-bootstrap initialization ... FATAL: column "routzine_schema" does not exist at character 243
HINT: Perhaps you meant to reference the column "routine_privileges.routine_schema".
STATEMENT:
/*
* 5.42
* ROLE_ROUTINE_GRANTS view
*/

CREATE VIEW role_routine_grants AS
SELECT grantor,
grantee,
specific_catalog,
specific_schema,
specific_name,
routine_catalog,
routzine_schema,
routine_name,
privilege_type,
is_grantable
FROM routine_privileges
WHERE grantor IN (SELECT role_name FROM enabled_roles)
OR grantee IN (SELECT role_name FROM enabled_roles);

child process exited with exit code 1
initdb: removing data directory "/home/postgres/testversion/data"

which is a *huge* improvement over what you got before.

What remains to be done before this'd be committable is updating
the documentation around the -j switch (or removing it altogether),
and going through initdb and its input scripts to ensure there's
a double newline after every nontrivial SQL command. I don't see
much point in finishing that detail work until we have consensus on
whether to use this or another command-boundary rule.

Thoughts?

regards, tom lane

Attachment Content-Type Size
initdb-improvement.patch text/x-diff 22.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2015-12-12 22:39:48 Re: PATCH: track last known XLOG segment in control file
Previous Message Peter Geoghegan 2015-12-12 22:28:50 Re: Using quicksort for every external sort run