Re: Documentation for bootstrap data conversion

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, John Naylor <jcnaylor(at)gmail(dot)com>
Subject: Re: Documentation for bootstrap data conversion
Date: 2018-04-06 19:42:34
Message-ID: 20180406194234.ekjzliqeq4prwnpr@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2018-04-06 14:27:39 -0400, Tom Lane wrote:
> John and I are probably both too close to the patch to be able to
> review this documentation for clarity and usefulness, so if anyone
> else wants to have a look, please comment.

Quick skim only:

> diff --git a/doc/src/sgml/bki.sgml b/doc/src/sgml/bki.sgml
> index 33378b4..1e7915e 100644
> --- a/doc/src/sgml/bki.sgml
> +++ b/doc/src/sgml/bki.sgml
> @@ -1,38 +1,641 @@
> <!-- doc/src/sgml/bki.sgml -->
>
> <chapter id="bki">
> - <title><acronym>BKI</acronym> Backend Interface</title>
> + <title>System Catalog Declarations and Initial Contents</title>
>
> <para>
> - Backend Interface (<acronym>BKI</acronym>) files are scripts in a
> - special language that is understood by the
> - <productname>PostgreSQL</productname> backend when running in the
> - <quote>bootstrap</quote> mode. The bootstrap mode allows system catalogs
> - to be created and filled from scratch, whereas ordinary SQL commands
> - require the catalogs to exist already.
> - <acronym>BKI</acronym> files can therefore be used to create the
> - database system in the first place. (And they are probably not
> - useful for anything else.)
> + <productname>PostgreSQL</productname> uses many different system catalogs
> + to keep track of the existence and properties of database objects, such as
> + tables and functions. Physically there is no difference between a system
> + catalog and a plain user table, but the backend C code knows the structure
> + and properties of each catalog, and can manipulate it directly at a low
> + level. Thus, for example, it is inadvisable to attempt to alter the
> + structure of a catalog on-the-fly; that would break assumptions built into
> + the C code about how rows of the catalog are laid out. But developers
> + can change the structure of catalogs in new major versions.
> </para>

"developers" here could possibly be understood to be any sort of
developer, rather than postgres ones. Perhaps just say "But the
structure of the catalogs can change between major versions."?

> <para>
> - Related information can be found in the documentation for
> - <application>initdb</application>.
> + Many of the catalogs have initial data that must be loaded into them
> + during the <quote>bootstrap</quote> phase
> + of <application>initdb</application>, to bring the system up to a point
> + where it is capable of executing SQL commands. (For
> + example, <filename>pg_class.h</filename> must contain an entry for itself,
> + as well as one for each other system catalog and index.) Much of this
> + initial data is kept in editable form in data files that are also stored
> + in the <filename>src/include/catalog/</filename> directory. For example,
> + <filename>pg_proc.dat</filename> describes all the initial rows that must
> + be inserted into the <structname>pg_proc</structname> catalog.
> </para>
>
> + <para>
> + To create the catalog files and load this initial data into them, a
> + backend running in bootstrap mode reads a <acronym>BKI</acronym>
> + (Backend Interface) file containing commands and initial data.
> + The <filename>postgres.bki</filename> file used in this mode is prepared
> + from the aforementioned header and data files, by a Perl script
> + named <filename>genbki.pl</filename>, while building
> + a <productname>PostgreSQL</productname> distribution.
> + Although it's specific to a particular <productname>PostgreSQL</productname>
> + release, <filename>postgres.bki</filename> is platform-independent and is
> + normally installed in the <filename>share</filename> subdirectory of the
> + installation tree.
> + </para>

"normally"?

> + <sect1 id="system-catalog-declarations">
> + <title>System Catalog Declaration Rules</title>
> +
> + <para>
> + The key part of a catalog header file is a C structure definition
> + describing the layout of each row of the catalog. This begins with
> + a <literal>CATALOG</literal> macro, which so far as the C compiler is
> + concerned is just shorthand for <literal>typedef struct
> + FormData_<replaceable>catalogname</replaceable></literal>.
> + Each field in the struct gives rise to a catalog column.
> + Fields can be annotated using the BKI property macros described
> + in <filename>genbki.h</filename>, to define a default value, mark the
> + field as nullable or not nullable, or specify a lookup rule that allows
> + OID values to be represented symbolically in the
> + corresponding <filename>.dat</filename> file.

This sounds like an exhaustive list of what genbki.h allows - that seems
likely to get out of date...

> + <para>
> + The system catalog cache code (and most catalog-munging code in general)
> + assumes that the fixed-length portions of all system catalog tuples are
> + in fact present, because it maps this C struct declaration onto them.
> + Thus, all variable-length fields and nullable fields must be placed at
> + the end, and they cannot be accessed as struct fields.
> + For example, if you tried to
> + set <structname>pg_type</structname>.<structfield>typrelid</structfield>
> + to be NULL, it would fail when some piece of code tried to reference
> + <literal>typetup-&gt;typrelid</literal> (or worse,
> + <literal>typetup-&gt;typelem</literal>, because that follows
> + <structfield>typrelid</structfield>). This would result in
> + random errors or even segmentation violations.
> + </para>
> +
> + <para>
> + As a partial guard against this type of error, variable-length or
> + nullable fields should not be made directly visible to the C compiler.
> + This is accomplished by wrapping them in <literal>#ifdef
> + CATALOG_VARLEN</literal> ... <literal>#endif</literal>. This prevents C
> + code from carelessly trying to dereference fields that might not be
> + there.

Not just, also using an entirely wrong offset, no?

> As an independent guard against creating incorrect rows, we
> + require that all columns that should be non-nullable are marked so
> + in <structname>pg_attribute</structname>. The bootstrap code will
> + automatically mark catalog columns as <literal>NOT NULL</literal>
> + if they are fixed-width and are not preceded by any nullable column.
> + Where this rule is inadequate, you can force correct marking by using
> + <literal>BKI_FORCE_NOT_NULL</literal>
> + and <literal>BKI_FORCE_NULL</literal> annotations as needed. But note
> + that <literal>NOT NULL</literal> constraints are only enforced in the
> + executor, not against tuples that are generated by random C code,
> + so care is still needed when manually creating or updating catalog rows.
> + </para>

Hm. Not about docs, but I wonder if we should write a sanity check
regression test that makes sure that's not violated from somewhere?

> + <sect1 id="system-catalog-initial-data">
> + <title>System Catalog Initial Data</title>
> +
> + <para>
> + Each catalog that has any manually-created initial data (some do not)
> + has a corresponding <literal>.dat</literal> file that contains its
> + initial data in an editable format.
> + </para>
> +
> + <sect2 id="system-catalog-initial-data-format">
> + <title>Data File Format</title>
> +
> + <para>
> + Each <literal>.dat</literal> file contains Perl data structure literals
> + that are simply eval'd to produce an in-memory data structure consisting
> + of an array of hash references, one per catalog row.
> + A slightly modified excerpt from <filename>pg_database.dat</filename>
> + will demonstrate the key features:
> + </para>
> +
> +<programlisting>
> +[
> +
> +# LC_COLLATE and LC_CTYPE will be replaced at initdb time with user choices
> +# that might contain non-word characters, so we must double-quote them.

Hm. Couldn't we get rid of that requirement and do the escaping in
genbki? Seems awkward, failure prone (people will forget and it'll often
work during development) and unnecessary in the new format.

> + <listitem>
> + <para>
> + Null values are represented by <literal>_null_</literal>.
> + </para>
> + </listitem>

wonder if it'd be more natural to use $null or such for this kind of thing.

> + <listitem>
> + <para>
> + If the catalog's <literal>.h</literal> file specifies a default
> + value for a column, and a data entry has that same
> + value, <filename>reformat_dat_file.pl</filename> will omit it from
> + the data file. This keeps the data representation compact.
> + </para>
> + </listitem>

This'll be fun if we ever decide to change a default :)

> + <sect2 id="system-catalog-oid-assignment">
> + <title>OID Assignment</title>
> +
> + <para>
> + A catalog row appearing in the initial data can be given a
> + manually-assigned OID by writing an <literal>oid
> + =&gt; <replaceable>nnnn</replaceable></literal> metadata field.
> + Furthermore, if an OID is assigned, a C macro for that OID can be
> + created by writing an <literal>oid_symbol
> + =&gt; <replaceable>name</replaceable></literal> metadata field.
> + </para>
> +
> + <para>
> + Pre-loaded catalog rows must have preassigned OIDs if there are OID
> + references to them in other pre-loaded rows.

Why is that?

> A preassigned OID is
> + also needed if the row's OID must be referenced from C code.
> + If neither case applies, the <literal>oid</literal> metadata field can
> + be omitted, in which case the bootstrap code assigns an OID
> + automatically, or leaves it zero in a catalog that has no OIDs.
> + In practice we usually preassign OIDs for all or none of the pre-loaded
> + rows in a given catalog, even if only some of them are actually
> + cross-referenced.
> + </para>

I think we'd reduce the pain of maintaining uncommitted patches across a
few CFs if this were a more relaxed rule.

> + <para>
> + To find an available OID for a new pre-loaded row, run the
> + script <filename>src/include/catalog/unused_oids</filename>.
> + It prints inclusive ranges of unused OIDs (e.g., the output
> + line <quote>45-900</quote> means OIDs 45 through 900 have not been
> + allocated yet).

Might be worthwhile to not (or fix) that one has to be inside the
src/include/catalog/ directory to run it?

This is quite the nice improvement!

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2018-04-06 20:20:45 Re: WIP: a way forward on bootstrap data
Previous Message Marina Polyakova 2018-04-06 19:26:47 Re: Add support for printing/reading MergeAction nodes