Quick Links

WIP: a way forward on bootstrap data

From:	John Naylor <jcnaylor(at)gmail(dot)com>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	WIP: a way forward on bootstrap data
Date:	2017-12-04 10:03:27
Message-ID:	CAJVSVGWO48JbbwXkJz_yBFyGYW-M9YWxnPdxJBUosDC9ou_F0Q@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

I was looking through the archives one day for a few topics that
interest me, and saw there was continued interest in making bootstrap
data less painful [1] [2]. There were quite a few good ideas thrown
around in those threads, but not much in the way of concrete results.
I took a few of them as a starting point and threw together the
attached WIP patchset.

==
An overview (warning: long):

1 through 3 are small tweaks worth doing even without a data format
change. 4 through 7 are separated for readability and flexibility, but
should be understood as one big patch. I tried to credit as many
people's ideas as possible. Some things are left undone until basic
agreement is reached.

--
Patch 1 - Minor corrections

--
Patch 2 - Minor cleanup

Be more consistent style-wise, change a confusing variable name, fix
perltidy junk.

--
Patch 3 - Remove hard-coded schema information about pg_attribute from
genbki.pl.

This means slightly less code maintenance, but more importantly it's a
proving ground for mechanisms used in later patches.

1. Label a column's default value in the catalog header [3].

Note: I implemented it in the simplest way possible for now, which
happens to prevents a column from having both FORCE_(NOT_)NULL and a
default at the same time, but I think in practice that would almost
never matter. If more column options are added in the future, this
will have to be rewritten.

2. Add a new function to Catalog.pm to fill in a tuple with default
values. It will complain loudly if it can't find either a default or a
given value, so change the signature of emit_pgattr_row() so we can
pass a partially built tuple to it.

3. Format the schema macro entries according to their types.

4. Commit 8137f2c32322c624e0431fac1621e8e9315202f9 labeled which
columns are variable length. Expose that label so we can exclude those
columns from schema macros in a general fashion.

--
Patch 4 - Infrastructure for the data conversion

1. Copy a modified version of Catalogs() from Catalog.pm to a new
script that turns DATA()/(SH)DESCR() statements into serialized Perl
data structures in pg_*.dat files, preserving comments along the way.
Toast and index statements are unaffected. Although it's a one-off as
far as the core project is concerned, I imagine third-parties might
want access to this tool, so it's in the patch and not separate.

2. Remove data parsing from the original Catalogs() function and
rename it to reflect its new, limited role of extracting the schema
info from a header. The data files are handled by a new function.

3. Introduce a script to rewrite pg_*.dat files in a standard format
[4], strip default values, preserve comments and normalize blank
lines. It can also change default values on the fly. It is intended to
be run when adding new data.

4. Add default values to a few catalog headers for demonstration
purposes. There might be some questionable choices here. Review is
appreciated.

Note: At this point, the build is broken.

TODO: See what pgindent does to the the modified headers.

--
Patch 5 - Mechanical data conversion

This is the result of:

cd src/include/catalog
perl convert_header2dat.pl list-of-catalog-headers-from-the-Makefile
perl -I ../../backend/catalog rewrite_dat.pl *.dat
rm *.bak

Note: The data has been stripped of all double-quotes for readability,
since the Perl hash literals have single quotes everywhere. Patches 6
and 7 restore them where needed.

--
Patch 6 - Hand edits

Re-doublequote values that are macros expanded by initdb.c, remove
stray comments, fix up whitespace, and do a minimum of comment editing
to reflect the new data format.

At this point the files are ready to take a look at. Format is the
central issue, of course. I tried to structure things so it wouldn't
be a huge lift to bikeshed on the format. See pg_authid.dat for a
conveniently small example. Each entry is 1 or 2 lines long, depending
on whether oid or (sh)descr is present.

Regarding pg_proc.dat, I think readability is improved, and to some
extent line length, although it's not great:

pg_proc.h: avg=175, stdev=25
pg_proc.dat: avg=159, stdev=43

(grep -E '^DATA' pg_proc.h | awk '{print length}'
grep prosrc pg_proc.dat | awk '{print length}')

Many lines now fit in an editor window, but the longest line has
ballooned from 576 chars to 692. I made proparallel default to 'u' for
safety, but the vast majority are 's'. If we risked making 's' the
default, most entries would shrink by 20 chars. On the other hand,
human-readable types would inflate that again, but maybe patch 8 below
can help with editing. There was some discussion about abbreviated
labels that were mapped to column names - I haven't thought about that
yet.

--
Patch 7 - Distprep scripts

1. Teach genbki.pl and Gen_fmgrtab.pl to read the data files, and
arrange for the former to double-quote certain values so bootscanner.l
can read them.

2. Introduce Makefile dependencies on the data files.

The build works now.

Note: Since the original DATA() entries contained some (as far as I
can tell) useless quotes, the postgres.bki diff won't be zero, but it
will be close.

Performance note: On my laptop, running Make in the catalog dir with
no compilation goes from ~700ms on master to ~1000ms with the new data
files.

--
Patch 8 - SQL output

1. Write out postgres.sql, which allows you to insert all the source
BKI data into a development catalog schema for viewing and possibly
bulk-editing [5]. It retains oids, and (sh)descr fields in their own
columns, and implements default values.

2. Make it a distclean target.

TODO: Find a way to dump dev catalog tuples into the canonical format.

--
Not implemented yet:

-Gut the header files of DATA() statements. I'm thinking we should
keep the #defines in the headers, but see below:
-Update README and documentation

--
Future work:
-More lookups for human-readable types, operators, etc.
-Automate pg_type #defines a la pg_proc [6], which could also be used
to maintain ecpg's copy of pg_type #defines.

--
[1] https://www.postgresql.org/message-id/flat/20150220234142.GH12653%40awork2.anarazel.de

[2] https://www.postgresql.org/message-id/flat/CAGoxFiFeW64k4t95Ez2udXZmKA%2BtazUFAaSTtYQLM4Jhzw%2B-pg%40mail.gmail.com

[3] https://www.postgresql.org/message-id/20161113171017.7sgaqdeq6jslmsr3%40alap3.anarazel.de

[4] https://www.postgresql.org/message-id/D8F1D509-6498-43AC-BEFC-052DFE16847A%402ndquadrant.com

[5] https://www.postgresql.org/message-id/20150304150712.GV29780%40tamriel.snowman.net

[6] https://www.postgresql.org/message-id/15697.1479161432%40sss.pgh.pa.us

I'll add this to the next commitfest soon.

-John Naylor

Attachment	Content-Type	Size
0001_corrections_v1.patch	text/x-patch	5.3 KB
0002_cleanup_v1.patch	text/x-patch	10.2 KB
0003_pgattr_schema_isolation_v1.patch	text/x-patch	17.7 KB
0004_conversion_scripts_and_headers_v1.patch	text/x-patch	53.7 KB
0005_data_files_mechanical_v1.patch.tar.gz	application/x-gzip	106.0 KB
0006_data_files_hand_edits_v1.patch	text/x-patch	110.8 KB
0007_update_distprep_scripts_v1.patch	text/x-patch	14.4 KB
0008_dev_sql_v1.patch	text/x-patch	10.3 KB

Responses

Re: WIP: a way forward on bootstrap data at 2017-12-13 09:06:35 from John Naylor

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Raúl Marín Rodríguez	2017-12-04 10:06:07	Re: [HACKERS] pow support for pgbench
Previous Message	Raúl Marín Rodríguez	2017-12-04 09:27:39	Re: [HACKERS] pgbench more operators & functions