Re: Bootstrap DATA is a pita

From: Caleb Welton <cwelton(at)pivotal(dot)io>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bootstrap DATA is a pita
Date: 2015-12-10 19:47:51
Message-ID: CAOjayEdTiViVianMEsXCZ2w9a+DYi40Jch66887+v9=TuLukLg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello Hackers,

Reviving an old thread on simplifying the bootstrap process.

I'm a developer from the GPDB / HAWQ side of the world where we did some
work a while back to enable catalog definition via SQL files and we have
found it valuable from a dev perspective. The mechanism currently in those
products is a bit.. convoluted where SQL is processed in perl to create the
existing DATA statements, which are then processed as they are today in
Postgres... I wouldn't suggest this route, but having worked with both the
DATA mechanism and the SQL based one I've certainly found SQL to be a more
convenient way of interacting with the catalog.

I'd propose:
- Keep enough of the existing bootstrap mechanism functional to get a
small tidy core, essentially you need enough of pg_type, pg_proc, pg_class,
pg_attribute to support the 25 types used by catalog tables and most
everything else can be moved into SQL processing like how system_views.sql
is handled today.

The above was largely proposed back in March and rejected based on
concerns that

1. initdb would be slower.
2. It would introduce too much special purpose bootstrap cruft into the
code.
3. Editing SQL commands is not comfortable in bulk

On 1.

I have a prototype that handles about 1000 functions (all the functions in
pg_proc.h that are not used by other catalog tables, e.g. pg_type,
pg_language, pg_range, pg_aggregate, window functions, pg_ts_parser, etc).

All of initdb can be processed in 1.53s. This compares to 1.37s with the
current bootstrap approach. So yes, this is slower, but not 'noticeably
slower' - I certainly didn't notice the 0.16s until I saw the concern and
then timed it.

On 2.

So far the amount of cruft has been:
- Enabling adding functions with specific OIDs when creating functions.
1 line changes in pg_aggregate.c, proclang.c, typecmds.c
about dozen lines of code in functioncmds.c
3 lines changed in pg_proc.c
- Update the fmgr_internal_validator for builtin functions while the
catalog is mutable
3 lines changed in pg_proc.c
- Update how the builtin function cache is built
Some significant work in fmgr.c that honestly still needs cleanup
before it would be ready to propose as a patch that would be worthy of
committing.
- Update how builtin functions are resolved outside of bootstrap
Minor updates to dynloader for lookup of symbols within the current
executable, so far I've only done darwin.c for my prototype, this would
need to be extended to the other ports.
- Initializitation of the builtin cache
2 line change in postinit.c
- Addition of a stage in initdb to process the sql directives similar in
scope to the processing of system_views.sql.

No changes needed in the parser, planner, etc. My assessment is that this
worry is not a major concern in practice with the right implementation.

On 3.

Having worked with both SQL and bki DATA directives I have personally found
the convenience of SQL outweighs the pain. In many cases changes, such as
adding a new column to pg_proc, have minimal impact on the SQL
representation and what changes are needed are often simple to implement.
E.g. accounting for COST only needs to be done for the functions that need
something other than the default value. This however is somewhat
subjective.

On the Pros side:

a. Debugging bootstrap is extremely painful, debugging once initdb has
gotten to 'postgres --single' is way easier.

b. It is easier to introduce minor issues with DATA directives than it is
when using the SQL processing used for all other user objects.

Example: currently in Postgres all builtin functions default to COST 1,
and all SQL functions default to cost 100. However the following SQL
functions included in bootstrap inexplicably are initialized with a COST of
1:
age(timestamp with time zone)
age(timestamp without time zone)
bit_length(bytea)
bit_length(text)
bit_length(bit)
date_part(text, abstime)
date_part(text, reltime)
date_part(text, date)
... and 26 other examples

c. SQL files are significantly less of a PITA (subjective opinion, but I
can say this from a perspective of experience working with both DATA
directives and SQL driven catalog definition).

If people have interest I can share my patch so far if that helps address
concerns, but if there is not interest then I'll probably leave my
prototype where it is rather than investing more effort in the proof of
concept.

Thanks,
Caleb

On Sat, Mar 7, 2015 at 5:20 PM, <pgsql-hackers-owner(at)postgresql(dot)org> wrote:

> Date: Sat, 7 Mar 2015 23:46:54 +0100
> From: Andres Freund <andres(at)2ndquadrant(dot)com>
> To: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
> Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com
> >,
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Peter Eisentraut <peter_e(at)gmx(dot)net>,
> Josh Berkus <josh(at)agliodbs(dot)com>,
> "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
> Subject: Re: Bootstrap DATA is a pita
> Message-ID: <20150307224654(dot)GC12213(at)awork2(dot)anarazel(dot)de>
>
> On 2015-03-07 16:43:15 -0600, Jim Nasby wrote:
> > Semi-related... if we put some special handling in some places for
> bootstrap
> > mode, couldn't most catalog objects be created using SQL, once we got
> > pg_class, pg_attributes and pg_type created? That would theoretically
> allow
> > us to drive much more of initdb with plain SQL (possibly created via
> > pg_dump).
>
> Several people have now made that suggestion, but I *seriously* doubt
> that we actually want to go there. The overhead of executing SQL
> commands in comparison to the bki stuff is really rather
> noticeable. Doing the majority of the large number of insertions via SQL
> will make initdb noticeably slower. And it's already annoyingly
> slow. Besides make install it's probably the thing I wait most for
> during development.
>
> That's besides the fact that SQL commands aren't actually that
> comfortably editable in bulk.
>
> Greetings,
>
> Andres Freund
>
> --
> Andres Freund http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2015-12-10 19:56:33 Re: Inaccurate results from numeric ln(), log(), exp() and pow()
Previous Message Simon Riggs 2015-12-10 19:47:30 Re: Speedup twophase transactions