Re: Additional Chapter for Tutorial

From: Jürgen Purtz <juergen(at)purtz(dot)de>
To: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Justin Pryzby <pryzby(at)telsasoft(dot)com>
Subject: Re: Additional Chapter for Tutorial
Date: 2020-11-15 18:45:35
Message-ID: b4124734-9d3f-849d-b408-b122b624772c@purtz.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-docs pgsql-hackers

On 10.11.20 00:14, David G. Johnston wrote:
> Reviewed the first three sections.
>
> template0 - I would remove the schema portions of this and simply note
> this as being a pristine recovery database in the diagram.
ok
>
> I would drop the word "more" and just say "system schemas".  I would
> drop pg_toast from the list of system schema and focus on the three
> user-facing ones.
ok
>
> Instead of "my_schema" (optional) I would do "my_schema" (example)
The terms 'optional' and 'default' are used at various places with their
literal meaning. We shall not change them.
>
> Server Graphic
> #3 Global SQL Objects: Objects which are shared among all databases
> within a cluster.
> #6 Client applications are prohibited from connecting to template0
ok
> #1 If by you we mean "the client" saying that you work "in the cluster
> data" doesn't really help.  I would emphasize the point that the
> client sees an endpoint the Postmaster publishes as a port or socket
> file and that plus the database name defines the endpoint the client
> connects to (meld with #5)
ok, with some changes.
>
> In lieu of some of the existing detail provided about structure I
> would add information about configuration and search_path at this level.
Search path appended. But IMO configuration questions are out of scope
of this sub-chapter.
>
> I like the object type enumeration - I would suggest grouping them by
> type in a manner consistent with the documentation and making each one
> a link to its "primary" section - the SQL Command reference if all
> else fails.
ok. But don't how to group them in a better way.
>
> The "i" in internal in 51.3 (the image) needs capitalization).
ok
>
> You correctly add both Extension and Collation as database-level
> objects but they are not mentioned anywhere else.  They do belong here
> and need to be tied in properly in the text.
Have some courage to the gap, it's an introductory chapter.
>
> The whole thing needs a good pass focused on capitalization.  Both for
> typos and to decide when various primary concepts like Instance should
> be capitalized and when not.
'Instance' and 'Cluster' are now uppercase because of their importance,
everything else lowercase for better reading.
>
> 51.4 - When you look at the diagram seeing /pg/data/base looks really
> cool, but when reading the prose where both the "pg" and the "base"
> are omitted and all you get are repeated references to "data", the
> directory name choice becomes an issue IMO.  I suggest (and changed
> the attached) to name the actual root directory "pgdata".  You should
> change the /pg/ directory name to something like ".../tutorial_project/".

The graphic shall reflect the default behavior of PG. Without the
parameter -D, initdb creates the new cluster in the directory where
PGDATA points to. This is in many cases |/var/lib/pgsql/data|. Therefore
'data' and its subdirectory 'base' are not my invention but reflects the
default situation.

(Diving a little deeper into this issue I noticed that there is a
parameter 'cluster_name' in the config file. But it does not change the
name of the cluster's root directory, it only changes the names of the
running processes. Choosing 'instance_name' instead of 'cluster_name' as
the parameter's name would be a better choice imo - but that is not what
we are speaking about in the context of the new chapter).

I changed the very first directory in the graphic to visualize the
standard behavior; I reverted your recommendation to use 'pgdata'
instead of 'data' in the text part.

> Since you aren't following alphabetical order anyway I would place
> pg_tblspc after globals since tablespaces are globals and thus
> proximity links them here - and pointing out that pg_tblspc holds the
> data makes stating that global doesn't contain tablespace data
> unnecessary.
ok
>
> Maybe point out somewhere the the "base/databaseOID" directory
> represents the default tablespace for each database, which isn't
> "global", only the non-default tablespaces are considered globals (or
> just get rid of the mentioned on "non-default tablespace" for now).

ok

more:

1) some changes concerning the nature of connections (52.2: logical
perspective). IMO accessing multiple databases within one connection is
not a question of configuring, you have to take more actions. But I'm
not sure we should mention this at all.

2) you propose to cancel or trim down the paragraphs behind figure 51.2.
(cluster, database, schema). I believe that a textual description of
this hierarchy is essential for the understanding of the system. Because
it isn't described explicitly at a different place, it should remain.

--- snipp -------- from other e-mail ----

> MVCC Section
>
> The first paragraph and example in the MVCC section is a good example
> but seems misplaced - its relationship to MVCC generally is tenuous,
> rather I would expect a discussion of the serializable isolation mode
> to follow.
>
> I'm not sure how much detail this section wants to get into given the
> coverage of concurrency elsewhere in the documentation.  "Not much"
> would be my baseline.
The paragraph focus on the fact that new row versions are generated
instead of locking something. Explaining serialization isolation modes
is imo very complicate and out of the scope of this subchapter. If we
want to give an overview - in addition to the exiting documentation - it
should be a separate subchapter.
>
> I would suggest spelling out what "OLTP" stands for and ideally
> pointing the user to the glossary for the term.
ok, but not added to glossary. The given explanation "... with a massive
number of concurrent write actions" should be sufficient.
>
> Tending more toward a style gripe but the amount of leader phrases and
> redundancy are at a level that I am noticing them when I read this but
> do not have the same impression having read large portions of
> documentation. In particular:
Because I'm not a native English speaker, orthographic and style hits
are always welcome.
>
> "When we speak about transaction IDs, you need to know that xids are
> like sequences."
>
> "But keep in mind that xids are independent of any time measurement —
> in milliseconds or otherwise. If you dive deeper into PostgreSQL, you
> will recognize parameters with names such as 'xxx_age'. Despite their
> names, these '_age' parameters do not specify a period of time but
> represent a certain number of transactions, e.g., 100 million."
>
> Could just be:  xids are sequences and age computations involving them
> measure a transaction count as opposed to a time interval.
ok
>
> Then I would consider adding a bit more detail/context here.
>
> xids are 32bit sequences, with a reserved value to handle
> wrap-around.  There are 4 billion values in the sequence but
> wrap-around handling must occur every 2 billion transactions. Age
> computations involving xids measure a transaction count as opposed to
> a time interval.
>
> I would move the mentioning of "vacuum" to the main paragraph about
> delete and not solely as a "keep in mind" note.
The mentioning here at the food of the page is a crossover to the next
subchapter.
>
> The part before the diagram seems like it should be much shorter,
> concise, and provide links to the excellent documentation.  The part
> after the image, and the image itself, are good material, though
> possibly should be in a main administration chapter instead of an
> internals chapter.

vacuum: The problem - and one reason for the existence of this
subchapter - is that vacuum's documentation is scattered across may pages:

19.4: parameters to configure the server, especially five parameters
'vacuum_cost_xxx'.

19.10: parameters to configure autovacuum.

19.11: parameters to configure client connections, especially five
parameters 'vacuum_xxx' concerning their freeze-behavior.

24.1: explains the general necessity of (auto)vacuum and their strategies.

The page about the SQL command VACUUM explains the different options
(FULL, FREEZE, ..) and their meaning.

Because of the structure of our documentation as well as the complexity
of the issue that's ok. The existing documentation describes every
parameter very well, but I'm missing a page where the 'big picture' of
vacuum is explained (not necessarily here). It shall show the
relationship between the huge number of parameters and an explanation
*why* they exists. As far as we don't have such a page within the vacuum
documentation the proposed subchapter fills the gap. (The provided
graphics can be included multiple times without generating redundancies
- here and at arbitrary other places.)

>
> The first bullet of "keep in mind" is both wordy and wrong - in
> particular "as xids grow old row versions get out of scope over time"
> doesn't make sense (or rather it only does in the context of
> wrap-around, not normal visibility).  Having the only mention of bloat
> be here is also not ideal, it too should be weaved into the main
> narrative.  The "keep in mind" section here should be a recap of
> already covered material in a succinct form, nothing should be new to
> someone who just read the entire section.
ok.
>
> I don't think that usage of exclamation marks (!) is warranted here,
> though emphasis on the key phrase wouldn't hurt.
ok
>
> Vacuum Section
>
> avoid -> prevent (continued growth)
ok
>
> Autovacuum is enabled by default.  The whole note needs commas.
ok
>
> I'd try to get rid of "at arbitrary point in time"
ok
>
> "Instance." we've already described where instances are previously
> ("on the server")
ok
>
> The other sections - these seem misplaced for the tutorial, update the
> main documentation if this information is wholly missing or lacking. 
> The MVCC chapter can incorporate overview information as it is a
> strict consequence of that implementation.
>
> Statistics belong elsewhere - the tutorial should not use poor command
> implementation choices as a guide for user education.
>
> In short, this whole section should not exist and its content moved to
> more appropriate areas (mainly MVCC).  Vacuum is a tool that one must
> use but the narrative should be about the system generally.
>
>
concerning vacuum section: see my comments above

concerning 'the other sections' (transactions, reliability, backup
(plus: someone should add 'replication', I'm not familiar with this
issue)): The intention of the chapter is to give a *summary* about PG's
essential architecture and about central implementation aspects. This
implies that the chapters does not present any new information. They
shall only show (or repeat) essential things in their context and
explain *why* they are used. In this sense the three chapters may be
reasonable. Concerning this, I like to hear some comments from other people.

Attachments:

0013-architecture.patch: complete patch vs. master

0013-architecture.sgml.diff: changes in file architecture.sgml since 0012

0013-images.diff: changes in files *-raw.svg since 0012

--

J. Purtz

Attachment Content-Type Size
0013-architecture.patch text/x-patch 228.0 KB
0013-architecture.sgml.diff text/x-patch 28.3 KB
0013-images.diff text/x-patch 14.5 KB

In response to

Responses

Browse pgsql-docs by date

  From Date Subject
Next Message Fujii Masao 2020-11-16 15:26:29 doc: cosmetic changes in index items
Previous Message David G. Johnston 2020-11-15 06:25:43 Re: 42.6.8 trapping errors

Browse pgsql-hackers by date

  From Date Subject
Next Message Victor Yegorov 2020-11-15 22:29:08 Re: Deleting older versions in unique indexes to avoid page splits
Previous Message Tomas Vondra 2020-11-15 18:22:59 Re: list of extended statistics on psql