diff --git a/doc/src/sgml/architecture.sgml b/doc/src/sgml/architecture.sgml
index b7589f9a4f..4bbd6abb8a 100644
--- a/doc/src/sgml/architecture.sgml
+++ b/doc/src/sgml/architecture.sgml
@@ -6,8 +6,11 @@
Every DBMS implements basic strategies to ensure a fast
and robust system. This chapter provides an overview of the
- techniques PostgreSQL uses to
- achieve this.
+ basic techniques PostgreSQL uses to
+ achieve this aim. It does not go beyond the information
+ available elsewhere in this documentation. Instead, it tries to
+ explain why certain implementation
+ decisions have been taken.
@@ -28,10 +31,11 @@
- The first step when an Instance starts is the start of the
+ All aspects of an Instance are launched and managed using a single primary
+ process termed the
Postmaster.
- It loads the configuration files, allocates Shared Memory, and
- starts the other processes of the Instance:
+ It loads configuration files, allocates Shared Memory, and
+ starts the other collaborating processes of the Instance:
Background Writer,
Checkpointer,
WAL Writer,
@@ -39,9 +43,10 @@
Autovacuum,
Statistics Collector,
Logger, and more.
- Later, the Postmaster starts
+ Later, the Postmaster listens on its configured port and in response
+ to client connection attempts launches
Backend processes
- which communicate with clients and handle their requests.
+ to which it delegates authentication, communication, and the handling of their requests.
visualizes the processes
of an Instance and the main aspects of their collaboration.
@@ -62,14 +67,6 @@
-
- When a client application tries to connect to a
- database,
- this request is handled initially by the Postmaster. It
- starts a new Backend process, which handles all further
- client's requests.
-
-
Client requests like SELECT or
UPDATE usually lead to the
@@ -77,16 +74,16 @@
by the client's backend process. Reads involve a page-level
cache, located in Shared Memory (for details see:
) for the benefit of all processes
- in the instance. Writes also use this cache, in addition
+ in the Instance. Writes also use this cache, in addition
to a journal, called the write-ahead-log or WAL.
Shared Memory is limited in size and it can become necessary
to evict pages. As long as the content of such pages hasn't
- changed, this is not a problem. But in Shared Memory also
- write actions take place. Modified pages are called dirty
- pages or dirty buffers and before they can be evicted they
+ changed, this is not a problem. But writes directly modify
+ the pages in Shared Memory. Modified pages are called dirty
+ pages (or dirty buffers) and before they can be evicted they
must be written to disk. This happens regularly by the
Checkpointer and the Background Writer processes to ensure
that the disk version of the pages are up-to-date.
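The interaction of the page cache, dirty buffers, and eviction described above can be sketched as a toy model. All names and structures below are illustrative assumptions, not PostgreSQL's actual buffer manager:

```python
# Toy model of a page-level cache: writes also go through the cache,
# and a dirty page must be flushed to disk before it can be evicted.

class BufferCache:
    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk          # page_id -> content (the "files")
        self.pages = {}           # page_id -> content (Shared Memory)
        self.dirty = set()        # page_ids modified only in memory

    def read(self, page_id):
        if page_id not in self.pages:
            self._make_room()
            self.pages[page_id] = self.disk[page_id]
        return self.pages[page_id]

    def write(self, page_id, content):
        self.read(page_id)        # writes happen in the cached page
        self.pages[page_id] = content
        self.dirty.add(page_id)   # page is now a "dirty buffer"

    def _make_room(self):
        if len(self.pages) < self.capacity:
            return
        victim = next(iter(self.pages))
        if victim in self.dirty:  # dirty pages are flushed before eviction
            self.disk[victim] = self.pages[victim]
            self.dirty.discard(victim)
        del self.pages[victim]

disk = {1: "a", 2: "b", 3: "c"}
cache = BufferCache(capacity=2, disk=disk)
cache.write(1, "a'")
cache.read(2)
cache.read(3)          # evicts page 1, flushing it to disk first
print(disk[1])         # -> a'
```

In the real system the flushing is done asynchronously by the Checkpointer and the Background Writer rather than at eviction time; the invariant is the same: a modified page may not be dropped from memory before its content reaches disk.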
@@ -98,7 +95,7 @@
WAL record
is created from the delta-information (difference between the
old and the new content) and stored in another area of
- Shared Memory. The parallel running WAL Writer process
+ Shared Memory. The concurrently running WAL Writer process
reads them and appends them to the end of the current
WAL file.
Such sequential writes are faster than writes to random
@@ -108,8 +105,8 @@
- Second, the transfer of dirty buffers from Shared Memory to
- files must take place. This is the primary task of the
+ Second, the Instance transfers dirty buffers from Shared Memory to
+ files. This is the primary task of the
Background Writer process. Because I/O activities can block
other processes, it starts periodically and
acts only for a short period. Doing so, its extensive (and
@@ -123,14 +120,8 @@
Checkpoints.
A Checkpoint is a point in time when all older dirty buffers,
all older WAL records, and finally a special Checkpoint record
- are written and flushed to disk. Heap and index files,
- and WAL files are now in sync.
- Older WAL is no longer required. In other words,
- a possibly occurring recovery, which integrates the delta
- information of WAL into heap and index files, will happen
- by replaying only WAL past the last-recorded checkpoint.
- This limits the amount of WAL to be replayed
- during recovery in the event of a crash.
+ are written and flushed to disk.
+ Older WAL files are no longer required to recover the system from a crash.
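Why older WAL becomes unnecessary can be illustrated with a small simulation (the record layout is a made-up assumption, not PostgreSQL's on-disk format): crash recovery replays only the WAL records past the last checkpoint, because everything older is already flushed.

```python
# Toy WAL replay: recovery applies only the records that follow
# the last checkpoint record. Record tuples are illustrative only.

wal = [
    ("update", "page1", "v1"),
    ("update", "page2", "v1"),
    ("checkpoint",),              # all older changes are on disk by now
    ("update", "page1", "v2"),    # only this record must be replayed
]

def recover(disk, wal):
    # find the last checkpoint and replay everything after it
    last_cp = max(i for i, rec in enumerate(wal) if rec[0] == "checkpoint")
    for _, page, value in wal[last_cp + 1:]:
        disk[page] = value
    return disk

disk = {"page1": "v1", "page2": "v1"}   # state as of the checkpoint
print(recover(disk, wal))               # -> {'page1': 'v2', 'page2': 'v1'}
```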
@@ -141,8 +132,10 @@
less common). Options and details are covered
in the backup and restore section ().
For our purposes here, just note that the WAL Archiver process
- can be enabled and configured to run a script on filled WAL
- files — usually to copy them to a remote location.
+ can be enabled and configured to run a script on completed WAL
+ files — usually to copy them to a remote location. Note
+ that a WAL file is completed when it reaches its fixed size
+ or when a switch to a new WAL file is forced.
@@ -163,23 +156,35 @@
The logical Perspective: Cluster, Database, Schema
- A server contains one or more
- database clusters
- (clusters
- for short). Each cluster contains three or more
- databases.
- Each database can contain many
- schemas.
- A schema can contain
- tables,
+ A Server contains one or more
+ Database Clusters
+ (Clusters
+ for short). By default each newly initialized Cluster contains three
+ databases
+ (one interactive and two templates, see ).
+ Each database can contain many user-writable
+ schemas
+ (by default a single one, named public), the
+ system-generated user-facing schemas pg_catalog,
+ pg_temp, and information_schema,
+ and some more system schemas.
+ Tables,
views, and a lot
- of other objects. Each table or view belongs to a single schema
- only; they cannot belong to another schema as well. The same is
- true for the schema/database and database/cluster relation.
+ of other objects uniquely reside in a single schema.
visualizes
this hierarchy.
+
+
+ A client connection is always made to a single database and
+ can access all of that database's schemas. Special techniques
+ like foreign data wrappers
+ or dblink are required
+ to access multiple databases, even within the same Cluster,
+ from a single client connection.
+
+
- A cluster is the outer container for a
+ A Cluster is the outer container for a
collection of databases. Clusters are created by the command
.
template0 is the very first
- database of any cluster. Database template0
- is created during the initialization phase of the cluster.
+ database of any Cluster. It
+ is created during the initialization phase of the Cluster.
In a second step, database template1 is generated
as a copy of template0, and finally database
postgres is generated as a copy of
template1. Any
new databases
- of the cluster that a user might need,
+ of the Cluster that a user might need,
such as my_db, will be copied from the
template1 database. Due to the unique
role of template0 as the pristine original
- of all other databases, no client can connect to it.
+ of all other databases, no client is allowed to connect to it.
- Every database must contain at least one schema because all
SQL Objects
- must be contained in a schema.
+ are contained in a schema.
Schemas are namespaces for SQL objects and ensure
(with one exception) that the SQL object names are used only once within
their scope across all types of SQL objects. E.g., it is not possible
@@ -243,26 +247,46 @@
without using an explicit schema name. public
should not contain user-defined SQL objects. Instead, it is
recommended to create a separate schema that holds individual
- objects like application-specific tables or views.
+ objects like application-specific tables or views. To access
+ objects in such a schema they can be fully qualified, e.g.
+ my_schema.my_table, or by changing the
+ schema search path.
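Resolving an unqualified name via the schema search path amounts to probing each schema in order, while a qualified name bypasses the path. The following toy resolver is an illustrative assumption, not the server's actual lookup code:

```python
# Toy schema name resolution: qualified names go straight to their
# schema; unqualified names are probed along the search path.

schemas = {
    "pg_catalog": {"pg_class"},
    "my_schema": {"my_table"},
    "public": {"my_table", "other"},
}

def resolve(name, search_path):
    if "." in name:                     # fully qualified: schema.object
        schema, obj = name.split(".", 1)
        return (schema, obj) if obj in schemas.get(schema, ()) else None
    for schema in search_path:          # first match on the path wins
        if name in schemas.get(schema, ()):
            return (schema, name)
    return None

print(resolve("my_table", ["my_schema", "public"]))  # -> ('my_schema', 'my_table')
print(resolve("public.my_table", ["my_schema"]))     # -> ('public', 'my_table')
```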
+
+
+ pg_catalog is a schema for all tables and views of the
System Catalog.
- information_schema is a schema for several
- tables and views of the System Catalog in a way that conforms
- to the SQL standard.
+ information_schema is a similar schema. It
+ contains several tables and views of the System Catalog in a
+ way that conforms to the SQL standard.
- There are many different SQL object
- types: database, schema, table, view, materialized
- view, index, constraint, sequence, function, procedure,
- trigger, role, data type, operator, tablespace, extension,
- foreign data wrapper, and more. A few of them, the
+ There are many different SQL object types:
+ database,
+ schema,
+ table,
+ view,
+ materialized view,
+ index,
+ constraint,
+ sequence,
+ function,
+ procedure,
+ trigger,
+ role,
+ data type,
+ operator,
+ tablespace,
+ extension,
+ foreign data wrapper,
+ and more. A few of them, the
Global SQL Objects, are outside of the
strict hierarchy: All database names,
all tablespace names, and all
role names are automatically
- available throughout the cluster, independent from
- the database or schema in which they were defined originally.
+ available throughout the Cluster, not just the database in which
+ the SQL command was executed.
shows the relation between the object types.
@@ -286,7 +310,7 @@
- The physical Perspective: Directories and Files
+ The Physical Perspective: Directories and Files

 PostgreSQL organizes long-lasting (persistent)
@@ -297,7 +321,7 @@
variable PGDATA points to this directory.
The example shown in
uses
- data as the name of this root directory.
+ data as the name of the cluster's root directory.
- data contains many subdirectories and
+ The cluster's root directory contains many subdirectories and
some files, all of which are necessary to store long-lasting
- as well as temporary data. The following paragraphs
- describe the files and subdirectories in
- data.
+ as well as temporary data. The root's name can be chosen
+ freely, but the names of its subdirectories and files
+ are largely fixed and determined by
+ PostgreSQL. The following
+ paragraphs describe the most important subdirectories
+ and files.
- base is a subdirectory in which one
- subdirectory per database exists. The names of those
+ base contains one
+ subdirectory per database. The names of those
subdirectories consist of numbers. These are the internal
Object Identifiers (OID), which are numbers to identify
- the database definition in the
+ their definition in the
System Catalog.
- Within the database-specific
- subdirectories, there are many files: one or more for
- every table and every index to store heap and index
- data. Those files are accompanied by files for the
+ Within the database-specific subdirectories of base
+ there are many files: one or more for every table
+ and every index. Those files are accompanied by files for the
Free Space Maps
(suffixed _fsm) and
Visibility Maps
@@ -345,20 +371,25 @@
- Another subdirectory is global which
+ Another subdirectory is global. It
contains files with information about
Global SQL Objects.
- One type of such Global SQL Objects are
- tablespaces.
- In global there is information about
- the tablespaces; not the tablespaces themselves.
+
+
+
+ In pg_tblspc, there are symbolic links
+ that point to directories that are outside of the root
+ directory tree, e.g. on a different disk. Files for tables
+ and indexes of non-default tablespaces reside there. As
+ previously mentioned, those defined within the default
+ tablespace reside in the database-specific subdirectories.
The subdirectory pg_wal contains the
WAL files.
They arise and grow in parallel with data changes in the
- cluster and remain as long as
+ Cluster and remain as long as
they are required for recovery, archiving, or replication.
@@ -370,19 +401,14 @@
- In pg_tblspc, there are symbolic links
- that point to directories containing SQL objects
- that exist within a non-default tablespace.
-
-
-
- In the root directory data
+ In the root directory
there are also some files. In many cases, the configuration
- files of the cluster are stored here. If the
- instance is up and running, the file
+ files of the Cluster are stored here. If the
+ Instance is up and running, the file
postmaster.pid exists here
+ (by default)
and contains the process ID (pid) of the
- Postmaster which started the instance.
+ Postmaster which started the Instance.
@@ -413,8 +439,8 @@
PostgreSQL implements a
sophisticated technique which avoids locking:
Multiversion Concurrency Control (MVCC).
- The advantage of MVCC
- over technologies that use row locks becomes evident in multiuser OLTP
+ The advantage of MVCC over technologies that use row locks
+ becomes evident in multiuser Online Transaction Processing (OLTP)
environments with a massive number of concurrent write
actions. There, MVCC generally performs better than solutions
using locks. In a PostgreSQL
@@ -439,15 +465,15 @@
- When we speak about transaction IDs, you need to know that xids are like
- sequences. Every new transaction receives the next number as its ID.
- Therefore, this flow of xids represents the flow of transaction
- start events over time. But keep in mind that xids are independent of
- any time measurement — in milliseconds or otherwise. If you dive
- deeper into PostgreSQL, you will recognize
- parameters with names such as 'xxx_age'. Despite their names,
- these '_age' parameters do not specify a period of time but represent
- a certain number of transactions, e.g., 100 million.
+ Xids form a sequence (with a reserved value to handle wrap-around
+ in pre-9.4 PostgreSQL versions).
+ Age computations involving them measure a transaction
+ count as opposed to a time interval (in milliseconds or otherwise).
+ If you dive deeper into PostgreSQL,
+ you will recognize parameters with names such as 'xxx_age'.
+ Despite their names, these '_age' parameters do not specify
+ a period of time but represent a certain number of transactions,
+ e.g., 100 million.
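The arithmetic behind such an age is simply the distance between two xids, counted in transactions. A minimal illustration (a hypothetical helper with simplified 32-bit wrap-around, not PostgreSQL's internal computation):

```python
# "Age" of a row's xmin is measured in transactions, not time.
# Simplified modulo arithmetic; real xid handling is more involved.

XID_SPACE = 2**32

def xid_age(current_xid, xmin):
    # how many transactions have started since xmin was assigned
    return (current_xid - xmin) % XID_SPACE

print(xid_age(current_xid=100_000_123, xmin=123))  # -> 100000000
```

Note that the result says nothing about elapsed wall-clock time: 100 million transactions may correspond to minutes on a busy system or months on an idle one.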
@@ -498,7 +524,7 @@
executes an UPDATE of this row by
changing the user data from 'x' to
'y'. According to the MVCC principles,
- the data in the old version of the row is not changed!
+ the data in the old version of the row is not changed.
The value 'x' remains as it was before.
Only xmax changes to 135.
Now, this version is treated as valid exclusively for
@@ -526,7 +552,7 @@
Finally, a row may be deleted by a DELETE
command. Even in this case, all versions of the row remain as
- before. Nothing is thrown away! Only xmax
+ before; nothing is thrown away. Only xmax
of the last version is set to the xid of the DELETE
transaction, which indicates that (if committed) it is only visible to
transactions with xids older than that (from
@@ -545,7 +571,8 @@
Over time, also the older ones get out of scope
for ALL transactions and therefore become unnecessary.
Nevertheless, they do exist physically on the disk and occupy
- space.
+ space. They are called dead rows and are part
+ of the bloat.
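The xmin/xmax rules used throughout this section can be modeled in a few lines. This is a deliberately simplified snapshot check that assumes every transaction involved has committed; the real visibility logic also consults commit status and snapshots:

```python
# Simplified MVCC visibility: a row version is visible to a
# transaction if it was created before it (xmin) and not yet
# superseded or deleted from its point of view (xmax).

def visible(version, my_xid):
    xmin, xmax = version["xmin"], version["xmax"]
    return xmin < my_xid and (xmax is None or xmax >= my_xid)

row_versions = [
    {"xmin": 123, "xmax": 135, "data": "x"},  # superseded by UPDATE in xid 135
    {"xmin": 135, "xmax": None, "data": "y"},
]

# A transaction older than the UPDATE still sees 'x' ...
print([v["data"] for v in row_versions if visible(v, 130)])  # -> ['x']
# ... while a newer one sees 'y'.
print([v["data"] for v in row_versions if visible(v, 140)])  # -> ['y']
```

Once no running or future transaction can satisfy the check for an old version, that version is dead: still on disk, but invisible to everyone.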
@@ -560,12 +587,7 @@
row versions are valid (visible) for transactions.
This range doesn't imply any direct temporal meaning;
the sequence of xids reflects only the sequence of
- transaction begin events. As
- xids grow, old row versions get out of scope over time.
- If an old row version is no longer relevant for ANY existing
- transactions, it can be marked dead. The
- space occupied by dead row versions is part of the
- bloat.
+ transaction begin events.
@@ -581,9 +603,9 @@
Nothing is removed — with the consequence that the database
occupies more and more disk space. It is obvious that
- this behavior has to be corrected in some
- way. The next chapter explains how autovacuum
- fulfills this task.
+ this behavior has to be corrected in some way. The next
+ chapter explains how vacuum and
+ autovacuum fulfill this task.
@@ -601,12 +623,12 @@
This chapter explains how the SQL command
VACUUM and the automatically running
Autovacuum processes clean up
- and avoid continued growth.
+ and prevent continued growth.
- Autovacuum runs automatically by
+ Autovacuum runs automatically, by
default. Its default parameters as well as those for
VACUUM are appropriate for most standard
situations. Therefore a novice database manager can
@@ -617,11 +639,11 @@
Client processes can issue the SQL command VACUUM
- at arbitrary points in time. DBAs do this when they recognize
+ at any time. DBAs do this when they recognize
special situations, or they start it in batch jobs which run
- periodically. Autovacuum processes run as part of the
- Instance at the server.
- There is a constantly running Autovacuum daemon. It continuously
+ periodically. Additionally, there is a constantly running
+ Autovacuum daemon which is part of the
+ Instance. It continuously
monitors the state of all databases based on values that are collected by the
Statistics Collector
and starts Autovacuum processes whenever it detects
@@ -1234,7 +1256,7 @@ UPDATE accounts SET balance = balance + 100.00 WHERE name = 'Bob';
Lastly, it is worth noticing that changes done by a
committed transaction will survive all failures in the application or
- database cluster. The next chapter explains the
+ the Database Cluster. The next chapter explains the
durability
guarantees.
@@ -1276,7 +1298,7 @@ UPDATE accounts SET balance = balance + 100.00 WHERE name = 'Bob';
Instance failure
- The instance may suddenly fail because of power off
+ The Instance may suddenly fail because of power off
or other problems. This will affect all running processes, the RAM,
and possibly the consistency of disk files.
@@ -1284,7 +1306,7 @@ UPDATE accounts SET balance = balance + 100.00 WHERE name = 'Bob';
After a restart, PostgreSQL
automatically recognizes that the last shutdown of the
- instance did not happen as expected: files might not be
+ Instance did not happen as expected: files might not be
closed properly and the postmaster.pid
file unexpectedly exists. PostgreSQL
tries to clean up the situation. This is possible because
@@ -1330,7 +1352,7 @@ UPDATE accounts SET balance = balance + 100.00 WHERE name = 'Bob';
They obviously need a backup. How to take such a backup
and use it as a starting point for a recovery of the
- cluster is explained in more detail in the next
+ Cluster is explained in more detail in the next
chapter.
@@ -1396,7 +1418,7 @@ UPDATE accounts SET balance = balance + 100.00 WHERE name = 'Bob';
You can use any appropriate OS tool to create a
copy
- of the cluster's directory structure and files. In
+ of the Cluster's directory structure and files. In
case of severe problems such a copy can serve as
the source of recovery. But in order to get a
USABLE backup by this method,
@@ -1416,7 +1438,7 @@ UPDATE accounts SET balance = balance + 100.00 WHERE name = 'Bob';
The tool pg_dump is able to take a
copy
- of the complete cluster or certain parts of it. It stores
+ of the complete Cluster or certain parts of it. It stores
the copy in the form of SQL commands like CREATE
and COPY. It runs in
parallel with other processes, in its own transaction.
@@ -1430,9 +1452,9 @@ UPDATE accounts SET balance = balance + 100.00 WHERE name = 'Bob';
The main advantage over the other two methods is that it
- can pick parts of the cluster, e.g., a single table or one
+ can pick parts of the Cluster, e.g., a single table or one
database. The other two methods work only at the level of
- the complete cluster.
+ the complete Cluster.
Continuous archiving based on pg_basebackup and WAL files
@@ -1447,7 +1469,7 @@ UPDATE accounts SET balance = balance + 100.00 WHERE name = 'Bob';
basebackup with the tool
pg_basebackup. The result is a
directory structure plus files which contain a
- consistent copy of the original cluster.
+ consistent copy of the original Cluster.
pg_basebackup runs in
parallel with other processes in its own transaction.
@@ -1456,9 +1478,9 @@ UPDATE accounts SET balance = balance + 100.00 WHERE name = 'Bob';
The second step is recommended but not necessary. All
changes to the data are stored in WAL files. If you
continuously save such WAL files, you have the history
- of the cluster. This history can be applied to a
+ of the Cluster. This history can be applied to a
basebackup in order to recreate
- any state of the cluster between the time of
+ any state of the Cluster between the time of
pg_basebackup's start time and
any later point in time. This technique
is called 'Point-in-Time Recovery (PITR)'.
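The PITR procedure described above can be sketched as a toy model: start from a copy of the basebackup and apply archived WAL records up to the desired recovery target. The record tuples and the numeric "time" are illustrative assumptions only:

```python
# Toy Point-in-Time Recovery: restore a basebackup, then replay
# archived WAL records up to (and including) the recovery target.

basebackup = {"accounts": 100}

archived_wal = [
    (1, "accounts", 200),   # (change "time", object, new value)
    (2, "accounts", 300),
    (3, "accounts", 999),   # an unwanted change we recover *before*
]

def restore(basebackup, wal, target_time):
    cluster = dict(basebackup)          # work on a copy of the basebackup
    for t, obj, value in wal:
        if t > target_time:             # stop at the recovery target
            break
        cluster[obj] = value
    return cluster

print(restore(basebackup, archived_wal, target_time=2))  # -> {'accounts': 300}
```

Choosing `target_time=2` recreates the cluster state just before the unwanted change, which is exactly the use case PITR exists for.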
@@ -1478,7 +1500,7 @@ UPDATE accounts SET balance = balance + 100.00 WHERE name = 'Bob';
- If it becomes necessary to restore the cluster, you have to
+ If it becomes necessary to restore the Cluster, you have to
copy the basebackup and the archived WAL files to
their original directories. The configuration of this
recovery procedure