Re: POC: Cleaning up orphaned files using undo logs

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2018-11-01 14:43:58
Message-ID: CA+TgmoZ37JOrh25NG1+Kz6kLajMdxj+v2gvjfXN+g6cQFgqp5g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

While I've been involved in the design discussions for this patch set,
I haven't looked at any of the code personally in a very long time. I
certainly don't claim to be an independent reviewer, and I encourage
others to review this work also. That said, here are some review
comments.

I decided to start with 0005, as that has the user-facing
documentation for this feature. There is a spurious whitespace-only
hunk in monitoring.sgml.

+ <entry>Process ID of the backend currently attached to this undo log
+ for writing.</entry>

or NULL/0/something if none?

+ each undo log that exists. Undo logs are extents within a contiguous
+ addressing space that have their own head and tail pointers.

This sentence seems to me to have so little detail that it's not going
to help anyone, and it also seems somewhat out-of-place here. I think
it would be better to link to the longer explanation in the new
storage section instead.

+ Each backend that has written undo data is associated with one or more undo

extra space

+<para>
+Undo logs hold data that is used for rolling back and for implementing
+MVCC in access managers that are undo-aware (currently "zheap"). The storage
+format of undo logs is optimized for reusing existing files.
+</para>

I think the mention of zheap should be removed here since the hope is
that the undo stuff can be committed independently of and prior to
zheap.

I think you mean access methods, not access managers. I suggest
making that an xref.

Maybe add a little more detail, e.g.

Undo logs provide a place for access methods to store data that can be
used to perform necessary cleanup operations after a transaction
abort. The data will be retained after a transaction abort until the
access method successfully performs the required cleanup operations.
After a transaction commit, undo data will be retained until the
transaction is all-visible. This makes it possible for access
managers to use undo data to implement MVCC. Since it most cases undo
data is discarded very quickly, the undo system has been optimized to
minimize writes to disk and to reuse existing files efficiently.

+<para>
+Undo data exists in a 64 bit address space broken up into numbered undo logs
+that represent 1TB extents, for efficient management. The space is further
+broken up into 1MB segment files, for physical storage. The name of each file
+is the address of of the first byte in the file, with a period inserted after
+the part that indicates the undo log number.
+</para>

I cannot read this section and know what an undo filename is going to
look like. Also, the remarks about efficient management seems like it
might be unclear to someone not already familiar with how this works.
Maybe something like:

Undo data exists in a 64-bit address space divided into 2^34 undo
logs, each with a theoretical capacity of 1TB. The first time a
backend writes undo, it attaches to an existing undo log whose
capacity is not yet exhausted and which is not currently being used by
any other backend; or if no suitable undo log already exists, it
creates a new one. To avoid wasting space, each undo log is further
divided into 1MB segment files, so that segments which are no longer
needed can be removed (possibly recycling the underlying file by
renaming it) and segments which are not yet needed do not need to be
physically created on disk. An undo segment file has a name like
<example>, where <thing> is the undo log number and <thang> is the
segment number.

I think it's good to spell out the part about attaching to undo logs
here, because when people look at pg_undo, the number of files will be
roughly proportional to the number of backends, and we should try to
help them understand - at least in general terms - why that happens.

+<para>
+Just as relations can have one of the three persistence levels permanent,
+unlogged or temporary, the undo data that is generated by modifying them must
+be stored in an undo log of the same persistence level. This enables the
+undo data to be discarded at appropriate times along with the relations that
+reference it.
+</para>

This is not quite general, because we're not necessarily talking about
modifications to the files. In fact, in this POC, we're explicitly
talking about the cleanup of the files themselves. Also, it's not
technically correct to say that the persistence level has to match.
You could put everything in permanent undo logs. It would just suck.

Moving on to 0003, the developer documentation:

+The undo log subsystem provides a way to store data that is needed for
+a limited time. Undo data is generated whenever zheap relations are
+modified, but it is only useful until (1) the generating transaction
+is committed or rolled back and (2) there is no snapshot that might
+need it for MVCC purposes. See src/backend/access/zheap/README for
+more information on zheap. The undo log subsystem is concerned with

Again, I think this should be rewritten to make it independent of
zheap. We hope that this facility is not only usable by but will
actually be used by other AMs.

+their location within a 64 bit address space. Unlike redo data, the
+addressing space is internally divided up unto multiple numbered logs.

Except it's not totally unlike; cf. the log and seg arguments to
XLogFileNameById. The xlog division is largely a historical accident
of having to support systems with 32-bit arithmetic and has minimal
consequences in practice, and it's a lot less noticeable now than it
used to be, but it does still kinda exist. I would try to sharpen
this wording a bit to de-emphasize the contrast over whether a log/seg
distinction exists and instead just contrast multiple insertion points
vs. a single one.

+level code (zheap) is largely oblivious to this internal structure and

Another zheap reference.

+eviction provoked by memory pressure, then no disk IO is generated.

I/O?

+Keeping the undo data physically separate from redo data and accessing
+it though the existing shared buffers mechanism allows it to be
+accessed efficiently for MVCC purposes.

And also non-MVCC purposes. I mean, it's not very feasible to do
post-abort cleanup driven solely off the WAL, because the WAL segments
might've been archived or recycled and there's no easy way to access
the bits we want. Saying this is for MVCC purposes specifically seems
misleading.

+shared memory and can be inspected in the pg_stat_undo_logs view. For

Replace "in" with "via" or "through" or something?

+shared memory and can be inspected in the pg_stat_undo_logs view. For
+each undo log, a set of properties called the undo log's meta-data are
+tracked:

"called the undo log's meta-data" seems a bit awkward.

+* the "discard" pointer; data before this point has been discarded
+* the "insert" pointer: new data will be written here
+* the "end" pointer: a new undo segment file will be needed at this point

why ; for the first and : for the others?

+The three pointers discard, insert and end move strictly forwards
+until the whole undo log has been exhausted. At all times discard <=
+insert <= end. When discard == insert, the undo log is empty

I think you should either remove "discard, insert and end" from this
sentence, relying on people to remember the list they just read, or
else punctuate it like this: The three pointers -- discard, insert,
and end -- move...

+logs are held in a fixed-sized pool in shared memory. The size of
+the array is a multiple of max_connections, and limits the total size of
+transactions.

I think you should elaborate on "limits the total size of transactions."

+The meta-data for all undo logs is written to disk at every
+checkpoint. It is stored in files under PGDATA/pg_undo/, using the

Even unlogged and temporary undo logs?

+level of the relation being modified and the current value of the GUC

Suggest: the corresponding relation

+suitable undo log must be either found or created. The system should
+stabilize on one undo log per active writing backend (or more if
+different tablespaces are persistence levels are used).

Won't edge effects drive the number up considerably?

+and they cannot be accessed by other backend including undo workers.

Grammar. Also, begs the question "so how does this work if the undo
workers are frozen out?"

+Responsibility for WAL-logging the contents of the undo log lies with
+client code (ie zheap). While undolog.c WAL-logs all meta-data

Another zheap reference.

+hard coded to use md.c unconditionally, PostgreSQL 12 routes IO for the undo

Suggest I/O rather than IO.

I'll see if I can find time to actually review some of this code at
some point. Regarding 0006, I can't help but notice that it is
completely devoid of documentation and README updates, which will not
do. Regarding 0007, that's an impressively small patch.

...Robert

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2018-11-01 14:53:03 Re: PG vs macOS Mojave
Previous Message Karsten Hilbert 2018-11-01 14:42:57 Re: backend crash on DELETE, reproducible locally