Re: Undo logs

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: dilip(dot)kumar(at)enterprisedb(dot)com, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Undo logs
Date: 2018-10-17 09:57:54
Message-ID: CAA4eK1LDctrYeZ8ev1N1v-8KwiigAmNMx=t-UTs9qgEFt+P0XQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Oct 15, 2018 at 6:09 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Sun, Sep 2, 2018 at 12:19 AM Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> > I have also pushed a new WIP version of the lower level undo log
> > storage layer patch set to a public branch[1]. I'll leave the earlier
> > branch[2] there because the record-level patch posted by Dilip depends
> > on it for now.
> >
>
> I have started reading the patch and have a few assorted comments
> which are mentioned below. I have been involved in the high-level
> design of this module and I have also shared given some suggestions
> during development, but this is mainly Thomas's work with some help
> from Dilip. It would be good if other members of the community also
> review the design or participate in the discussion.
>
> Comments
> ------------------
>

Some more comments/questions on the design level choices you have made
in this patch and some general comments.

1. To allocate an undo log (UndoLogAllocate()), it seems first we are
creating the shared memory state for an undo log, write a WAL for it,
create an actual file and segment in it and write a separate WAL for
it. Now imagine the system crashed after creating a shared memory
state and before actually allocating an undo log segment, then it is
quite possible that after recovery we will block multiple slots for
undo logs without having actual undo logs for them. Apart from that
writing separate WAL for them doesn't appear to be the best way to
deal with it considering that we also need to write a third WAL to
attach an undo log.

Now, IIUC, one advantage of arranging the things this way is that we
avoid dropping the tablespaces when a particular undo log exists in
it. I understand that this design kind of works, but I think we
should try to think of some alternatives here. You might have already
thought of making it work similar to how the interaction for regular
tables or temp_tablespaces works with dropping the tablespaces but
decided to do something different here. Can you explain why you have
made a different design choice here?

2.
extend_undo_log()
{
..
+ /*
+ * Flush the parent dir so that the directory metadata survives a crash
+ * after this point.
+
*/
+ UndoLogDirectory(log->meta.tablespace, dir);
+ fsync_fname(dir, true);
+
+ /*
+ * If we're not in
recovery, we need to WAL-log the creation of the new
+ * file(s). We do that after the above filesystem
modifications, in
+ * violation of the data-before-WAL rule as exempted by
+ *
src/backend/access/transam/README. This means that it's possible for
+ * us to crash having made some or all of the
filesystem changes but
+ * before WAL logging, but in that case we'll eventually try to create the
+ * same
segment(s) again, which is tolerated.
+ */
+ if (!InRecovery)
+ {
+ xl_undolog_extend xlrec;
+
XLogRecPtr ptr;
..
}

I don't understand this WAL logging action. If the crash happens
before or during syncing the file, then we anyway don't have WAL to
replay. If it happens after WAL writing, then anyway we are sure that
the extended undo log segment must be there. Can you explain how this
works?

3.
+static void
+allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace,
+
UndoLogOffset end)
{
..
}

What will happen if the transaction creating undo log segment rolls
back? Do we want to have pendingDeletes stuff as we have for normal
relation files? This might also help in clearing the shared memory
state (undo log slots) if any.

4.
+choose_undo_tablespace(bool force_detach, Oid *tablespace)
{
..
/*
+ * Take the tablespace create/drop lock while we look the name up.
+ * This prevents the
tablespace from being dropped while we're trying
+ * to resolve the name, or while the called is trying
to create an
+ * undo log in it. The caller will have to release this lock.
+ */
+
LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);
..

This appears quite expensive, for selecting an undo log to attach, we
might need to wait for an unrelated tablespace create/drop. Have you
considered any other ideas to prevent this? How other callers of
get_tablespace_oid prevent it from being dropped? If we don't find
any better solution, then I think at the very least we should start a
separate thread to know the opinion of others on this matter. I think
this is somewhat related to point-1.

5.
+static inline Oid
+UndoRecPtrGetTablespace(UndoRecPtr urp)
+{
+ UndoLogNumber logno = UndoRecPtrGetLogNo
(urp);
+ UndoLogTableEntry *entry;
+
+ /*
+ * Fast path, for undo logs we've seen before. This is safe because
+
* tablespaces are constant for the lifetime of an undo log number.
+ */
+ entry = undologtable_lookup
(undologtable_cache, logno);
+ if (likely(entry))
+ return entry->tablespace;
+
+ /*
+ * Slow path:
force cache entry to be created. Raises an error if the
+ * undo log has been entirely discarded, or hasn't
been created yet. That
+ * is appropriate here, because this interface is designed for accessing
+ *
undo pages via bufmgr, and we should never be trying to access undo
+ * pages that have been discarded.
+ */
+
UndoLogGet(logno, false);

It seems UndoLogGet() probes hash table first, so what is the need for
doing it in the caller and if you think it is better to perform in the
caller, then maybe we should avoid doing it inside
UndoLogGet()->get_undo_log()->undologtable_lookup().

6.
+get_undo_log(UndoLogNumber logno, bool locked)
{
..
+ /*
+ * If we didn't find it, then it must already have been entirely
+ *
discarded. We create a negative cache entry so that we can answer
+ * this question quickly next time.
+
*
+ * TODO: We could track the lowest known undo log number, to reduce
+ * the
negative cache entry bloat.
+ */
+ if (result == NULL)
+ {
+ /*
+
* Sanity check: the caller should not be asking about undo logs
+ * that have
never existed.
+ */
+ if (logno >= shared->next_logno)
+
elog(ERROR, "undo log %u hasn't been created yet", logno);
+ entry = undologtable_insert
(undologtable_cache, logno, &found);
+ entry->number = logno;
+ entry->control =
NULL;
+ entry->tablespace = 0;
+ }
..
}

Are you planning to take care of this TODO? In any case, do we have
any mechanism to clear this bloat or will it stay till the end of the
session? If it is later, then I think it is important to take care of
TODO.

7.
+void UndoLogNewSegment(UndoLogNumber logno, Oid tablespace, int segno);
+/* Redo interface. */
+extern void
undolog_redo(XLogReaderState *record);

You might want to add an extra line before /* Redo interface. */
following what has been done earlier in this file.

8.
+ * XXX For now an xl_undolog_meta object is filled in, in case it turns out
+ * to be necessary to write it into the
WAL record (like FPI, this must be
+ * logged once for each undo log after each checkpoint). I think this should
+ *
be moved out of this interface and done differently -- to review.
+ */
+UndoRecPtr
+UndoLogAllocate(size_t size,
UndoPersistence persistence)

This function doesn't appear to be filling xl_undolog_meta. Am I
missing something, if not, then this comments needs to be changed?

9.
+static bool
+choose_undo_tablespace(bool force_detach, Oid *tablespace)
+{
+ char *rawname;
+ List
*namelist;
+ bool need_to_unlock;
+ int length;
+ int i;
+
+ /* We need a
modifiable copy of string. */
+ rawname = pstrdup(undo_tablespaces);

I don't see the usage of rawname outside this function, isn't it
better to free it? I understand that this function won't be called
frequently enough to matter, but still there is some theoretical
danger if a user continuously changes undo_tablespaces.

10.
+attach_undo_log(UndoPersistence persistence, Oid tablespace)
{
..
+ /*
+ * For now we have a simple linked list of unattached undo logs for each
+ * persistence level.
We'll grovel though it to find something for the

Typo.
/though/through

11.
+attach_undo_log(UndoPersistence persistence, Oid tablespace)
{
..
+ /* WAL-log the creation of this new undo log. */
+ {
+
xl_undolog_create xlrec;
+
+ xlrec.logno = logno;
+ xlrec.tablespace = log-
>meta.tablespace;
+ xlrec.persistence = log->meta.persistence;
+
+
XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_CREATE);
+ }
..
}

Do we need to WAL log this for temporary/unlogged persistence level?

12.
+choose_undo_tablespace(bool force_detach, Oid *tablespace)
{
..
+ oid = get_tablespace_oid(name, true);
..

Do we need to check permissions to see if the current user is allowed
to create in this tablespace?

13.
+UndoLogAllocate(size_t size, UndoPersistence persistence)
{
..
+ log->meta.prevlogno = prevlogno;

Is it okay to update meta information without lock or we should do it
few lines down after taking mutex lock? If it is okay, then it is
better to write a comment for the same?

14.
+static void
+allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace,
+
UndoLogOffset end)
{
..
+ /* Flush the contents of the file to disk. */
+ if (pg_fsync(fd) != 0)
+ elog(ERROR, "cannot fsync
file \"%s\": %m", path);
..
}

You might want to have a wait event for this as we do have at other
places where we perform fsync.

15.
+allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace,
+
UndoLogOffset end)
{
..
+ if (!InRecovery)
+ {
+ xl_undolog_extend xlrec;
+ XLogRecPtr ptr;
+
+
xlrec.logno = logno;
+ xlrec.end = end;
+
+ XLogBeginInsert();
+ XLogRegisterData
((char *) &xlrec, sizeof(xlrec));
+ ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_EXTEND);
+
XLogFlush(ptr);
+ }
..
}

Do we need it for temporary/unlogged persistence level?

16.
+static void
+undolog_xlog_create(XLogReaderState *record)
+{
+ xl_undolog_create *xlrec = (xl_undolog_create *)
XLogRecGetData(record);
+ UndoLogControl *log;
+ UndoLogSharedData *shared = MyUndoLogState.shared;
+
+ /*
Create meta-data space in shared memory. */
+ LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+ /* TODO: assert that
it doesn't exist already? */
+ log = allocate_undo_log();
+ LWLockAcquire(&log->mutex, LW_EXCLUSIVE);

Do we need to acquire UndoLogLock during replay? What else can be
going in concurrent to this which can create a problem?

17.
UndoLogAllocate()
{
..
+ /*
+ * While we have the lock, check if we have been forcibly detached by
+ *
DROP TABLESPACE. That can only happen between transactions (see
+ * DropUndoLogsInsTablespace()).
+
*/
..
}

Function name in above comment is wrong.

18.
+ {
+ {"undo_tablespaces", PGC_USERSET, CLIENT_CONN_STATEMENT,
+ gettext_noop("Sets the tablespace(s) to use for undo logs."),
+ NULL,
+ GUC_LIST_INPUT | GUC_LIST_QUOTE
+ },
+ &undo_tablespaces,
+ "",
+ check_undo_tablespaces, assign_undo_tablespaces, NULL
+ },

It seems you need to update variable_is_guc_list_quote for this variable.

Till now, I have mainly reviewed undo log allocation part. This is a
big patch and can take much more time to complete the review. I will
review the other parts of the patch later. I have changed the status
of this CF entry as "Waiting on Author", feel free to change it once
you think all the comments are addressed.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Hubert Zhang 2018-10-17 10:19:06 Is there any way to request unique lwlock inside a background worker in PG9.4?
Previous Message Victor Wagner 2018-10-17 08:38:46 Perl 5.26 and windows build system