Re: POC: Cleaning up orphaned files using undo logs

From: Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2019-09-16 15:08:58
Message-ID: CAGz5QCJ2LHXPMdkVx=GHanWcU7BeKhMjxzvhmeU7D15UGtYxaw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello Thomas,

On Mon, Sep 16, 2019 at 11:23 AM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>
> > 1. In UndoLogAllocateInRecovery, when we find the current log number
> > from the list of registered blocks, we don't check whether the
> > block->in_use flag is true or not. In XLogResetInsertion, we just
> > reset in_use flag without reseting the blocks[]->rnode information.
> > So, if we don't check the in_use flag, it's possible that we'll
> > consult some block information from the previous WAL record. IMHO,
> > just adding an in_use check in UndoLogAllocateInRecovery will solve
> > the problem.
>
> Agreed. I added a line to break out of that loop if !block->in_use.
>
I think we should skip the block if !block->in_use. Because, the undo
buffer can be registered in a subsequent block as well. For different
operations, we can use different block_id to register the undo buffer
in the redo record.

> BTW I am planning to simplify that code considerably, based on a plan
> to introduce a new rule: there can be only one undo record and
> therefore only one undo allocation per WAL record.
>
Okay. In that case, we need to rethink the cases for multi-inserts and
non-inlace updates both of which currently inserts multiple undo
record corresponding to a single WAL record. For multi-inserts, it can
be solved easily by moving all the offset information in the payload.
But, for non-inlace updates, we insert one undo record for the update
and one for the insert. Wondering whether we've to insert two WAL
records - one for update and one for the new insert.

> > 2. A transaction, inserts one undo record and generated a WAL record
> > for the same, say at WAL location 0/2000A000. Next, the undo record
> > gets discarded and WAL is generated to update the meta.discard pointer
> > at location 0/2000B000 At the same time, an ongoing checkpoint with
> > checkpoint.redo at 0/20000000 flushes the latest meta.discard pointer.
> > Now, the system crashes.
> > Now, the recovery starts from the location 0/20000000. When the
> > recovery of 0/2000A000 happens, it sees the undo record that it's
> > about to insert, is already discarded as per meta.discard (flushed by
> > checkpoint). In this case, should we just skip inserting the undo
> > record?
>
> I see two options:
>
> 1. We make it so that if you're allocating in recovery and discard >
> insert, we'll just set discard = insert so you can proceed. The code
> in undofile_get_segment_file() already copes with missing files during
> recovery.
>
Interesting. This should work.

>
> > 3. Currently, we create a backup image of the unlogged part of the
> > undo log's metadata only when some backend allocates some space from
> > the undo log (in UndoLogAllocate). This helps us restore the unlogged
> > meta part after a checkpoint.
> > When we perform an undo action, we also update the undo action
> > progress and emit an WAL record. The same operation can performed by
> > the undo worker which doesn't allocate any space from the undo log.
> > So, if an undo worker emits an WAL record to update undo action
> > progress after a checkpoint, it'll not be able to WAL log the backup
> > image of the meta unlogged part. IMHO, this breaks the recovery logic
> > of unlogged part of undo meta.
>
> I thought that was OK because those undo data updates don't depend on
> the insert pointer. But I see what you mean: the next modification of
> the page that DOES depend on the insert pointer might not log the
> meta-data if it's not the first WAL record to touch it after a
> checkpoint. Rats. I'll have to think about that some more.
Cool.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Janes 2019-09-16 15:12:04 Re: Default JIT setting in V12
Previous Message Jonathan S. Katz 2019-09-16 14:55:17 Re: Define jsonpath functions as stable