Re: POC: Cleaning up orphaned files using undo logs

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2019-09-16 05:52:25
Message-ID: CA+hUKG+MgYoOn55gG9ppw8=vvEazGWQLtZ2kkPBZ9gN22ki1wQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Sep 16, 2019 at 5:27 AM Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> wrote:
> While testing zheap over undo apis, we've found the following
> issues/scenarios that might need some fixes/discussions:

Thanks!

> 1. In UndoLogAllocateInRecovery, when we find the current log number
> from the list of registered blocks, we don't check whether the
> block->in_use flag is true or not. In XLogResetInsertion, we just
> reset in_use flag without reseting the blocks[]->rnode information.
> So, if we don't check the in_use flag, it's possible that we'll
> consult some block information from the previous WAL record. IMHO,
> just adding an in_use check in UndoLogAllocateInRecovery will solve
> the problem.

Agreed. I added a line to break out of that loop if !block->in_use.

BTW I am planning to simplify that code considerably, based on a plan
to introduce a new rule: there can be only one undo record and
therefore only one undo allocation per WAL record.

> 2. A transaction, inserts one undo record and generated a WAL record
> for the same, say at WAL location 0/2000A000. Next, the undo record
> gets discarded and WAL is generated to update the meta.discard pointer
> at location 0/2000B000 At the same time, an ongoing checkpoint with
> checkpoint.redo at 0/20000000 flushes the latest meta.discard pointer.
> Now, the system crashes.
> Now, the recovery starts from the location 0/20000000. When the
> recovery of 0/2000A000 happens, it sees the undo record that it's
> about to insert, is already discarded as per meta.discard (flushed by
> checkpoint). In this case, should we just skip inserting the undo
> record?

I see two options:

1. We make it so that if you're allocating in recovery and discard >
insert, we'll just set discard = insert so you can proceed. The code
in undofile_get_segment_file() already copes with missing files during
recovery.

2. We skip the insert as you said.

I think option 1 is probably best, otherwise you have to cope with
failure to insert by skipping, as you said.

> 3. Currently, we create a backup image of the unlogged part of the
> undo log's metadata only when some backend allocates some space from
> the undo log (in UndoLogAllocate). This helps us restore the unlogged
> meta part after a checkpoint.
> When we perform an undo action, we also update the undo action
> progress and emit an WAL record. The same operation can performed by
> the undo worker which doesn't allocate any space from the undo log.
> So, if an undo worker emits an WAL record to update undo action
> progress after a checkpoint, it'll not be able to WAL log the backup
> image of the meta unlogged part. IMHO, this breaks the recovery logic
> of unlogged part of undo meta.

I thought that was OK because those undo data updates don't depend on
the insert pointer. But I see what you mean: the next modification of
the page that DOES depend on the insert pointer might not log the
meta-data if it's not the first WAL record to touch it after a
checkpoint. Rats. I'll have to think about that some more.

--
Thomas Munro
https://enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message nilsocket 2019-09-16 05:53:33 Re:
Previous Message nilsocket 2019-09-16 05:41:31