Re: POC: Cleaning up orphaned files using undo logs

From: Andres Freund <andres(at)anarazel(dot)de>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2019-07-29 08:48:30
Message-ID: 20190729084830.l4yliidwrkqywp42@alap3.anarazel.de
Lists: pgsql-hackers

Hi

On 2019-06-26 01:29:57 +0530, Amit Kapila wrote:
> From 67845a7afa675e973bd0ea9481072effa1eb219d Mon Sep 17 00:00:00 2001
> From: Dilip Kumar <dilipkumar(at)localhost(dot)localdomain>
> Date: Wed, 24 Apr 2019 14:36:28 +0530
> Subject: [PATCH 05/14] Add prefetch support for the undo log
>
> Add prefetching function for undo smgr and also provide mechanism
> to prefetch without relcache.

> +#ifdef USE_PREFETCH
> /*
> - * PrefetchBuffer -- initiate asynchronous read of a block of a relation
> + * PrefetchBufferGuts -- Guts of prefetching a buffer.

> * No-op if prefetching isn't compiled in.

This isn't true for this function, as you've defined it?

>
> diff --git a/src/backend/storage/smgr/undofile.c b/src/backend/storage/smgr/undofile.c
> index 2aa4952..14ccc52 100644
> --- a/src/backend/storage/smgr/undofile.c
> +++ b/src/backend/storage/smgr/undofile.c
> @@ -117,7 +117,18 @@ undofile_extend(SMgrRelation reln, ForkNumber forknum,
> void
> undofile_prefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
> {
> - elog(ERROR, "undofile_prefetch is not supported");
> +#ifdef USE_PREFETCH
> + File file;
> + off_t seekpos;
> +
> + Assert(forknum == MAIN_FORKNUM);
> + file = undofile_get_segment_file(reln, blocknum / UNDOSEG_SIZE);
> + seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) UNDOSEG_SIZE));
> +
> + Assert(seekpos < (off_t) BLCKSZ * UNDOSEG_SIZE);
> +
> + (void) FilePrefetch(file, seekpos, BLCKSZ, WAIT_EVENT_UNDO_FILE_PREFETCH);
> +#endif /* USE_PREFETCH */
> }

This looks like it should be part of the commit that introduces
undofile_prefetch(), rather than separately? Afaics there's no reason to
have it in this commit.

> From 7206c40e4cee3391c537cdb22c854889bb417d0e Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
> Date: Wed, 6 Mar 2019 16:46:04 +1300
> Subject: [PATCH 03/14] Add undo log manager.

> +/*
> + * If the caller doesn't know the the block_id, but does know the RelFileNode,
> + * forknum and block number, then we try to find it.
> + */
> +XLogRedoAction
> +XLogReadBufferForRedoBlock(XLogReaderState *record,
> + SmgrId smgrid,
> + RelFileNode rnode,
> + ForkNumber forknum,
> + BlockNumber blockno,
> + ReadBufferMode mode,
> + bool get_cleanup_lock,
> + Buffer *buf)

I find that a somewhat odd function comment, and the function name doesn't
tell me much either. A buffer is always block sized, and you pass in a block
number.

> @@ -347,7 +409,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
> * Make sure that if the block is marked with WILL_INIT, the caller is
> * going to initialize it. And vice versa.
> */
> - zeromode = (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
> + zeromode = (mode == RBM_ZERO || mode == RBM_ZERO_AND_LOCK ||
> + mode == RBM_ZERO_AND_CLEANUP_LOCK);
> willinit = (record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
> if (willinit && !zeromode)
> elog(PANIC, "block with WILL_INIT flag in WAL record must be zeroed by redo routine");
> @@ -463,7 +526,7 @@ XLogReadBufferExtended(SmgrId smgrid, RelFileNode rnode, ForkNumber forknum,
> {
> /* page exists in file */
> buffer = ReadBufferWithoutRelcache(smgrid, rnode, forknum, blkno,
> - mode, NULL);
> + mode, NULL, RELPERSISTENCE_PERMANENT);
> }
> else
> {
> @@ -488,7 +551,8 @@ XLogReadBufferExtended(SmgrId smgrid, RelFileNode rnode, ForkNumber forknum,
> ReleaseBuffer(buffer);
> }
> buffer = ReadBufferWithoutRelcache(smgrid, rnode, forknum,
> - P_NEW, mode, NULL);
> + P_NEW, mode, NULL,
> + RELPERSISTENCE_PERMANENT);
> }
> while (BufferGetBlockNumber(buffer) < blkno);
> /* Handle the corner case that P_NEW returns non-consecutive pages */
> @@ -498,7 +562,8 @@ XLogReadBufferExtended(SmgrId smgrid, RelFileNode rnode, ForkNumber forknum,
> LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
> ReleaseBuffer(buffer);
> buffer = ReadBufferWithoutRelcache(smgrid, rnode, forknum, blkno,
> - mode, NULL);
> + mode, NULL,
> + RELPERSISTENCE_PERMANENT);
> }
> }

Not this patch's fault, but it strikes me as a bad idea to just hardcode
RELPERSISTENCE_PERMANENT. It can totally make sense to WAL-log some
records for an unlogged table, e.g. to create the init fork.

> +/*
> + * Main control structure for undo log management in shared memory.
> + * UndoLogSlot objects are arranged in a fixed-size array, with no particular
> + * ordering.
> + */
> +typedef struct UndoLogSharedData
> +{
> + UndoLogNumber free_lists[UndoPersistenceLevels];
> + UndoLogNumber low_logno;
> + UndoLogNumber next_logno;
> + UndoLogNumber nslots;
> + UndoLogSlot slots[FLEXIBLE_ARRAY_MEMBER];
> +} UndoLogSharedData;

Would be good to document at least low_logno - it's not obvious to me what
that means from the name alone. Also, some higher-level comments about what
the shared memory layout is wouldn't hurt.

> +/*
> + * How many undo logs can be active at a time? This creates a theoretical
> + * maximum amount of undo data that can exist, but if we set it to a multiple
> + * of the maximum number of backends it will be a very high limit.
> + * Alternative designs involving demand paging or dynamic shared memory could
> + * remove this limit but would be complicated.
> + */
> +static inline size_t
> +UndoLogNumSlots(void)
> +{
> + return MaxBackends * 4;
> +}

I'd put this factor in a macro (or a named integer constant). It's a) nice
to have all such numbers defined in one place and b) it makes it easier to
understand where the four comes from.
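
Roughly like this - just a sketch, the constant name is made up:

#define UNDO_LOG_SLOTS_PER_BACKEND 4	/* made-up name, pick whatever fits */

static inline size_t
UndoLogNumSlots(void)
{
	return MaxBackends * UNDO_LOG_SLOTS_PER_BACKEND;
}

That also gives the comment about the theoretical maximum a natural home
next to the constant.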

> +/*
> + * Initialize the undo log subsystem. Called in each backend.
> + */
> +void
> +UndoLogShmemInit(void)
> +{
> + bool found;
> +
> + UndoLogShared = (UndoLogSharedData *)
> + ShmemInitStruct("UndoLogShared", UndoLogShmemSize(), &found);
> +
> + /* The postmaster initialized the shared memory state. */
> + if (!IsUnderPostmaster)
> + {
> + int i;
> +
> + Assert(!found);

I don't quite understand putting this under IsUnderPostmaster, rather
than found (and then potentially having an IsUnderPostmaster assert). I
know that a few other places do it this way too.
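
What I'd expect is roughly the following (sketch only, initialization
elided):

	UndoLogShared = (UndoLogSharedData *)
		ShmemInitStruct("UndoLogShared", UndoLogShmemSize(), &found);

	if (!found)
	{
		/* creating the struct, which should only happen before forking */
		Assert(!IsUnderPostmaster);

		/* ... initialize UndoLogShared here ... */
	}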

> +/*
> + * Iterate through the set of currently active logs. Pass in NULL to get the
> + * first undo log.

Not a fan of APIs like this. Harder to understand at callsites.

> NULL indicates the end of the set of logs.

+ "A return value of"? Right now this sounds a bit like it's referencing
the NULL argument.

> The caller
> + * must lock the returned log before accessing its members, and must skip if
> + * logno is not valid.
> + */
> +UndoLogSlot *
> +UndoLogNextSlot(UndoLogSlot *slot)
> +{

> +/*
> + * Create a new empty segment file on disk for the byte starting at 'end'.
> + */
> +static void
> +allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace,
> + UndoLogOffset end)
> +{
> + struct stat stat_buffer;
> + off_t size;
> + char path[MAXPGPATH];
> + void *zeroes;
> + size_t nzeroes = 8192;
> + int fd;
> +
> + UndoLogSegmentPath(logno, end / UndoLogSegmentSize, tablespace, path);
> +
> + /*
> + * Create and fully allocate a new file. If we crashed and recovered
> + * then the file might already exist, so use flags that tolerate that.
> + * It's also possible that it exists but is too short, in which case
> + * we'll write the rest. We don't really care what's in the file, we
> + * just want to make sure that the filesystem has allocated physical
> + * blocks for it, so that non-COW filesystems will report ENOSPC now
> + * rather than later when the space is needed and we'll avoid creating
> + * files with holes.
> + */
> + fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY);

As I said somewhere nearby, I think it might make sense to optionally
first fallocate, and then zero.

Is there a good reason to not just use O_TRUNC here, and enter the
zeroing path without a stat? We could potentially end up with holes this
way, I think (if the writes didn't make it to disk, but the metadata
operation did). Also seems better to just start from consistently
zeroed-out blocks, rather than sometimes having old data in there.
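
I.e. roughly the following - a sketch only, error handling abbreviated, and
posix_fallocate() would need the usual configure/portability treatment:

	fd = OpenTransientFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
	if (fd < 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not create file \"%s\": %m", path)));

	/* optionally reserve the space cheaply first, where available */
	/* (void) posix_fallocate(fd, 0, UndoLogSegmentSize); */

	/* then always zero the whole segment; no stat() needed */
	zeroes = palloc0(nzeroes);
	for (size = 0; size < UndoLogSegmentSize; size += nzeroes)
	{
		if ((size_t) write(fd, zeroes, nzeroes) != nzeroes)
			ereport(ERROR,
					(errcode_for_file_access(),
					 errmsg("could not write to file \"%s\": %m", path)));
	}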

> + /*
> + * If we're not in recovery, we need to WAL-log the creation of the new
> + * file(s). We do that after the above filesystem modifications, in
> + * violation of the data-before-WAL rule as exempted by
> + * src/backend/access/transam/README. This means that it's possible for
> + * us to crash having made some or all of the filesystem changes but
> + * before WAL logging, but in that case we'll eventually try to create the
> + * same segment(s) again, which is tolerated.
> + */

Perhaps explain *why* the rule is violated here?

> +/*
> + * Advance the insertion pointer in this context by 'size' usable (non-header)
> + * bytes. This is the next place we'll try to allocate a record, if it fits.
> + * This is not committed to shared memory until after we've WAL-logged the
> + * record and UndoLogAdvanceFinal() is called.
> + */
> +void
> +UndoLogAdvance(UndoLogAllocContext *context, size_t size)
> +{
> + context->try_location = UndoLogOffsetPlusUsableBytes(context->try_location,
> + size);
> +}
> +
> +/*
> + * Advance the insertion pointer to 'size' usable (non-header) bytes past
> + * insertion_point.
> + */
> +void
> +UndoLogAdvanceFinal(UndoRecPtr insertion_point, size_t size)

I think this comment should explain how it differs from UndoLogAdvance().

> + /*
> + * We acquire UndoLogLock to prevent any undo logs from being created or
> + * discarded while we build a snapshot of them. This isn't expected to
> + * take long on a healthy system because the number of active logs should
> + * be around the number of backends. Holding this lock won't prevent
> + * concurrent access to the undo log, except when segments need to be
> + * added or removed.
> + */
> + LWLockAcquire(UndoLogLock, LW_SHARED);

s/the undo log/undo logs/?

> + /* Dump into a file under pg_undo. */
> + snprintf(path, MAXPGPATH, "pg_undo/%016" INT64_MODIFIER "X",
> + checkPointRedo);
> + pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_WRITE);
> + fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY);
> + if (fd < 0)
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not create file \"%s\": %m", path)));
> +
> + /* Compute header checksum. */
> + INIT_CRC32C(crc);
> + COMP_CRC32C(crc, &UndoLogShared->low_logno, sizeof(UndoLogShared->low_logno));
> + COMP_CRC32C(crc, &UndoLogShared->next_logno, sizeof(UndoLogShared->next_logno));
> + COMP_CRC32C(crc, &num_logs, sizeof(num_logs));
> + FIN_CRC32C(crc);
> +
> + /* Write out the number of active logs + crc. */
> + if ((write(fd, &UndoLogShared->low_logno, sizeof(UndoLogShared->low_logno)) != sizeof(UndoLogShared->low_logno)) ||
> + (write(fd, &UndoLogShared->next_logno, sizeof(UndoLogShared->next_logno)) != sizeof(UndoLogShared->next_logno)) ||
> + (write(fd, &num_logs, sizeof(num_logs)) != sizeof(num_logs)) ||
> + (write(fd, &crc, sizeof(crc)) != sizeof(crc)))
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not write to file \"%s\": %m", path)));

I'd prefix it with some magic value. It provides a way to do version
bumps if really necessary (or just provide an explicit version), makes
it easier to distinguish proper checksum failures from zeroed out files,
and helps identify the files after FS corruption.
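
Something along these lines - a sketch, the actual value below is made up:

#define UNDO_CHECKPOINT_FILE_MAGIC	0x55444F01	/* placeholder value */

	uint32		magic = UNDO_CHECKPOINT_FILE_MAGIC;

	/* include the magic in the header checksum */
	INIT_CRC32C(crc);
	COMP_CRC32C(crc, &magic, sizeof(magic));
	COMP_CRC32C(crc, &UndoLogShared->low_logno, sizeof(UndoLogShared->low_logno));
	COMP_CRC32C(crc, &UndoLogShared->next_logno, sizeof(UndoLogShared->next_logno));
	COMP_CRC32C(crc, &num_logs, sizeof(num_logs));
	FIN_CRC32C(crc);

	/* write the magic before the rest of the header */
	if (write(fd, &magic, sizeof(magic)) != sizeof(magic))
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not write to file \"%s\": %m", path)));

A reader can then verify the magic (and, if you go that route, a version)
before trusting anything else in the file.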

> + /* Write out the meta data for all active undo logs. */
> + data = (char *) serialized;
> + INIT_CRC32C(crc);
> + serialized_size = num_logs * sizeof(UndoLogMetaData);
> + while (serialized_size > 0)
> + {
> + ssize_t written;
> +
> + written = write(fd, data, serialized_size);
> + if (written < 0)
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not write to file \"%s\": %m", path)));
> + COMP_CRC32C(crc, data, written);
> + serialized_size -= written;
> + data += written;
> + }
> + FIN_CRC32C(crc);
> +
> + if (write(fd, &crc, sizeof(crc)) != sizeof(crc))
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not write to file \"%s\": %m", path)));
> +

The number of small writes here makes me wonder if this shouldn't either
use fopen()/fwrite() or a manual buffer.
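
E.g. for the fixed-size header, something like this sketch:

	char		hdr[sizeof(UndoLogShared->low_logno) +
					sizeof(UndoLogShared->next_logno) +
					sizeof(num_logs) +
					sizeof(crc)];
	char	   *p = hdr;

	memcpy(p, &UndoLogShared->low_logno, sizeof(UndoLogShared->low_logno));
	p += sizeof(UndoLogShared->low_logno);
	memcpy(p, &UndoLogShared->next_logno, sizeof(UndoLogShared->next_logno));
	p += sizeof(UndoLogShared->next_logno);
	memcpy(p, &num_logs, sizeof(num_logs));
	p += sizeof(num_logs);
	memcpy(p, &crc, sizeof(crc));

	if (write(fd, hdr, sizeof(hdr)) != sizeof(hdr))
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not write to file \"%s\": %m", path)));

The serialized metadata is already contiguous, so that part could
presumably become a single write() too.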

> + /* Flush file and directory entry. */
> + pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_SYNC);
> + pg_fsync(fd);
> + if (CloseTransientFile(fd) < 0)
> + ereport(data_sync_elevel(ERROR),
> + (errcode_for_file_access(),
> + errmsg("could not close file \"%s\": %m", path)));
> + fsync_fname("pg_undo", true);
> + pgstat_report_wait_end();

Is there a risk in crashing during this, and leaving an incomplete file
in place? Presumably not, because the checkpoint wouldn't exist?

> +/*
> + * Find the UndoLogSlot object for a given log number.
> + *
> + * The caller may or may not already hold UndoLogLock, and should indicate
> + * this by passing 'locked'. We'll acquire it in the slow path if necessary.
> + * If it is not held by the caller, the caller must deal with the possibility
> + * that the returned UndoLogSlot no longer contains the requested logno by the
> + * time it is accessed.
> + *
> + * To do that, one of the following approaches must be taken by the calling
> + * code:
> + *
> + * 1. If the calling code knows that it is attached to this lock or is the

*this "log", not "lock", right?

> +static void
> +attach_undo_log(UndoPersistence persistence, Oid tablespace)
> +{
> + UndoLogSlot *slot = NULL;
> + UndoLogNumber logno;
> + UndoLogNumber *place;
> +
> + Assert(!InRecovery);
> + Assert(CurrentSession->attached_undo_slots[persistence] == NULL);
> +
> + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
> +
> + /*
> + * For now we have a simple linked list of unattached undo logs for each
> + * persistence level. We'll grovel though it to find something for the
> + * tablespace you asked for. If you're not using multiple tablespaces
> + * it'll be able to pop one off the front. We might need a hash table
> + * keyed by tablespace if this simple scheme turns out to be too slow when
> + * using many tablespaces and many undo logs, but that seems like an
> + * unusual use case not worth optimizing for.
> + */
> + place = &UndoLogShared->free_lists[persistence];
> + while (*place != InvalidUndoLogNumber)
> + {
> + UndoLogSlot *candidate = find_undo_log_slot(*place, true);
> +
> + /*
> + * There should never be an undo log on the freelist that has been
> + * entirely discarded, or hasn't been created yet. The persistence
> + * level should match the freelist.
> + */
> + if (unlikely(candidate == NULL))
> + elog(ERROR,
> + "corrupted undo log freelist, no such undo log %u", *place);
> + if (unlikely(candidate->meta.persistence != persistence))
> + elog(ERROR,
> + "corrupted undo log freelist, undo log %u with persistence %d found on freelist %d",
> + *place, candidate->meta.persistence, persistence);
> +
> + if (candidate->meta.tablespace == tablespace)
> + {
> + logno = *place;
> + slot = candidate;
> + *place = candidate->next_free;
> + break;
> + }
> + place = &candidate->next_free;
> + }

I'd replace the linked list with ilist.h ones.
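
I.e. give UndoLogSlot an slist_node and make the freelists slist_heads.
Roughly (a sketch; relies on shared memory being mapped at the same address
in every process, which it is for the main segment):

#include "lib/ilist.h"

typedef struct UndoLogSlot
{
	/* ... existing members ... */
	slist_node	next_free;		/* replaces UndoLogNumber next_free */
} UndoLogSlot;

typedef struct UndoLogSharedData
{
	slist_head	free_lists[UndoPersistenceLevels];
	/* ... existing members ... */
} UndoLogSharedData;

	/* in attach_undo_log(), scan and unlink in one go */
	slist_mutable_iter iter;

	slist_foreach_modify(iter, &UndoLogShared->free_lists[persistence])
	{
		UndoLogSlot *candidate = slist_container(UndoLogSlot, next_free, iter.cur);

		if (candidate->meta.tablespace == tablespace)
		{
			slist_delete_current(&iter);
			slot = candidate;
			break;
		}
	}

That gets rid of the hand-rolled *place manipulation entirely.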

< more tomorrow >

Greetings,

Andres Freund
