Re: POC: Cleaning up orphaned files using undo logs

From: Andres Freund <andres(at)anarazel(dot)de>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2019-07-29 23:35:20
Message-ID: 20190729233520.swd55m3r7bu2sy5b@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

I realize that this might not be the absolutely newest version of the
undo storage part of this patchset - but I'm trying to understand the
whole context, and that's hard without reading through the whole stack
in a situation where the layers actually fit together.

On 2019-07-29 01:48:30 -0700, Andres Freund wrote:
> < more tomorrow >

> + /* Move the high log number pointer past this one. */
> + ++UndoLogShared->next_logno;

Fwiw, I find having "next" and "low" as variable names, and then
describing "next" as high in comments somewhat confusing.

> +/* check_hook: validate new undo_tablespaces */
> +bool
> +check_undo_tablespaces(char **newval, void **extra, GucSource source)
> +{
> + char *rawname;
> + List *namelist;
> +
> + /* Need a modifiable copy of string */
> + rawname = pstrdup(*newval);
> +
> + /*
> + * Parse string into list of identifiers, just to check for
> + * well-formedness (unfortunateley we can't validate the names in the
> + * catalog yet).
> + */
> + if (!SplitIdentifierString(rawname, ',', &namelist))
> + {
> + /* syntax error in name list */
> + GUC_check_errdetail("List syntax is invalid.");
> + pfree(rawname);
> + list_free(namelist);
> + return false;
> + }

Why can't you validate the names against the catalog here? In a lot of
cases this will be called in a transaction, especially when changing it
in a session. E.g. temp_tablespaces does so?

> + /*
> + * Make sure we aren't already in a transaction that has been assigned an
> + * XID. This ensures we don't detach from an undo log that we might have
> + * started writing undo data into for this transaction.
> + */
> + if (GetTopTransactionIdIfAny() != InvalidTransactionId)
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + (errmsg("undo_tablespaces cannot be changed while a transaction is in progress"))));

Hm. Is this really a great proxy? Seems like it'll block changing the
tablespace unnecessarily in a lot of situations, and like there could
even be holes in the future - it doesn't seem crazy that we'd want to
emit undo without assigning an xid in some situations (e.g. for deleting
files in error cases, or for more aggressive cleanup of dead index
entries during reads or such).

It seems like it'd be pretty easy to just check
CurrentSession->attached_undo_slots[i].slot->meta.unlogged.this_xact_start
or such?

> +static bool
> +choose_undo_tablespace(bool force_detach, Oid *tablespace)
> +{

> + else
> + {
> + /*
> + * Choose an OID using our pid, so that if several backends have the
> + * same multi-tablespace setting they'll spread out. We could easily
> + * do better than this if more serious load balancing is judged
> + * useful.
> + */

We're not really choosing an oid, we're choosing a tablespace. Obviously
one can understand it as is, but it confused me for a second.

> + int index = MyProcPid % length;

Hm. Is MyProcPid a good proxy here? Wouldn't it be better to use
MyProc->pgprocno or such? That's much more guaranteed to space out
somewhat evenly?

> + int first_index = index;
> + Oid oid = InvalidOid;
> +
> + /*
> + * Take the tablespace create/drop lock while we look the name up.
> + * This prevents the tablespace from being dropped while we're trying
> + * to resolve the name, or while the called is trying to create an
> + * undo log in it. The caller will have to release this lock.
> + */
> + LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);

Why exclusive?

I think any function that acquires a lock it doesn't release (or the
reverse) ought to have a big honking comment in its header warning of
that. And an explanation as to why that is.

> + for (;;)
> + {
> + const char *name = list_nth(namelist, index);
> +
> + oid = get_tablespace_oid(name, true);
> + if (oid == InvalidOid)
> + {
> + /* Unknown tablespace, try the next one. */
> + index = (index + 1) % length;
> + /*
> + * But if we've tried them all, it's time to complain. We'll
> + * arbitrarily complain about the last one we tried in the
> + * error message.
> + */
> + if (index == first_index)
> + ereport(ERROR,
> + (errcode(ERRCODE_UNDEFINED_OBJECT),
> + errmsg("tablespace \"%s\" does not exist", name),
> + errhint("Create the tablespace or set undo_tablespaces to a valid or empty list.")));
> + continue;

Wouldn't it be better to simply include undo_tablespaces in the error
message? Something roughly like "none of the tablespaces in
undo_tablespaces = \"%s\" exists"?

> + /*
> + * If we came here because the user changed undo_tablesaces, then detach
> + * from any undo logs we happen to be attached to.
> + */
> + if (force_detach)
> + {
> + for (i = 0; i < UndoPersistenceLevels; ++i)
> + {
> + UndoLogSlot *slot = CurrentSession->attached_undo_slots[i];
> +
> + if (slot != NULL)
> + {
> + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
> + slot->pid = InvalidPid;
> + slot->meta.unlogged.xid = InvalidTransactionId;
> + LWLockRelease(&slot->mutex);

Would it make sense to re-assert here that the current transaction
didn't write undo?

> +bool
> +DropUndoLogsInTablespace(Oid tablespace)
> +{
> + DIR *dir;
> + char undo_path[MAXPGPATH];
> + UndoLogSlot *slot = NULL;
> + int i;
> +
> + Assert(LWLockHeldByMe(TablespaceCreateLock));

IMO this ought to be mentioned in a function header comment.

> + /* First, try to kick everyone off any undo logs in this tablespace. */
> + while ((slot = UndoLogNextSlot(slot)))
> + {
> + bool ok;
> + bool return_to_freelist = false;
> +
> + /* Skip undo logs in other tablespaces. */
> + if (slot->meta.tablespace != tablespace)
> + continue;
> +
> + /* Check if this undo log can be forcibly detached. */
> + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
> + if (slot->meta.discard == slot->meta.unlogged.insert &&
> + (slot->meta.unlogged.xid == InvalidTransactionId ||
> + !TransactionIdIsInProgress(slot->meta.unlogged.xid)))
> + {

Not everyone will agree, but this looks complicated enough that I'd put
it just in a simple wrapper function. If this were if
(CanDetachUndoForcibly(slot)) you'd not need a comment either...

Also, isn't the slot->meta.discard == slot->meta.unlogged.insert a
separate concern from detaching? My understanding is that it'll be
perfectly normal to have undo logs with undiscarded data that nobody is
attached to? In fact, I got confused below, because I initially didn't
spot any place that implemented the check referenced in the caller:

> + * Drop the undo logs in this tablespace. This will fail (without
> + * dropping anything) if there are undo logs that we can't afford to drop
> + * because they contain non-discarded data or a transaction is in
> + * progress. Since we hold TablespaceCreateLock, no other session will be
> + * able to attach to an undo log in this tablespace (or any tablespace
> + * except default) concurrently.
> + */
> + if (!DropUndoLogsInTablespace(tablespaceoid))
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("tablespace \"%s\" cannot be dropped because it contains non-empty undo logs",
> + tablespacename)));
> +
> + /*

> + else
> + {
> + /*
> + * There is data we need in this undo log. We can't force it to
> + * be detached.
> + */
> + ok = false;
> + }

Seems like we ought to return more information here. An error message
like:

> /*
> + * Drop the undo logs in this tablespace. This will fail (without
> + * dropping anything) if there are undo logs that we can't afford to drop
> + * because they contain non-discarded data or a transaction is in
> + * progress. Since we hold TablespaceCreateLock, no other session will be
> + * able to attach to an undo log in this tablespace (or any tablespace
> + * except default) concurrently.
> + */
> + if (!DropUndoLogsInTablespace(tablespaceoid))
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("tablespace \"%s\" cannot be dropped because it contains non-empty undo logs",
> + tablespacename)));

doesn't really allow a DBA to do anything about the issue. Seems we
ought to at least include the pid in the error message? I'd perhaps
just move the error message from DropTableSpace() into
DropUndoLogsInTablespace(). I don't think that's worse from a layering
perspective, and it allows raising a more precise error and simplifies
the API.

> + /*
> + * Put this undo log back on the appropriate free-list. No one can
> + * attach to it while we hold TablespaceCreateLock, but if we return
> + * earlier in a future go around this loop, we need the undo log to
> + * remain usable. We'll remove all appropriate logs from the
> + * free-lists in a separate step below.
> + */
> + if (return_to_freelist)
> + {
> + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
> + slot->next_free = UndoLogShared->free_lists[slot->meta.persistence];
> + UndoLogShared->free_lists[slot->meta.persistence] = slot->logno;
> + LWLockRelease(UndoLogLock);
> + }

There's multiple places that put logs onto the freelist. I'd put them
into one small function. Not primarily because it'll be easier to read,
but because it makes it easier to search for places that do so.

> + /*
> + * We detached all backends from undo logs in this tablespace, and no one
> + * can attach to any non-default-tablespace undo logs while we hold
> + * TablespaceCreateLock. We can now drop the undo logs.
> + */
> + slot = NULL;
> + while ((slot = UndoLogNextSlot(slot)))
> + {
> + /* Skip undo logs in other tablespaces. */
> + if (slot->meta.tablespace != tablespace)
> + continue;
> +
> + /*
> + * Make sure no buffers remain. When that is done by
> + * UndoLogDiscard(), the final page is left in shared_buffers because
> + * it may contain data, or at least be needed again very soon. Here
> + * we need to drop even that page from the buffer pool.
> + */
> + forget_undo_buffers(slot->logno, slot->meta.discard, slot->meta.discard, true);
> +
> + /*
> + * TODO: For now we drop the undo log, meaning that it will never be
> + * used again. That wastes the rest of its address space. Instead,
> + * we should put it onto a special list of 'offline' undo logs, ready
> + * to be reactivated in some other tablespace. Then we can keep the
> + * unused portion of its address space.
> + */
> + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
> + slot->meta.status = UNDO_LOG_STATUS_DISCARDED;
> + LWLockRelease(&slot->mutex);
> + }

Before I looked up forget_undo_buffers()'s implementation I wrote:

Hm. Iterating through shared buffers several times, especially when
there possibly could be a good-sized number of undo logs, seems a bit
superfluous. This probably isn't going to be that frequently used in
practice, so it's perhaps ok. But it seems like this might be used when
things are bad (i.e. there's a lot of UNDO).

But I still wonder about that. Especially when there's a lot of UNDO
(most of it not in shared buffers), this could end up doing a *crapton*
of buffer lookups. I'm inclined to think that this case - probably in
contrast to the discard case, would be better served using
DropRelFileNodeBuffers().

> + /* Remove all dropped undo logs from the free-lists. */
> + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
> + for (i = 0; i < UndoPersistenceLevels; ++i)
> + {
> + UndoLogSlot *slot;
> + UndoLogNumber *place;
> +
> + place = &UndoLogShared->free_lists[i];
> + while (*place != InvalidUndoLogNumber)
> + {
> + slot = find_undo_log_slot(*place, true);
> + if (!slot)
> + elog(ERROR,
> + "corrupted undo log freelist, unknown log %u", *place);
> + if (slot->meta.status == UNDO_LOG_STATUS_DISCARDED)
> + *place = slot->next_free;
> + else
> + place = &slot->next_free;
> + }
> + }
> + LWLockRelease(UndoLogLock);

Hm, shouldn't this check that the log is actually in the being-dropped
tablespace?

> +void
> +ResetUndoLogs(UndoPersistence persistence)
> +{

This imo ought to explain why one would want/need to do that. As far as
I can tell this implementation for example wouldn't be correct in all
that many situations, because it e.g. doesn't drop the relevant buffers?

Seems like this would need to assert that persistence isn't PERMANENT?

This is made more "problematic" by the fact that there's no caller for
this in this commit, only being used much later in the series. But I
think the comment should be there anyway. Hard to review (and
understand) otherwise.

Why is it correct not to take any locks here? The caller in 0014 afaict
is when we're already in hot standby, which means people will possibly
read undo?

> + UndoLogSlot *slot = NULL;
> +
> + while ((slot = UndoLogNextSlot(slot)))
> + {
> + DIR *dir;
> + struct dirent *de;
> + char undo_path[MAXPGPATH];
> + char segment_prefix[MAXPGPATH];
> + size_t segment_prefix_size;
> +
> + if (slot->meta.persistence != persistence)
> + continue;
> +
> + /* Scan the directory for files belonging to this undo log. */
> + snprintf(segment_prefix, sizeof(segment_prefix), "%06X.", slot->logno);
> + segment_prefix_size = strlen(segment_prefix);
> + UndoLogDirectory(slot->meta.tablespace, undo_path);
> + dir = AllocateDir(undo_path);
> + if (dir == NULL)
> + continue;
> + while ((de = ReadDirExtended(dir, undo_path, LOG)) != NULL)
> + {
> + char segment_path[MAXPGPATH];
> +
> + if (strncmp(de->d_name, segment_prefix, segment_prefix_size) != 0)
> + continue;

I'm perfectly fine with using MAXPGPATH buffers. But I do find it
confusing that in some places you're using dynamic allocations (in some
cases quite repeatedly, like in allocate_empty_undo_segment()), but here
you don't?

Hm, isn't this kinda O(#slot*#total_size_of_undo) due to going over the
whole tablespace for each log?

> + snprintf(segment_path, sizeof(segment_path), "%s/%s",
> + undo_path, de->d_name);
> + elog(DEBUG1, "unlinked undo segment \"%s\"", segment_path);
> + if (unlink(segment_path) < 0)
> + elog(LOG, "couldn't unlink file \"%s\": %m", segment_path);
> + }
> + FreeDir(dir);

I think the LOG should be done as an alternative to the DEBUG1, otherwise
it's going to be confusing. Should this really only be a LOG? Going to
be hard to clean up for a DBA later.

> +Datum
> +pg_stat_get_undo_logs(PG_FUNCTION_ARGS)
> +{
> +#define PG_STAT_GET_UNDO_LOGS_COLS 9
> + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
> + TupleDesc tupdesc;
> + Tuplestorestate *tupstore;
> + MemoryContext per_query_ctx;
> + MemoryContext oldcontext;
> + char *tablespace_name = NULL;
> + Oid last_tablespace = InvalidOid;
> + int i;
> +
> + /* check to see if caller supports us returning a tuplestore */
> + if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
> + ereport(ERROR,
> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> + errmsg("set-valued function called in context that cannot accept a set")));
> + if (!(rsinfo->allowedModes & SFRM_Materialize))
> + ereport(ERROR,
> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> + errmsg("materialize mode required, but it is not " \
> + "allowed in this context")));
> +
> + /* Build a tuple descriptor for our result type */
> + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
> + elog(ERROR, "return type must be a row type");

I wish we'd encapsulate this in one place instead of copying it over and
over.

Imo it's bad style to break error messages over multiple lines, makes it
harder to grep for.

> + /* Scan all undo logs to build the results. */
> + for (i = 0; i < UndoLogShared->nslots; ++i)
> + {
> + UndoLogSlot *slot = &UndoLogShared->slots[i];
> + char buffer[17];
> + Datum values[PG_STAT_GET_UNDO_LOGS_COLS];
> + bool nulls[PG_STAT_GET_UNDO_LOGS_COLS] = { false };
> + Oid tablespace;

Uncommented numbers like '17' for buffer lengths make me nervous.

> + values[0] = ObjectIdGetDatum((Oid) slot->logno);
> + values[1] = CStringGetTextDatum(
> + slot->meta.persistence == UNDO_PERMANENT ? "permanent" :
> + slot->meta.persistence == UNDO_UNLOGGED ? "unlogged" :
> + slot->meta.persistence == UNDO_TEMP ? "temporary" : "<uknown>");

s/uknown/unknown/

> + tablespace = slot->meta.tablespace;
> +
> + snprintf(buffer, sizeof(buffer), UndoRecPtrFormat,
> + MakeUndoRecPtr(slot->logno, slot->meta.discard));
> + values[3] = CStringGetTextDatum(buffer);
> + snprintf(buffer, sizeof(buffer), UndoRecPtrFormat,
> + MakeUndoRecPtr(slot->logno, slot->meta.unlogged.insert));
> + values[4] = CStringGetTextDatum(buffer);
> + snprintf(buffer, sizeof(buffer), UndoRecPtrFormat,
> + MakeUndoRecPtr(slot->logno, slot->meta.end));
> + values[5] = CStringGetTextDatum(buffer);

Makes me wonder if we shouldn't have a type for undo pointers.

> + if (slot->meta.unlogged.xid == InvalidTransactionId)
> + nulls[6] = true;
> + else
> + values[6] = TransactionIdGetDatum(slot->meta.unlogged.xid);
> + if (slot->pid == InvalidPid)
> + nulls[7] = true;
> + else
> + values[7] = Int32GetDatum((int32) slot->pid);
> + switch (slot->meta.status)
> + {
> + case UNDO_LOG_STATUS_ACTIVE:
> + values[8] = CStringGetTextDatum("ACTIVE"); break;
> + case UNDO_LOG_STATUS_FULL:
> + values[8] = CStringGetTextDatum("FULL"); break;
> + default:
> + nulls[8] = true;
> + }

Don't think this'll survive pgindent.

> + /*
> + * Deal with potentially slow tablespace name lookup without the lock.
> + * Avoid making multiple calls to that expensive function for the
> + * common case of repeating tablespace.
> + */
> + if (tablespace != last_tablespace)
> + {
> + if (tablespace_name)
> + pfree(tablespace_name);
> + tablespace_name = get_tablespace_name(tablespace);
> + last_tablespace = tablespace;
> + }

If we need to do this repeatedly, I think we ought to add a syscache for
tablespace names.

> + if (tablespace_name)
> + {
> + values[2] = CStringGetTextDatum(tablespace_name);
> + nulls[2] = false;
> + }
> + else
> + nulls[2] = true;
> +
> + tuplestore_putvalues(tupstore, tupdesc, values, nulls);

Seems like a CHECK_FOR_INTERRUPTS() in this loop wouldn't hurt.

> + }
> +
> + if (tablespace_name)
> + pfree(tablespace_name);

That seems a bit superfluous, given we're leaking plenty of other memory
(which is perfectly fine).

> +/*
> + * replay the creation of a new undo log
> + */
> +static void
> +undolog_xlog_create(XLogReaderState *record)
> +{
> + xl_undolog_create *xlrec = (xl_undolog_create *) XLogRecGetData(record);
> + UndoLogSlot *slot;
> +
> + /* Create meta-data space in shared memory. */
> + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
> +
> + /* TODO: assert that it doesn't exist already? */
> + slot = allocate_undo_log_slot();

Doesn't this need some error checking? allocate_undo_log_slot() will
return NULL if there's no slots left. E.g. restarting a server with a
lower max_connections could have one run into this easily?

> +/*
> + * Drop all buffers for the given undo log, from the old_discard to up
> + * new_discard. If drop_tail is true, also drop the buffer that holds
> + * new_discard; this is used when discarding undo logs completely, for example
> + * via DROP TABLESPACE. If it is false, then the final buffer is not dropped
> + * because it may contain data.
> + *
> + */
> +static void
> +forget_undo_buffers(int logno, UndoLogOffset old_discard,
> + UndoLogOffset new_discard, bool drop_tail)
> +{
> + BlockNumber old_blockno;
> + BlockNumber new_blockno;
> + RelFileNode rnode;
> +
> + UndoRecPtrAssignRelFileNode(rnode, MakeUndoRecPtr(logno, old_discard));
> + old_blockno = old_discard / BLCKSZ;
> + new_blockno = new_discard / BLCKSZ;
> + if (drop_tail)
> + ++new_blockno;
> + while (old_blockno < new_blockno)
> + {

Hm. This'll be quite bad if you have a lot more undo than
shared_buffers. Taking the partition lwlocks this many times will
hurt. OTOH, scanning all of shared buffers every time we truncate a few
hundred bytes of undo away is obviously also not going to work.

> + ForgetBuffer(SMGR_UNDO, rnode, UndoLogForkNum, old_blockno);
> + ForgetLocalBuffer(SMGR_UNDO, rnode, UndoLogForkNum, old_blockno++);

This seems odd to me - why do we need to scan both? We ought to know
which one is needed, right?

> + }

> +}
> +/*
> + * replay an undo segment discard record
> + */

Missing newline between functions.

> +static void
> +undolog_xlog_discard(XLogReaderState *record)
> +{
> + /*
> + * We're about to discard undologs. In Hot Standby mode, ensure that
> + * there's no queries running which need to get tuple from discarded undo.

nitpick: s/undologs/undo logs/? I think most other comments split it?

> + * XXX we are passing empty rnode to the conflict function so that it can
> + * check conflict in all the backend regardless of which database the
> + * backend is connected.
> + */
> + if (InHotStandby && TransactionIdIsValid(xlrec->latestxid))
> + ResolveRecoveryConflictWithSnapshot(xlrec->latestxid, rnode);

Hm. Perhaps it'd be better to change
ResolveRecoveryConflictWithSnapshot's API to just accept the
database (OTOH, we perhaps ought to be more granular in conflict
processing). Or just mention that it's ok to pass in an invalid rnode?

> + /*
> + * See if we need to unlink or rename any files, but don't consider it an
> + * error if we find that files are missing. Since UndoLogDiscard()
> + * performs filesystem operations before WAL logging or updating shmem
> + * which could be checkpointed, a crash could have left files already
> + * deleted, but we could replay WAL that expects the files to be there.
> + */

Or we could have crashed/restarted during WAL replay and processing the
same WAL again. Not sure if that's worth mentioning.

> + /* Unlink or rename segments that are no longer in range. */
> + while (old_segment_begin < new_segment_begin)
> + {
> + char discard_path[MAXPGPATH];
> +
> + /* Tell the checkpointer that the file is going away. */
> + undofile_forget_sync(slot->logno,
> + old_segment_begin / UndoLogSegmentSize,
> + slot->meta.tablespace);
> +
> + UndoLogSegmentPath(xlrec->logno, old_segment_begin / UndoLogSegmentSize,
> + slot->meta.tablespace, discard_path);
> +
> + /* Can we recycle the oldest segment? */
> + if (end < xlrec->end)
> + {
> + char recycle_path[MAXPGPATH];
> +
> + UndoLogSegmentPath(xlrec->logno, end / UndoLogSegmentSize,
> + slot->meta.tablespace, recycle_path);
> + if (rename(discard_path, recycle_path) == 0)
> + {
> + elog(DEBUG1, "recycled undo segment \"%s\" -> \"%s\"",
> + discard_path, recycle_path);
> + end += UndoLogSegmentSize;
> + }
> + else
> + {
> + elog(LOG, "could not rename \"%s\" to \"%s\": %m",
> + discard_path, recycle_path);
> + }
> + }
> + else
> + {
> + if (unlink(discard_path) == 0)
> + elog(DEBUG1, "unlinked undo segment \"%s\"", discard_path);
> + else
> + elog(LOG, "could not unlink \"%s\": %m", discard_path);
> + }
> + old_segment_begin += UndoLogSegmentSize;
> + }

The code to recycle or delete one segment exists in multiple places (at
least also in UndoLogDiscard()). Think it's long enough that it's easily
worthwhile to share.

> +/*

> @@ -1418,12 +1418,18 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf
> segmentpath = strstr(filename, ".");
> if (segmentpath != NULL)
> {
> - segmentno = atoi(segmentpath + 1);
> - if (segmentno == 0)
> + char *end;
> + if (strstr(readfilename, "undo"))
> + first_blkno = strtol(segmentpath + 1, &end, 16) / BLCKSZ;
> + else
> + first_blkno = strtol(segmentpath + 1, &end, 10) * RELSEG_SIZE;
> + if (*end != '\0')
> ereport(ERROR,
> - (errmsg("invalid segment number %d in file \"%s\"",
> - segmentno, filename)));
> + (errmsg("invalid segment number in file \"%s\"",
> + filename)));
> }
> + else
> + first_blkno = 0;
> }
> }

Hm. Not a fan of just using strstr() here. Can't quite articulate
why. Just somehow rubs me wrong.

> /*
> * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
> * a relcache entry for the relation.
> - *
> - * NB: At present, this function may only be used on permanent relations, which
> - * is OK, because we only use it during XLOG replay. If in the future we
> - * want to use it on temporary or unlogged relations, we could pass additional
> - * parameters.
> */
> Buffer
> ReadBufferWithoutRelcache(SmgrId smgrid, RelFileNode rnode, ForkNumber forkNum,
> BlockNumber blockNum, ReadBufferMode mode,
> - BufferAccessStrategy strategy)
> + BufferAccessStrategy strategy,
> + char relpersistence)
> {
> bool hit;
>
> - SMgrRelation smgr = smgropen(smgrid, rnode, InvalidBackendId);
> -
> - Assert(InRecovery);
> + SMgrRelation smgr = smgropen(smgrid, rnode,
> + relpersistence == RELPERSISTENCE_TEMP
> + ? MyBackendId : InvalidBackendId);
>
> - return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
> + return ReadBuffer_common(smgr, relpersistence, forkNum, blockNum,
> mode, strategy, &hit);
> }

Hm. Using this for undo access means that we don't do any buffer
read/hit counting that can be associated with the relation causing undo
to be read/written. That seems like a sizable monitoring deficiency.

> /*
> + * ForgetBuffer -- drop a buffer from shared buffers
> + *
> + * If the buffer isn't present in shared buffers, nothing happens. If it is
> + * present, it is discarded without making any attempt to write it back out to
> + * the operating system. The caller must therefore somehow be sure that the
> + * data won't be needed for anything now or in the future. It assumes that
> + * there is no concurrent access to the block, except that it might be being
> + * concurrently written.
> + */
> +void
> +ForgetBuffer(SmgrId smgrid, RelFileNode rnode, ForkNumber forkNum,
> + BlockNumber blockNum)
> +{
> + SMgrRelation smgr = smgropen(smgrid, rnode, InvalidBackendId);
> + BufferTag tag; /* identity of target block */
> + uint32 hash; /* hash value for tag */
> + LWLock *partitionLock; /* buffer partition lock for it */
> + int buf_id;
> + BufferDesc *bufHdr;
> + uint32 buf_state;
> +
> + /* create a tag so we can lookup the buffer */
> + INIT_BUFFERTAG(tag, smgrid, smgr->smgr_rnode.node, forkNum, blockNum);
> +
> + /* determine its hash code and partition lock ID */
> + hash = BufTableHashCode(&tag);
> + partitionLock = BufMappingPartitionLock(hash);
> +
> + /* see if the block is in the buffer pool */
> + LWLockAcquire(partitionLock, LW_SHARED);
> + buf_id = BufTableLookup(&tag, hash);
> + LWLockRelease(partitionLock);
> +
> + /* didn't find it, so nothing to do */
> + if (buf_id < 0)
> + return;
> +
> + /* take the buffer header lock */
> + bufHdr = GetBufferDescriptor(buf_id);
> + buf_state = LockBufHdr(bufHdr);
> + /*
> + * The buffer might been evicted after we released the partition lock and
> + * before we acquired the buffer header lock. If so, the buffer we've
> + * locked might contain some other data which we shouldn't touch. If the
> + * buffer hasn't been recycled, we proceed to invalidate it.
> + */
> + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
> + bufHdr->tag.blockNum == blockNum &&
> + bufHdr->tag.forkNum == forkNum)
> + InvalidateBuffer(bufHdr); /* releases spinlock */
> + else
> + UnlockBufHdr(bufHdr, buf_state);

Phew, I don't like this code one bit. It imo is a bad idea / unnecessary
to look up the buffer, unlock the partition lock, and then recheck
identity. And do exactly the same thing again in InvalidateBuffer()
(including making a copy of the tag while holding the buffer header
lock).

Seems like this should be something roughly like

ReservePrivateRefCountEntry();

LWLockAcquire(partitionLock, LW_SHARED);
buf_id = BufTableLookup(&tag, hash);
if (buf_id >= 0)
{
bufHdr = GetBufferDescriptor(buf_id);

buf_state = LockBufHdr(bufHdr);

/*
* Temporarily acquire pin - that prevents the buffer
* from being replaced with one that we did not intend
* to target.
*
* XXX:
*/
ref = PinBuffer_Locked(bufHdr, strategy);

/* release partition lock, acquire exclusively so we can drop */
LWLockRelease(partitionLock);

/* loop until nobody else has the buffer pinned */
while (true)
{
LWLockAcquire(partitionLock, LW_EXCLUSIVE);

buf_state = LockBufHdr(buf);

/*
* Check if somebody else is busy writing the buffer (we
* have one pin).
*/
if (BUF_STATE_GET_REFCOUNT(buf_state) == 1)
break;

// XXX: Should we assert IO_IN_PROGRESS? Ought to be the
// only way to get here.

/* wait for IO to finish, without holding locks */
UnlockBufHdr(buf, buf_state);
LWLockRelease(partitionLock);

Assert(GetPrivateRefCount(BufferDescriptorGetBuffer(buf)) == 1);
WaitIO(buf);

/* buffer identity can't change, we've a pin */

// XXX: Assert that the buffer isn't dirty anymore? There
// ought to be no possibility for it to get dirty now.
}

Assert(!(buf_state & BM_PIN_COUNT_WAITER));

/*
* Clear out the buffer's tag and flags. We must do this to ensure that
* linear scans of the buffer array don't think the buffer is valid.
*/
oldFlags = buf_state & BUF_FLAG_MASK;
CLEAR_BUFFERTAG(buf->tag);
buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
/* remove our refcount */
buf_state -= BUF_REFCOUNT_ONE;
UnlockBufHdr(buf, buf_state);

/*
* Remove the buffer from the lookup hashtable, if it was in there.
*/
if (oldFlags & BM_TAG_VALID)
BufTableDelete(&oldTag, oldHash);

/*
* Done with mapping lock.
*/
LWLockRelease(oldPartitionLock);

Assert(ref->refcount == 1);
ForgetPrivateRefCountEntry(ref);
ResourceOwnerForgetBuffer(CurrentResourceOwner, BufferDescriptorGetBuffer(buf));
}

or something in that vein. Now you can validly argue that this is more
complicated - but I also think that this is going to be a much hotter
path than normal relation drops.

<more after some errands>

Greetings,

Andres Freund
