Re: POC: Cleaning up orphaned files using undo logs

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2019-08-09 08:39:16
Message-ID: CAA4eK1K_6wQPd=f4JeeWeaJcKU-YopVd90WUouuKVodGRKFOAQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jul 23, 2019 at 8:12 PM Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
>
> --------------------
>
> Some further review comments for undoworker.c :
>
>
> +/* Sets the worker's lingering status. */
> +static void
> +UndoWorkerIsLingering(bool sleep)
> The function name sounds like "is the worker lingering ?". Can we
> rename it to something like "UndoWorkerSetLingering" ?
>

makes sense, changed as per suggestion.

> -------------
>
> + errmsg("undo worker slot %d is empty, cannot attach",
> + slot)));
>
> + }
> +
> + if (MyUndoWorker->proc)
> + {
> + LWLockRelease(UndoWorkerLock);
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("undo worker slot %d is already used by "
> + "another worker, cannot attach", slot)));
>
> These two error messages can have a common error message "could not
> attach to worker slot", with errdetail separate for each of them :
> slot %d is empty.
> slot %d is already used by another worker.
>
> --------------
>

Changed as per suggestion.

> +static int
> +IsUndoWorkerAvailable(void)
> +{
> + int i;
> + int alive_workers = 0;
> +
> + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
>
> Should have bool return value.
>
> Also, why is it keeping track of number of alive workers ? Sounds like
> earlier it used to return number of alive workers ? If it indeed needs
> to just return true/false, we can do away with alive_workers.
>
> Also, *exclusive* lock is unnecessary.
>
> --------------
>

Changed as per suggestion. Additionally, I changed the name of the
function to UndoWorkerIsAvailable(), so that it is similar to other
functions in the file.

> +if (UndoGetWork(false, false, &urinfo, NULL) &&
> + IsUndoWorkerAvailable())
> + UndoWorkerLaunch(urinfo);
>
> There is no lock acquired between IsUndoWorkerAvailable() and
> UndoWorkerLaunch(); that means even though IsUndoWorkerAvailable()
> returns true, there is a small window where UndoWorkerLaunch() does
> not find any worker slot with in_use false, causing assertion failure
> for (worker != NULL).
> --------------
>

I have removed the assert and instead added a warning. I have also
added a comment from the place where we call UndoWorkerLaunch to
mention the race condition.

> +UndoWorkerGetSlotInfo(int slot, UndoRequestInfo *urinfo)
> +{
> + /* Block concurrent access. */
> + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
> *Exclusive* lock is unnecessary.
> -------------
>

Right, changed to share lock.

> + LWLockRelease(UndoWorkerLock);
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("undo worker slot %d is empty",
> + slot)));
> I believe there is no need to explicitly release an lwlock before
> raising an error, since the lwlocks get released during error
> recovery. Please check all other places where this is done.
> -------------
>

Fixed.

> + * Start new undo apply background worker, if possible otherwise return false.
> worker, if possible otherwise => worker if possible, otherwise
> -------------
>
> +static bool
> +UndoWorkerLaunch(UndoRequestInfo urinfo)
> We don't check UndoWorkerLaunch() return value. Can't we make it's
> return value type void ?
>

Now, the function returns void and accordingly I have adjusted the
comment which should address both the above comments.

> Also, it would be better to have urinfo as pointer to UndoRequestInfo
> rather than UndoRequestInfo, so as to avoid structure copy.
> -------------
>

Okay, changed as per suggestion.

> +{
> + BackgroundWorker bgw;
> + BackgroundWorkerHandle *bgw_handle;
> + uint16 generation;
> + int i;
> + int slot = 0;
> We can remove variable i, and use slot variable in place of i.
> -----------
>
> + snprintf(bgw.bgw_name, BGW_MAXLEN, "undo apply worker");
> I think it would be trivial to also append the worker->generation in
> the bgw_name.
> -------------
>

I am not sure if adding 'generation' is any useful. It might be better
to add database id as each worker can work on a particular database,
so that could be useful information.

>
> + if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
> + {
> + /* Failed to start worker, so clean up the worker slot. */
> + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
> + UndoWorkerCleanup(worker);
> + LWLockRelease(UndoWorkerLock);
> +
> + return false;
> + }
>
> Is it intentional that there is no (warning?) message logged when we
> can't register a bg worker ?
> -------------
>

Added a warning in that code path.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2019-08-09 08:54:50 Re: POC: Cleaning up orphaned files using undo logs
Previous Message Heikki Linnakangas 2019-08-09 08:34:05 default_table_access_method is not in sample config file