Re: POC: Cleaning up orphaned files using undo logs

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2019-07-26 04:08:32
Message-ID: CAA4eK1JgfJzvyiCAph6w7QAdVw6TTdZcUoA1J6D259QyDxUWLg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jul 25, 2019 at 11:22 AM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>
> On Wed, Jul 24, 2019 at 9:15 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > I have done some more review of undolog patch series and here are my comments:
>
> Hi Amit,
>
> Thanks! There a number of actionable changes in your review. I'll be
> posting a new patch set soon that will address most of your complaints
> individually. In this message want to respond to one topic area,
> because the answer is long enough already:
>
> > 2.
> > allocate_empty_undo_segment()
> > {
> > ..
> > ..
> > /* Flush the contents of the file to disk before the next checkpoint. */
> > + undofile_request_sync(logno, end / UndoLogSegmentSize, tablespace);
> > ..
> > }
> >
> > +void
> > +undofile_request_sync(UndoLogNumber logno, BlockNumber segno, Oid tablespace)
> > +{
> > + char path[MAXPGPATH];
> > + FileTag tag;
> > +
> > + INIT_UNDOFILETAG(tag, logno, tablespace, segno);
> > +
> > + /* Try to send to the checkpointer, but if out of space, do it here. */
> > + if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false))
> >
> >
> > The comment in allocate_empty_undo_segment indicates that the code
> > wants to flush before checkpoint, but the actual function tries to
> > register the request with checkpointer. Shouldn't this be similar to
> > XLogFileInit where we use pg_fsync to flush the contents immediately?
> > I guess that will avoid what you have written in comments in the same
> > function (we just want to make sure that the filesystem has allocated
> > physical blocks for it so that non-COW filesystems will report ENOSPC
> > now rather than later when space is needed). OTOH, I think it is
> > performance-wise better to postpone the work to checkpointer. If we
> > want to push this work to checkpointer, then we might need to change
> > comments or alternatively, we might want to use bigger segment sizes
> > to mitigate the performance effect.
>
> In an early version I was doing the fsync() immediately. While
> testing zheap, Mithun CY reported that whenever segments couldn't be
> recycled in the background, such as during a bit long-running
> transaction, he could measure ~6% of the time time spent waiting for
> fsync(), and throughput increased with bigger segments (and thus fewer
> files to fsync()). Passing the work off to the checkpointer is better
> not only because it's done in the background but also because there is
> a chance that the work can be consolidated with other sync requests,
> and perhaps even avoided completely if the file is discarded and
> unlinked before the next checkpoint.
>
> I'll update the comment to make it clearer.
>

Okay, that makes sense.

> > If my above understanding is correct and the reason to fsync
> > immediately is to reserve space now, then we also need to think
> > whether we are always safe in postponing the work? Basically, if this
> > means that it can fail when we are actually trying to write undo, then
> > it could be risky because we could be in the critical section at that
> > time. I am not sure about this point, rather it is just to discuss if
> > there are any impacts of postponing the fsync work.
>
> Here is my theory for why this arrangement is safe, and why it differs
> from what we're doing with WAL segments and regular relation files.
> First, let's review why those things work the way they do (as I
> understand it):
>
> 1. WAL's use of fdatasync():
>

I was referring to function XLogFileInit which doesn't appear to be
directly using fdatasync.

>
> 3. Separate size tracking: Another reason that regular relations
> write out zeroes at relation-extension time is that that's the only
..
>
> To summarise, we write zeroes so we can report ENOSPC errors as early
> as possible, but we defer and consolidate fsync() calls because the
> files' contents and names don't actually have to survive power loss
> until a checkpoint says they existed at that point in the WAL stream.
>
> Does this make sense?
>

Yes, this makes sense. However, I wonder if we need to do some
special handling for ENOSPC while writing to file in this function
(allocate_empty_undo_segment). Basically, unlink/remove the file if
fail to write because of disk full, something similar to what we do in
XLogFileInit.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2019-07-26 04:13:24 Re: SegFault on 9.6.14
Previous Message Alvaro Herrera 2019-07-26 03:30:55 Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)