Re: POC: Cleaning up orphaned files using undo logs

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2019-08-16 03:24:20
Message-ID: CAFiTN-vtM1Zk4eJARebCc_eHdnqp342CacpxCp9HBoyt2UYDxA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Aug 14, 2019 at 10:35 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2019-08-14 14:48:07 +0530, Dilip Kumar wrote:
> > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > - I think there's two fairly fundamental, and related, problems with
> > > the sequence outlined above:
> > >
> > > - We can't search for target buffers to store undo data, while holding
> > > the "table" page content locked.
> > >
> > > The easy way to solve would be to require sites performing UNDO
> > > logging to acquire victim pages before even acquiring other content
> > > locks. Perhaps the better approach could be for the undo layer to
> > > hold onto a number of clean buffers, and to keep the last page in an
> > > already written to undo log pinned.
> > >
> > > - We can't search for victim buffers for further undo data while
> > > already holding other undo pages content locked. Doing so means that
> > > while we're e.g. doing IO to clean out the new page, old undo data
> > > on the previous page can't be read.
> > >
> > > This seems easier to fix. Instead of PrepareUndoInsert() acquiring,
> > > pinning and locking buffers, it'd need to be split into two
> > > operations. One that acquires buffers and pins them, and one that
> > > locks them. I think it's quite possible that the locking operation
> > > could just be delayed until InsertPreparedUndo(). But if we solve
> > > the above problem, most of this might already be solved.
> >
> > Basically, that means
> > - the caller should call PreparedUndoInsert before acquiring table
> > page content lock right? because the PreparedUndoInsert just compute
> > the size, allocate the space and pin+lock the buffers and for pinning
> > the buffers we must compute the size and allocate the space using undo
> > storage layer.
>
> I don't think we can normally pin the undo buffers properly at that
> stage. Without knowing the correct contents of the table page - which we
> can't know without holding some form of lock preventing modifications -
> we can't know how big our undo records are going to be. And we can't
> just have buffers that don't exist on disk in shared memory, and we
> don't want to allocate undo that we then don't need. So I think what
> we'd have to do at that stage, is to "pre-allocate" buffers for the
> maximum amount of UNDO needed, but mark the associated bufferdesc as not
> yet valid. These buffers would have a pincount > 0, but BM_TAG_VALID
> would not be set.
>
> So at the start of a function that will need to insert undo we'd need to
> pre-reserve the maximum number of buffers we could potentially
> need. That reservation stage would

Maybe we can provide an interface where the caller will input the max
prepared undo (maybe BeginUndoRecordInsert) and based on that we can
compute the max number of buffers we could potentially need for this
particular operation. Most of the operation insert/update/delete will
need 1 or 2 undo record so we can avoid pinning very high number of
buffers in most of the cases. Currently, only for the multi-insert
implementation of zheap we might need multiple undo-records (1 undo
per range of record).

> a) pin the page with the current end of the undo
> b) if needed pin the page of older undo that we need to update (e.g. to
> update the next pointer)
> c) perform clock sweep etc to acquire (find or create) enough clean to
> hold the maximum amount of undo needed. These buffers would be marked
> as !BM_TAG_VALID | BUF_REFCOUNT_ONE.
>
> I assume that we'd make a) cheap by keeping it pinned for undo logs that
> a backend is actively attached to. b) should only be needed once in a
> transaction, so it's not too bad. c) we'd probably need to amortize
> across multiple undo insertions, by keeping the unused buffers pinned
> until the end of the transaction.
>
> I assume that having the infrastructure c) might also make some code
> for already in postgres easier. There's obviously some issues around
> guaranteeing that the maximum number of such buffers isn't high.
>
>
> > - So basically, if we delay the lock till InsertPreparedUndo and call
> > PrepareUndoInsert before acquiring table page content lock this
> > problem is solved?
> >
> > Although I haven't yet analyzed the AM specific part that whether it's
> > always possible to call the PrepareUndoInsert(basically getting all
> > the undo record ready) before the page content lock. But, I am sure
> > that won't be much difficult part.
>
> I think that is somewhere between not possible, and so expensive in a
> lot of cases that we'd not want to do it anyway. You'd at leasthave to
> first acquire a content lock on the page, mark the target tuple as
> locked, then unlock the page, reserve undo, lock the table page,
> actually update it.
>
>
> > > - When reading an undo record, the whole stage of UnpackUndoData()
> > > reading data into a the UndoPackContext is omitted, reading directly
> > > into the UnpackedUndoRecord. That removes one further copy of the
> > > record format.
> > So we will read member by member to UnpackedUndoRecord? because in
> > context we have at least a few headers packed and we can memcpy one
> > header at a time like UndoRecordHeader, UndoRecordBlock.
>
> Well, right now you then copy them again later, so not much is gained by
> that (although that later copy can happen without the content lock
> held). As I think I suggested before, I suspect that the best way would
> be to just memcpy() the data from the page(s) into an appropriately
> sized buffer with the content lock held, and then perform unpacking
> directly into UnpackedUndoRecord. Especially with the bulk API that will
> avoid having to do much work with locks held, and reduce memory usage by
> only unpacking the record(s) in a batch that are currently being looked
> at.
ok.
>
> > But that just a few of them so if we copy field by field in the
> > UnpackedUndoRecord then we can get rid of copying in context then copy
> > it back to the UnpackedUndoRecord. Is this is what in your mind or
> > you want to store these structures (UndoRecordHeader, UndoRecordBlock)
> > directly into UnpackedUndoRecord?
>
> I at the moment see no reason not to?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Dilip Kumar 2019-08-16 04:14:25 Re: POC: Cleaning up orphaned files using undo logs
Previous Message Amit Kapila 2019-08-16 03:19:11 Re: [PATCH] Implement INSERT SET syntax