Re: POC: Cleaning up orphaned files using undo logs

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2019-08-07 10:57:08
Message-ID: e8116a4f-de54-4d2b-3e25-4937364749e5@iki.fi
Lists: pgsql-hackers

On 05/08/2019 16:24, Robert Haas wrote:
> On Sun, Aug 4, 2019 at 5:16 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>> I feel that the level of abstraction is not quite right. There are a
>> bunch of fields, like uur_block, uur_offset, uur_tuple, that are
>> probably useful for some UNDO resource managers (zheap I presume), but
>> seem kind of arbitrary. How is uur_tuple different from uur_payload?
>> Should they be named more generically as uur_payload1 and uur_payload2?
>> And why two, why not three or four different payloads? In the WAL record
>> format, there's a concept of "block id", which allows you to store N
>> number of different payloads in the record, I think that would be a
>> better approach. Or only have one payload, and let the resource manager
>> code divide it as it sees fit.
>>
>> Many of the fields support a primitive type of compression, where a
>> field can be omitted if it has the same value as on the first record on
>> an UNDO page. That's handy. But again I don't like the fact that the
>> fields have been hard-coded into the UNDO record format. I can see e.g.
>> the relation oid to be useful for many AMs. But not all. And other AMs
>> might well want to store and deduplicate other things, aside from the
>> fields that are in the patch now. I'd like to move most of the fields to
>> AM specific code, and somehow generalize the compression. One approach
>> would be to let the AM store an arbitrary struct, and run it through a
>> general-purpose compression algorithm, using the UNDO page's first
>> record as the "dictionary".
>
> I thought about this, too. I agree that there's something a little
> unsatisfying about the current structure, but I haven't been able to
> come up with something that seems definitively better. I think
> something along the lines of what you are describing here might work
> well, but I am VERY doubtful about the idea of a fixed-size struct. I
> think AMs are going to want to store variable-length data: especially
> tuples, but maybe also other stuff. For instance, imagine some AM that
> wants to implement locking that's more fine-grained than the four
> levels of tuple locks we have today: instead of just having key locks
> and all-columns locks, you could want to store the exact columns to be
> locked. Or maybe your TIDs are variable-width.

Sure, a fixed-size struct is quite limiting. My point is that all that
packing of data into UNDO records should be AM-specific. Maybe it would
be handy to have a few common fields in the undo record header itself,
but most data should be in the AM-specific payload, because it varies
across AMs.

> And the problem is that as soon as you move to something where you
> pack in a bunch of variable-sized fields, you lose the ability to
> refer to things using reasonable names. That's where I came up with
> the idea of an UnpackedUndoRecord: give the common fields that
> "everyone's going to need" human-readable names, and jam only the
> strange, AM-specific stuff into the payload. But if those needs are
> not actually universal but very much AM-specific, then I'm afraid
> we're going to end up with deeply inscrutable code for packing and
> unpacking records. I imagine it's possible to come up with a good
> structure for that, but I don't think we have one today.

Yeah, that's also a problem with complicated WAL record types. Hopefully
the complex cases are an exception, not the norm. A complex case is
unlikely to fit any pre-defined set of fields anyway. (We could look at
how e.g. protobuf works, if this is really a big problem. I'm not
suggesting that we add a dependency just for this, but there might be
some patterns or interfaces that we could mimic.)

If you remember, we did a big WAL format refactoring in 9.5, which moved
some information from AM-specific structs to the common headers. Namely,
the information on the relation blocks that the WAL record applies to.
That was a very handy refactoring, and allowed tools like pg_waldump to
print more detailed information about all WAL record types. For WAL
records, moving the block information was natural, because there was
special handling for full-page images anyway. However, I don't think we
have enough experience with UNDO log yet, to know which fields would be
best to include in the common undo header, and which to leave as
AM-specific payload. I think we should keep the common header slim, and
delegate to the AM routines.

For UNDO records, having an XID on every record probably makes sense;
all the use cases for UNDO log we've discussed are transactional. The
rules on which UNDO records to apply and what/when to discard, depend on
whether a transaction committed or aborted and when, so you need the XID
for that. Although, the rule also depends on the AM; for cleaning up
orphaned files, an UNDO record for a committed transaction can be
discarded immediately, while zheap and zedstore records need to be kept
around longer. So the exact rules for that will need to be AM-specific,
too. Or maybe there are only a few different cases and we can enumerate
them, so that an AM can just set a flag on the UNDO record to indicate
when it can be discarded, instead of having a callback or some other
totally generic approach.

In short, I think we should keep the common code that deals with UNDO
records more dumb, and delegate to the AMs more. That's enough for
cleaning up orphaned files, we don't really need the more complicated
stuff for that. We probably need more smarts for zheap/zedstore, but we
don't quite know what it should look like yet. Let's keep it simple for
now, so that we can get something we can review and commit sooner, and
we can build on top of that later.

>> I don't like the way UndoFetchRecord returns a palloc'd
>> UnpackedUndoRecord. I would prefer something similar to the xlogreader
>> API, where a new call to UndoFetchRecord invalidates the previous
>> result. On efficiency grounds, to avoid the palloc, but also to be
>> consistent with xlogreader.
>
> I don't think that's going to work very well, because we often need to
> deal with multiple records at a time. There is (or was) a bulk-fetch
> interface, but I've also found while experimenting with this code that
> it can be useful to do things like:
>
> current = undo_fetch(starting_record);
> loop:
>     next = undo_fetch(current->next_record_ptr);
>     if some_test(next):
>         break;
>     undo_free(current);
>     current = next;
>
> I think we shouldn't view such cases as exceptions to the general
> paradigm of looking at undo records one at a time, but instead as the
> normal case for which everything is optimized. Cases like orphaned
> file cleanup where the number of undo records is probably small and
> they're all independent of each other will, I think, turn out to be
> the exception rather than the rule.

Hmm. If you're following an UNDO chain from newest to oldest, I would
assume that the newer record has enough information to decide whether
you need to look at the previous record. If the previous record is no
longer interesting, it might already have been discarded, after all.

I tried to browse through the zheap code but couldn't see that pattern.
I'm not too familiar with the code, so I might've looked in the wrong
place, though.

- Heikki
