Re: [PATCH 8/8] Introduce wal decoding via catalog timetravel

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Peter Geoghegan <peter(at)2ndquadrant(dot)com>, hlinnakangas(at)vmware(dot)com
Subject: Re: [PATCH 8/8] Introduce wal decoding via catalog timetravel
Date: 2012-10-11 02:21:10
Message-ID: 201210110421.15249.andres@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

First of: Awesome review.

On Thursday, October 11, 2012 01:02:26 AM Peter Geoghegan wrote:
> What follows is an initial overview of the patch (or at least my
> understanding of the patch, which you may need to correct), and some
> points of concern.
>
> > * applycache module which reassembles transactions from a stream of
> > interspersed changes
>
> This is what the design doc [2] refers to as "4.5. TX reassembly".
>
> This functionality is concentrated in applycache.c. As [2] notes, the
> reassembly component "was previously coined ApplyCache because it was
> proposed to run on replication consumers just before applying changes.
> This is not the case anymore. It is still called that way in the
> source of the patch recently submitted."
>
> The purpose of ApplyCache/transaction reassembly is to reassemble
> interlaced records, and organise them by XID, so that the consumer
> client code sees only streams (well, lists) of records split by XID.

> I meant to avoid talking about anything other than the bigger picture
> for now, but I must ask: Why the frequent use of malloc(),
> particularly within applycache.c? The obvious answer is that it's
> rough code and that that will change, but that still doesn't comport
> with my idea about how rough Postgres code should look, so I have to
> wonder if there's a better reason.
Several reasons, not sure how good:
- part of the code (was) supposed to be runnable on a target system without a
full postgres backend arround
- All of the allocations would basically have to be in TopMemoryContext or
something equally longlived as we don't have a transaction context or anything
like that
- the first revision showed that allocating memory was the primary bottleneck
so I added a small allocation cache to the critical pieces which solved that.
After that there seems no point in using mctx''s for that kind of memory
anymore.

> applycache.c has an acute paucity of comments, which makes it really
> hard to review well.
Working on that. I don't think its internals are really all that interesting
atm.

> I'm going to not comment much on this here, except to say that
> I think that the file should be renamed to reassembly.c or something
> like that, to reflect its well-specified purpose, and not how it might
> be used.
Agreed.

> Applycache is presumably where you're going to want to spill
> transaction streams to disk, eventually. That seems like a
> prerequisite to commit.
Yes.

> By the way, I see that you're doing this here:
>
> + /* most callers don't need snapshot.h */
> + typedef struct SnapshotData *Snapshot;
>
> Tom, Alvaro and I had a discussion about whether or not this was an
> acceptable way to reduce build dependencies back in July [8] – I lost
> that one. You're relying on a Gnu extension/C11 feature here (that is,
> typedef redefinition). If you find it intensely irritating that you
> cannot do this while following the standard to the letter, you are not
> alone.
Yuck :(. So I will just use struct SnapshotData directly instead of a
typedef...

> > * snapbuilder which builds catalog snapshots so that tuples from wal can
> > be understood
>
> This component analyses xlog and builds a special kind of Snapshot.
> This has been compared to the KnownAssignedXids machinery for Hot
> Standby [6] (see SnapBuildEndTxn() et al to get an idea of what is
> meant by this). Since decoding only has to occur within a single
> backend, I guess it's sufficient that it's all within local memory (in
> contrast to the KnownAssignedXids array, which is in shared memory).
I haven't found a convincing argument to share that information itself. If we
want to parallelise this I think sharing the snapshots should be sufficient.
Given the snapshot only covers system catalogs the changerate should be fine.

> The design document [2] really just explains the problem (which is the
> need for catalog metadata at a point in time to make sense of heap
> tuples), without describing the solution that this patch offers with
> any degree of detail. Rather, [2] says "How we build snapshots is
> somewhat intricate and complicated and seems to be out of scope for
> this document", which is unsatisfactory. I look forward to reading the
> promised document that describes this mechanism in more detail. It's
> really hard to guess what you might have meant to do, and why you
> might have done it, much less verifying the codes correctness.
Will concentrate on finishing that document.

> This functionality is concentrated in snapbuild.c. A comment in decode.c
> notes:
>
> + * Its possible that the separation between decode.c and snapbuild.c is
> a + * bit too strict, in the end they just about have the same struct.
>
> I prefer the current separation. I think it's reasonable that decode.c
> is sort of minimal glue code.
Good.

> Decoding means breaking up individual XLogRecord structs, and storing
> them in an applycache (applycache does this, and stores them as
> ApplyCacheChange records), while building a snapshot (which is needed
> in advance of adding tuples from records). It can be thought of as the
> small piece of glue between applycache and snapbuild that is called by
> XLogReader (DecodeRecordIntoApplyCache() is the only public function,
> which will be called by many xlogreader_state.finished_record-hooked
> plugin functions in practice, including this example one). An example
> of what belongs in decode.c is the way it ignores physical
> XLogRecords, because they are not of interest.
>
> By the way, why not just use struct assignment here?:
>
> + memcpy(&change->relnode, &xlrec->target.node, sizeof(RelFileNode));
At the time I initially wrote the code that seemed to be the project standard.
I think this only recently changed.

> > * decode_xlog(lsn, lsn) debugging function
>
> You consider this to be a throw-away function that won't ever be
> committed.
Only in its current hacked-together form.

> However, I strongly feel that you should move it into
> /contrib, so that it can serve as a sort of reference implementation
> for authors of decoder client code, in the same spirit as numerous
> existing contrib modules (think contrib/spi). I think that such a
> module could even be useful to people that were just a bit
> intellectually curious about how WAL works, which is something I'd
> like to encourage. Besides, simply having this code in a module will
> more explicitly demarcate client code (just look at logicalfuncs.c –
> it is technically client code, but that's too subtle right now).
I definitely think we need something like this, but it might look a bit
different. For reasons you found later (xmin horizon) I don't think we can run
it in the context of a normal backend for one.

> I don't like this code in decode_xlog():
>
> + apply_state = (ReaderApplyState*)xlogreader_state->private_data;
>
> Why is it necessary to cast here? In other words, why is private_data
> a void pointer at all?
The reason is that for a long time I hoped to keep ApplyCache's generic enough
to be usable outside of this exact scenario. I am not sure if thats a
worthwile goal anymore, especially as there are some layering violations now.

> > The applycache provides 3 major callbacks:
> > * apply_begin
> > * apply_change
> > * apply_commit
>
> These are callbacks intended to be used by third-party modules,
> perhaps including a full multi-master replication implementation
> (though this patch isn't directly concerned with that), or even a
> speculative future version of a logical replication system like Slony.
> [2] refers to these under "4.7. Output Plugin". These are the typedef
> for the hook types involved (incidentally, don't like the name of
> these types – I think you should lose the CB):
Not particularly attached to the CB, so I can loose it.

> +/* XXX: were currently passing the originating subtxn. Not sure thats
> necessary */
> +typedef void (*ApplyCacheApplyChangeCB)(ApplyCache* cache,
> ApplyCacheTXN* txn, ApplyCacheTXN* subtxn, ApplyCacheChange* change);
> +typedef void (*ApplyCacheBeginCB)(ApplyCache* cache, ApplyCacheTXN* txn);
> +typedef void (*ApplyCacheCommitCB)(ApplyCache* cache, ApplyCacheTXN* txn);
>
> So we register these callbacks like this in the patch:
>
> + /*
> + * allocate an ApplyCache that will apply data using lowlevel calls
> + * without type conversion et al. This requires binary compatibility
> + * between both systems.
> + * XXX: This would be the place too hook different apply methods, like
> + * producing sql and applying it.
> + */
> + apply_cache = ApplyCacheAllocate();
> + apply_cache->begin = decode_begin_txn;
> + apply_cache->apply_change = decode_change;
> + apply_cache->commit = decode_commit_txn;
>
> The decode_xlog(lsn, lsn) debugging function that Andres has played
> with [6] (that this patch makes available, for now) is where this code
> comes from.
>
> Whenever ApplyCache calls an "apply_change" callback for a single
> change (that is, a INSERT|UPDATE|DELETE) it locally overrides the
> normal SnapshotNow semantics used for catalog access with a previously
> built snapshot. Behaviour should now be consistent with a normal
> snapshotNow acquired when the tuple change was originally written to
> WAL.
I additionally want pass a mvcc-ish Snapshot (should just need a a change in
function signatures) so the output plugins can query the catalog manually.

> Having commented specifically on modules that Andres highlighted, I'd
> like to highlight one myself: tqual.c . This module has had
> significant new functionalty added, so it would be an omission to not
> pass remark on it in this opening e-mail, having mentioned all other
> modules with significant new pieces of functionality. The file has had
> new utility functions added, that pertain to snapshot visibility
> during decoding - "time travel".
For me that mentally was part of catalog timetravel, thats why I didn't
highlight it separately. But yes, this needs to be discussed.

> I draw attention to this. This code is located within the new function
> HeapTupleSatisfiesMVCCDuringDecoding(), which is analogous to what is
> done for "dirty snapshots" (dirty snapshots are used with
> ri_triggers.c, for example, when even uncommitted tuple should be
> visible).
Not sure where you see the similarity with dirty snapshots?

> Both of these functions are generally accessed through
> function pointers. Anyway, here's the code:
>
> + /*
> + * FIXME: The not yet existing decoding infrastructure will need to
> force + * the xmin to stay lower than what they are currently
decoding.
> + */
> + bool fixme_xmin_horizon = false;
>
> I'm sort of wondering what this is going to look like in the finished
> patch. This FIXME is rather hard for me to take at face value. It
> seems to me that the need to coordinate decoding with xmin horizon
> itself represents a not insignificant engineering challenge. So, with
> all due respect, it would be nice if I wasn't asked to make that leap
> of faith. The xmin horizon prepared transaction hack needs to go.
The idea is to do the decoding inside (or in something very similar) like
walsenders. Those already have the capability to keep the xmin horizon nailed
to some point for the hot_standby_feedback feature. There is a need for
something similar like what prepared statements do for restartability after a
restart though.

> Within tqual.h, shouldn't you have something like this, but for time
> travel snapshots during decoding?:
>
> #define InitDirtySnapshot(snapshotdata) \
> ((snapshotdata).satisfies = HeapTupleSatisfiesDirty)
Hm. Ok.

> Alternatively, adding a variable to these two might be appropriate:
>
> static SnapshotData CurrentSnapshotData = {HeapTupleSatisfiesMVCC};
> static SnapshotData SecondarySnapshotData = {HeapTupleSatisfiesMVCC};
Not sure where youre going from this?

> My general feeling is that the code is very under-commented, and in
> need of a polish, though I'm sure that you are perfectly well aware of
> that.
Totally aggreed.

> The basic way all of these components that I have described
> separately fit together is: (if others want to follow this, refer to
> decode_xlog())
>
> 1. Start with some client code “output plugin” (for now, a throw-away
> debugging function “decode_xlog()”)
The general idea is to integrate this into the walsender framework in a
command very roughly looking like:
START_LOGICAL_REPLICATION $plugin LSN

> \ /
> 2. Client allocates an XLogReaderState. (This module is a black box to
> me, though it's well encapsulated so that shouldn't matter much.
> Heikki is reviewing this [7]. Like I said, this isn't quite an atomic
> unit I'm reviewing.)
>
> \ /
> 3. Plugin registers various callbacks (within logicalfuncs.c). These
> callbacks, while appearing in this patch, are mostly NO-OPS, and are
> somewhat specific to XLogReader's concerns. I mostly defer to Heikki
> here.
>
> \ /
> 4. Plugin allocates an “ApplyCache”. Plugin assigns some more
> callbacks to “ApplyCache”. This time, they're the aforementioned 3
> apply cache functions.
>
> \ /
> 5. Plugin assigns this new ApplyCache to variable within private state
> of the XLogReader (this private state is a subset of its main state,
> and is opaque to XLogReader).
>
> \ /
> 6. Finally, plugin calls XLogReader(main_state).
>
> \ /
> 7. At some point during its magic,
> XLogReader calls the hook registered in step 3, finished_record.
> This is all it does directly
> with the plugin, which it makes minimal assumptions about.
>
> \ /
> 8. finished_record (which is
> logically a part of the “plugin”) knows what type the opaque
> private_data
> actually is. It casts it
> to an apply_state, and calls the decoder (supplying the apply_state as
> an argument to
> DecodeRecordIntoApplyCache()).
>
> \ /
> 9. During the first call
> (within the first record within a call to decode_xlog()), we allocate
> a snapshot reader.
>
> \ /
> 10. Builds snapshot callback.
> This scribbles on our snapshot state, which essentially encapsulates a
> snapshot.
> The state (and
> snapshot) changes continually, once per call.
It changes only if there were catalog changes in the transaction and/or we
haven't yet build an initial snapshot.

> \ /
> 11. Looks at XLogRecordBuffer
> (an XLogReader struct). Looks at an XLogRecord. Decodes based on
> record type.
> Let's assume it's an
> XLOG_HEAP_INSERT.
>
> \ /
> 12. DecodeInsert() called.
> This in turn calls DecodeXLogTuple(). We store the tuple metadata in
> our
> ApplyCache. (some
> ilists, somewhere, each corresponding to an XID). We don't store the
> relation oid, because we don't
> know it yet (only
> relfilenode is known from WAL).
> /
> /
> \ /
> 13. We're back in XLogReader(). It
> calls the only callback of interest to
> us covered in step 3 (and
> not of interest to XlogReader()/Heikki) – decode_change(). It does
> this through the
> apply_cache.apply_change
> hook. This happens because we encounter another record, this time a
> commit record (in the same
> codepath as discussed in step 12).
>
> \ /
> 14. In decode_change(), the actual
> function that raises the interesting WARNINGs within Andres'
> earlier example [6], showing
> actual integer/varchar Datum value for tuples previously inserted.
> Resolve table oid based on
> relfilenode (albeit unsatisfactorily).
> Using a StringInfo,
> tupledescs, syscache and typcache, build WARNING string.
>
> So my general opinion of how all this fits together is that it isn't
> quite right. Problems include:
I think its more understandable if you think that normally the initialization
code shouldn't be repeated in the plugins but the plugins are just called from
a walsender.

> * Why does the ReaderApplyState get magically initialised in two
> stages? apply_cache is initialised in decode_xlog (or whatever
> plugin). Snapstate is allocated within DecodeRecordIntoApplyCache()
> (with a dependency on apply_cache). Shouldn't this all be happening
> within a single function? As you yourself have point out, not everyone
> needs to know about these snapshots.
The reason is/was that I didn't want the outside know anything about the
Snapshot building, that seems to be only relevant to decode.c (and somewhat
applycache.c).
I can look at it though.

> * Maybe I've missed something, but I think you need a more
> satisfactory example plugin. What happens in step 14 is plainly
> unacceptable. You haven't adequately communicated to me how this is
> going to be used in logical replication. Maybe I just haven't got that
> far yet.
Well, what would you like to happen instead? Does it make more sense with the
concept that the output plugin will eventually run inside walsender(-ish) and
stream changes via COPY? I just couldn't finish a POC for that in time.

> I'm not impressed by the InvalidateSystemCaches() calls here
> and elsewhere.
Yes, those definitely got to go and we have nearly all the infrastructure for
it from Hot Standby. It was noted as a deficiency in the overview mail & the
commit message though ;). I have that locally.

The rough idea here is to reuse xl_xact_commit->nmsgs to handle the cache
behaviour. There are some more complications but let me first write the
timetravel document.

> * Please break-out the client code as a contrib module. That
> separation would increase the readability of the patch.
Well, as I said above the debugging function in its current form is not going
to walk, as soon we have aggreed on how to integrate this into walsender I am
definitely going to provide a example/debugging plugin which I definitely intend
to submit.

Thanks!

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2012-10-11 02:26:26 Re: [PATCH 8/8] Introduce wal decoding via catalog timetravel
Previous Message Bruce Momjian 2012-10-11 02:17:10 Re: Warnings from fwrite() in git head