Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, Daniel Farina <daniel(at)heroku(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Date: 2012-06-20 19:15:25
Message-ID: 201206202115.26350.andres@2ndquadrant.com
Lists: pgsql-hackers

Hi,

On Wednesday, June 20, 2012 08:32:53 PM Heikki Linnakangas wrote:
> On 20.06.2012 17:35, Simon Riggs wrote:
> > On 20 June 2012 16:23, Heikki Linnakangas
> >
> > <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> >> On 20.06.2012 11:17, Simon Riggs wrote:
> >>> On 20 June 2012 15:45, Heikki Linnakangas
> >>>
> >>> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> >>>> So, if the origin id is not sufficient for some conflict resolution
> >>>> mechanisms, what extra information do you need for those, and where do
> >>>> you put it?
> >>>
> >>> As explained elsewhere, wal_level = logical (or similar) would be used
> >>> to provide any additional logical information required.
> >>>
> >>> Update and Delete WAL records already need to be different in that
> >>> mode, so additional info would be placed there, if there were any.
> >>>
> >>> In the case of reflexive updates you raised, a typical response in
> >>> other DBMS would be to represent the query
> >>>
> >>> UPDATE SET counter = counter + 1
> >>>
> >>> by sending just the "+1" part, not the current value of counter, as
> >>> would be the case with the non-reflexive update
> >>>
> >>> UPDATE SET counter = 1
> >>>
> >>> Handling such things in Postgres would require some subtlety, which
> >>> would not be resolved in first release but is pretty certain not to
> >>> require any changes to the WAL record header as a way of resolving it.
> >>> Having already thought about it, I'd estimate that is a very long
> >>> discussion and not relevant to the OT, but if you wish to have it
> >>> here, I won't stop you.
> >>
> >> Yeah, I'd like to hear briefly how you would handle that without any
> >> further changes to the WAL record header.
> >
> > I already did:
> >>> Update and Delete WAL records already need to be different in that
> >>> mode, so additional info would be placed there, if there were any.
> >
> > The case you mentioned relates to UPDATEs only, so I would suggest
> > that we add that information to a new form of update record only.
> >
> > That has nothing to do with the WAL record header.
>
> Hmm, so you need the origin id in the WAL record header to do filtering.
> Except when that's not enough, you add some more fields to heap update
> and delete records.
Imo the whole +1 stuff doesn't have anything to do with the origin_id proposal
and should be ignored for quite a while. We might go to something like it
sometime in the future, but it's nothing we are working on (as far as I know ;)).

wal_level=logical (in patch 07) currently only changes the following things
about the WAL stream:

For HEAP_(INSERT|(HOT_)?UPDATE|DELETE)
* prevent full page writes from removing the row data (could be optimized at
some point to just store the tuple slot)

For HEAP_DELETE
* add the primary key of the changed row

HEAP2_MULTI_INSERT obviously needs to get the same treatment in the future.

The only real addition that I foresee in the near future is logging the old
primary key when the primary key changes in HEAP_UPDATE.

Kevin wants an option for full pre-images of rows in HEAP_(UPDATE|DELETE).

> Don't you think it would be simpler to only add the extra fields to heap
> insert, update and delete records, and leave the WAL record header
> alone? Do you ever need extra information on other record types?
It's needed in some more locations: HEAP_HOT_UPDATE, HEAP2_MULTI_INSERT,
HEAP_NEWPAGE, HEAP_XACT_(ASSIGN, COMMIT, COMMIT_PREPARED, COMMIT_COMPACT,
ABORT, ABORT_PREPARED) and probably some I don't remember right now.

Sure, we can add it to all of those, but then you need to have individual
knowledge about *all* of them, because the location where it's stored will be
different for each of them.
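
To illustrate the point, here is a rough sketch (Python, purely for
illustration). Everything in it is made up: the Record class, the record kinds
used as dictionary keys, the offsets and the 2-byte width of the origin field
are assumptions, not the actual WAL layout or anything from the patch. It only
shows that a header field can be read without knowing the record type, while a
per-record field needs a lookup table that has to cover every type:

```python
import struct
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical, simplified stand-in for a WAL record."""
    kind: str            # record type, e.g. "HEAP_UPDATE" (illustrative only)
    header_origin: int   # origin id stored in the common record header
    payload: bytes       # type-specific record body


def origin_from_header(rec: Record) -> int:
    # Works for every record type without knowing anything about the body.
    return rec.header_origin


# Alternative design: the origin lives inside each record body, at an offset
# that differs per record type. These offsets are invented for the example.
PAYLOAD_ORIGIN_OFFSET = {
    "HEAP_INSERT": 0,
    "HEAP_UPDATE": 4,
    "XACT_COMMIT": 8,
}


def origin_from_payload(rec: Record) -> int:
    # Raises KeyError for any record type the table does not know about,
    # which is exactly the per-type knowledge the header field avoids.
    offset = PAYLOAD_ORIGIN_OFFSET[rec.kind]
    (origin,) = struct.unpack_from("<H", rec.payload, offset)
    return origin


update = Record("HEAP_UPDATE", 7, b"\x00" * 4 + struct.pack("<H", 7))
newpage = Record("HEAP_NEWPAGE", 7, b"")
```

With these definitions origin_from_header() handles both records, whereas
origin_from_payload() only handles the update and fails on the newpage record
until somebody teaches the table about that type as well.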

To recap why we think origin_id is a sensible design choice:

There are many sensible replication topologies where it does make sense that
you want to receive changes (on node C) from one node (say B) that originated
from some other node (say A).
Reasons include:
* the order of applying changes should be as similar as possible on all nodes.
That means that when C applies a change that originated on B, and changes
replicate faster from A->B than from A->C, C should be at least as far along
with the replication from A as B was. Otherwise the conflict ratio will increase.
If you can recreate the stream from the WAL of every node and still detect
where an individual change originated, that's easy.
* the interconnects between some nodes may be more expensive than between others
* an interconnect between two nodes may fail while others don't
Because of that we think it's sensible to be able to generate the full LCR
stream with all changes, local and remote ones, on each individual node. If you
then can filter on individual origin_ids you can build complex replication
topologies without much additional complexity.
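
As a minimal sketch of that filtering (again Python for illustration only; the
Change class, the node ids and the filter_stream() helper are all hypothetical
and not part of the patch series): node B holds the full stream of changes,
local and remote, and C asks B for everything that originated on A or B, but
not for its own changes coming back.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Set


@dataclass(frozen=True)
class Change:
    """Hypothetical logical change record as it might be decoded from WAL."""
    origin_id: int    # node on which the change was originally made
    lsn: int          # position in the local stream (illustrative)
    description: str


def filter_stream(stream: Iterable[Change],
                  wanted_origins: Set[int]) -> Iterator[Change]:
    # Pass through only the changes whose origin the receiver asked for,
    # preserving the order in which they appear in the local stream.
    for change in stream:
        if change.origin_id in wanted_origins:
            yield change


NODE_A, NODE_B, NODE_C = 1, 2, 3

# Full stream as seen on node B: its own changes plus those applied from A and C.
stream_on_b = [
    Change(NODE_A, 10, "insert from A"),
    Change(NODE_B, 11, "local update on B"),
    Change(NODE_C, 12, "delete from C"),
    Change(NODE_A, 13, "update from A"),
]

# C receives A's and B's changes via B, in B's apply order, but not its own.
to_c = list(filter_stream(stream_on_b, {NODE_A, NODE_B}))
```

Because the order of stream_on_b is kept, C applies A's changes in the same
relative position to B's changes as B did, which is the ordering property
described in the first bullet above.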

> I'm not saying that we need to implement all possible conflict
> resolution algorithms right now - on the contrary I think conflict
> resolution belongs outside core - but if we're going to change the WAL
> record format to support such conflict resolution, we better make sure
> the foundation we provide for it is solid.
I think this already provides a lot. At some point we probably want to have
support for looking up on which node a certain local xid originated and when it
was originally executed. While querying that efficiently requires additional
support, we already have all the information for that.

There are some more complexities with consistently determining conflicts on
changes that happened in a very small time window on different nodes, but that's
something for another day.

> BTW, one way to work around the lack of origin id in the WAL record
> header is to just add an origin-id column to the table, indicating the
> last node that updated the row. That would be a kludge, but I thought
> I'd mention it..
Yuck. The aim is to improve on what's done today ;)

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
