Quick Links

Re: Re: xReader, double-effort (was: Temporary tables under hot standby)

From:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To:	<tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	<andres(at)2ndQuadrant(dot)com>,<simon(at)2ndQuadrant(dot)com>, <aakash(dot)bits(at)gmail(dot)com>, <josh(at)postgresql(dot)org>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Re: xReader, double-effort (was: Temporary tables under hot standby)
Date:	2012-04-29 14:27:29
Message-ID:	4F9D09810200002500047540@gw.wicourts.gov
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Tom Lane wrote:
> "Kevin Grittner" writes:

>> Well, the xReader approach (Aakash's GSoC project) is to serve as
>> a proxy for a WAL stream going to a hot standby, to interpret each
>> incoming WAL record by "cheating" and querying the HS before
>> passing the WAL along, and then using sync rep to the HS
>> regardless of whether the incoming stream is sync or async. What
>> reliability problem do you see with this approach?
>
> Well, first off, it requires a replica that *is* a physical replica
> (else it won't have the same table OIDs, for instance). I'm not
> sure I get the point of logical replication that requires a
> physical replica as a prerequisite.

Maybe I can help with that by describing what the Wisconsin court
system does for circuit court data. Due to state law, state supreme
court rules, and the general suckiness of the WAN available to us,
each of Wisconsin's 72 counties has a database server in the county
seat which is the "official" data of record. We have two parallel
forms of backup -- WAL-based and logical.

For WAL-based replication we maintain a copy of the last two weekly
base backups and all WAL files needed to recover from either of them
to any point in time. One copy of this is required to be in the
county, on a separate system with separate drives. The other copy is
maintained centrally. One rather large server has a hot standby of
all county databases, currently fed through copying WAL files
(although we plan to move to streaming this year). These "physical"
backups are the primary source for recovering from catastrophic
failure of a county server.

We have software to "top off" the recovered database with
transactions from the logical transaction stream as needed, since
that is more up-to-date. We have also found this combination of
physical and logical transactions useful when someone trying to do a
data fix with direct SQL statements against a production database
mangled things badly. We use PITR recovery up to just before the
damage, then apply logical transactions from that point forward,
skipping the problem transactions.

Until recently our logical transaction stream was generated from our
Java middle tier running on the database servers. This has recently
been converted to use a generalized PostgreSQL trigger, written in C,
which is attached to all tables for which we want logical replication
(about 90% of them). The trigger captures data into transaction
tables in the database as part of the same transaction they are
representing, so that the generation of the logical transaction data
is atomic with the represented data. We didn't switch to one of the
existing solutions because of all the code downstream which uses the
logical transaction streams -- we didn't want to change the format or
delivery method of the logical steam because it's simply not feasible
to change everything at once.

The way the logical stream is currently delivered is still through
the aforementioned Java middle tier. It reads the logical
transaction log tables and sends the same data in non-persistent JMS
messages through queues to the "TRAN client" at the central location,
just as we did for years before we ever heard of PostgreSQL.
Normally, the TRAN client reads each logical transaction once, and
feeds many "replication targets" at the central site. These include:

- four "Central Information Repository" (CIR) databases each of
which contains data from all 72 counties (two machines each at
two different sites)

- four "Database Transaction Repository" (DTR) databases on the
same servers as the CIR databases (but separate RAIDs) This is
used as a supplementary source for recovery (as mentioned
above), as well as for auditing what happened on a case when
there is any question, and certain activity reports.

- a filter to identify transactions of possible interest to the
publishers we use for interfaces to other agencies (district
attorneys, public defenders, department of corrections, police
agencies, social service agencies, etc.) Transactions of
interest are queued for closer review by the publisher software,
which actually posts messages if the transaction is found to
actually be of interest.

- other databases for the state court's "central services" for
such things as court interpreters used for non-English speakers
in the various counties

The number and varied nature of the targets is significant, because
the "fan out" and timing differences need to be considered. The TRAN
client normally receives one copy of the logical data for each
transaction, and about 99% of the time the transaction's data is
replicated to all databases (and queued for review by publishers)
within a second of someone pressing Enter 300 miles away. It's that
other 1% that gets really interesting.

To deal with the possibility that one or more central targets were
down or running very slowly, the TRAN client will give up on keeping
them at the same point in the transaction stream after a
(configurable) delay. Since this replication is important for
disaster recovery, we don't want problems on one server to keep
others from staying up-to-date. As currently configured, for every
set of targets which falls behind, there is a redundant stream of
logical transactions across the WAN.

Let's talk about the WAN. We're somewhat "compelled" to use the WAN
offered by the state's Department of Administration. It is
expensive, so we use the slowest speed that will work for each
county. That can be as low as 750 Mbps up to 10 Gbps, and even at
those pitiful rates, this was costing us (last I heard) over $80,000
per month -- so boosting WAN speed is a last resort. Besides being
the conduit for both the WAL and logical transaction streams, this
same WAN is used for file server backups, access to the Internet for
court staff, and inter-county database access (such as District Court
Administrators or Chief Judges access all counties in their
respective districts).

Sometimes a WAN pipe gets saturated and the transaction streams for a
county fall behind. This can be aggravated by other traffic (such as
when all court employees in a county are asked to view a training
video via the Internet), or by large raster-scanned documents being
added to the database. In these cases the "let the lagging target
spawn its own transaction stream" approach currently becomes
pathological. On the saturated connection an ACK of a message from
one or more targets may be delayed long enough that it splits the
stream and makes a bad situation even worse.

We have an item on our annual plan to try to address these problems,
which two people were assigned to start working on next month. The
xReader technology proposed by Aakash is so much better than what we
otherwise can do, that when management heard about it, they delayed
that project in hopes that by waiting a few months, it can be done
with xReader. The goal is to move to streaming replication and have
that be the one stream of transaction data flowing over the WAN from
the counties to the central location. From this stream we want to
generate the logical transactions to feed all the other systems,
while maintaining the WAL-based HS as the primary source for disaster
recovery. It is important that, at least initially, we can produce
logical transactions in the format accepted by all these other
internal systems.

The plan *was* to continue to fire the triggers to capture the data
on the county servers, but read those tables on the HS. That has
many obvious disadvantages compared to either of the proposed logical
replication systems. But for our purposes, generating logical
transactions at the central site rather than at the original database
is far superior, in that it means we only send the data over the WAN
once, not twice.

> Next, it breaks immediately in the face of DDL:
>
> CREATE TABLE foo (f1 int);
> BEGIN;
> ALTER TABLE foo ALTER COLUMN f1 RENAME TO f2;
> INSERT INTO foo (f2) VALUES (1);
> COMMIT;
>
> The standby is not going to think that the ALTER is committed, so
> it will not report the right column name when it comes time to
> translate the INSERT. Actually, you can break it even more easily
> than that:
>
> CREATE TABLE bar AS SELECT ...
>
> What will you do with the insertions executed by this CREATE? They
> have to be executed before the creation of table bar is committed.
>
> Also, you'd need a query connection per database (at least, maybe
> one per session if you were trying to track DDL effects), which
> seems rather a lot of load on the HS slave. That together with the
> requirement for synchronous operation seems absolutely catastrophic
> from a performance standpoint.

Good points, but I think they all yield to one solution. If we can
generate the appropriate snapshot based on the WAL record we are
looking at, those problems all evaporate, don't they? It seems to me
that we could build on the "shared snapshot" work to provide a means
to this end.

> If you want to spend the summer building a toy, fine, but I don't
> see this going anywhere for production purposes.

I, the whole DBA group here, and the managers are very excited about
putting such a technology into production here. Enough to probably
be willing to find a way to do it under version 9.1 this year if at
all possible.

> It would be enormously more performant for the master to be
> emitting logical replication records to start with, since it
> already has all the right names etc at hand at basically no cost.

Not when the consumers are across a WAN, and that WAN is the biggest
performance bottleneck and the most expensive resource involved.

I will cheerfully agree that both approaches have valuable use-cases.

-Kevin

Responses

Re: Re: xReader, double-effort (was: Temporary tables under hot standby) at 2012-04-29 17:02:47 from Simon Riggs
Re: Re: xReader, double-effort (was: Temporary tables under hot standby) at 2012-05-03 04:24:38 from Jim Nasby

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tom Lane	2012-04-29 15:06:58	Re: smart shutdown at end of transaction (was: Default mode for shutdown)
Previous Message	Kevin Grittner	2012-04-29 12:40:58	Re: default_transaction_isolation = serializable causes crash under Hot Standby