Quick Links

Re: Global snapshots

From:	Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2018-05-14 11:20:59
Message-ID:	25966165-21FE-4C9C-A5DD-4E31D82487FB@postgrespro.ru
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

> On 9 May 2018, at 17:51, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> Ouch. That's not so bad at READ COMMITTED, but at higher isolation
> levels failure becomes extremely likely. Any multi-statement
> transaction that lasts longer than global_snapshot_defer_time is
> pretty much doomed.

Ouch indeed. Current xmin holding scheme has two major drawbacks: it introduces
timeout between export/import of snapshot and holds xmin in pessimistic way, so
old versions will be preserved even when there were no global transactions. On
a positive side is simplicity: that is the only way which I can think of that
doesn't require distributed calculation of global xmin, which in turn, will
probably require permanent connection to remote postgres_fdw node. It is not
hard to add some background worker to postgres_fdw that will hold permanent
connection, but I afraid that it is very discussion-prone topic and that's why
I tried to avoid that.

> I don't think holding back xmin is a very good strategy. Maybe it
> won't be so bad if and when we get zheap, since only the undo log will
> bloat rather than the table. But as it stands, holding back xmin
> means everything bloats and you have to CLUSTER or VACUUM FULL the
> table in order to fix it.

Well, opened local transaction in postgres holds globalXmin for whole postgres
instance (with exception of STO). Also active global transaction should hold
globalXmin of participating nodes to be able to read right versions (again,
with exception of STO).
However, xmin holding scheme itself can be different. For example we can
periodically check (lets say every 1-2 seconds) oldest GlobalCSN on each node
and delay globalXmin advancement only if there is really exist some long
transaction. So the period of bloat will be limited by this 1-2 seconds, and
will not impose timeout between export/import.
Also, I want to note, that global_snapshot_defer_time values about
of tens of seconds doesn't change much in terms of bloat comparing to logical
replication. Active logical slot holds globalXmin by setting
replication_slot_xmin, which is advanced on every RunningXacts, which in turn
logged every 15 seconds (hardcoded in LOG_SNAPSHOT_INTERVAL_MS).

> If the behavior were really analogous to our existing "snapshot too
> old" feature, it wouldn't be so bad. Old snapshots continue to work
> without error so long as they only read unmodified data, and only
> error out if they hit modified pages.

That is actually a good idea that I missed, thanks. Really since all logic
for checking modified pages is already present, it is possible to reuse that
and don't raise "Global STO" error right when old snapshot is imported, but
only in case when global transaction read modified page. I will implement
that and update patch set.

Summarising, I think, that introducing some permanent connections to
postgres_fdw node will put too much burden on this patch set and that it will
be possible to address that later (in a long run such connection will be anyway
needed at least for a deadlock detection). However, if you think that current
behavior + STO analog isn't good enough, then I'm ready to pursue that track.

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

In response to

Re: Global snapshots at 2018-05-09 14:51:25 from Robert Haas

Responses

Re: Global snapshots at 2018-05-15 12:53:52 from Robert Haas

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	David Rowley	2018-05-14 11:29:17	Re: Incorrect comment in get_partition_dispatch_recurse
Previous Message	Ashutosh Bapat	2018-05-14 11:14:23	Re: [HACKERS] advanced partition matching algorithm for partition-wise join