Re: [PROPOSAL] Backup and recovery of pg_statistic

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Dmitry Ivanov <d(dot)ivanov(at)postgrespro(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PROPOSAL] Backup and recovery of pg_statistic
Date: 2015-12-23 03:16:05
Message-ID: 6221.1450840565@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Bruce Momjian <bruce(at)momjian(dot)us> writes:
> On Fri, Nov 27, 2015 at 06:52:59PM +0300, Dmitry Ivanov wrote:
>> Now that the approach has been developed, it may be applied to improve the
>> 'pg_dump' utility. Some minor code changes would make the 'pg_dump' emit
>> specially-formed recovery INSERTS for 'pg_statistic' in the 'binary-upgrade'
>> mode, thus allowing us to restore saved stats after an upgrade.

> I think this is great. My only question for the list is how much does
> this dumping create new compatibility requirements between major
> versions.

That concern is exactly the reason why we never did this originally.
In particular, emitting raw INSERTs into pg_statistic is just plain
foolish; we have changed the rowtype of pg_statistic in the past and
are likely to do so again. At a minimum we would need such a facility
to be proof against addition of more statistic "slots" (more columns)
to pg_statistic.

While at Salesforce, I did some work on this problem, and came up
with code that emitted calls to functions that would insert data for
a single "slot". That is, the dump consisted of calls to functions
defined more or less like this:

pg_load_statistics_slot (table_oid regclass,
attr_name name,
slot_kind int,
slot_op oid,
slot_numbers float4[],
slot_values anyarray);

and there was another one to load the non-slot columns of a pg_statistic
entry. And of course there was code to emit such a dump, producing one
dump statement per occupied "slot" in pg_statistic plus one call to
the other function per pg_statistic row.

An API like this seems a good deal more future-proof than plain INSERTs.
Even if the underlying pg_statistic representation changes, such functions
could try to adapt the presented data to the new format, or at worst they
could be redefined as no-ops.

The major problem that I never had a very good solution for is how the
"load" function could validate that the presented data was actually valid
for the specified table attribute's data type and slot kind. The absolute
minimum that you would want it to do is to cross-check that the
slot_values array has the correct element datatype, because if it doesn't
the backend is just about certain to crash when it tries to use the data.
I think there are probably other assumptions that would need to be
validated to be safe, depending on the code that looks at each slot kind.
I have not found an answer other than the load function knowing all there
is to know about each defined slot kind; which seems like a maintenance
problem, not to mention breaking extensions (like PostGIS) that define
their own slot kinds.

Failing any good solution to that, we could probably get by by restricting
use of these functions to superusers (which is certainly where direct use
of INSERTs would have to stop). But that puts a pretty considerable crimp
in the usefulness of the stats dump/load process. Ideally, ordinary users
could use this facility to transfer statistics for their own tables, just
as they can use pg_dump ... but without adequate validity checking, it's
way too much of a security hazard to allow that.

I don't have access to this code anymore, but could probably persuade
Salesforce to donate it. Or we could just write something similar
from scratch.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2015-12-23 03:25:03 Re: Possible marginally-incompatible change to array subscripting
Previous Message Thomas Munro 2015-12-23 03:15:05 Re: Support for N synchronous standby servers - take 2