Re: Disable vacuuming to provide data history

From: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
To: marekmosiewicz(at)gmail(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Disable vacuuming to provide data history
Date: 2023-02-24 21:06:31
Message-ID: CADkLM=dt3se+oXEC6FZpTbj1ezaXnfyDm3XougO6mSG2KF69DQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Feb 23, 2023 at 6:04 AM <marekmosiewicz(at)gmail(dot)com> wrote:

> Hey,
>
> It depnends on scenario, but there is many use cases that hack data
> change from somebody with admin privileges could be disaster.
> That is the place where data history could come with help. Some basic
> solution would be trigger which writes previous version of record
> to some other table. Trigger however can be disabled or removed (crazy
> solution would be to provide pernament
> triggers and tables which can only be pernamently inserted).
> Then we have also possibility to modify tablespace directly on disk.
>
> But Postgres has ability to not override records when two concurrent
> transaction modify data to provide MVCC.
>
> So what about pernamently not vacuumable tables. Adding some xid log
> tables with hash of record on hash on previous hash.
> I think that would be serious additional advantage for best open source
> relational databes.
>
> Best regards,
> Marek Mosiewicz
>

What you are describing sounds like the "system versioning" flavor of
"temporal" tables. It's a part of the SQL Standard, but PostgreSQL has yet
to implement it in core. Basically, every row has a start_timestamp and
end_timestamp field. Updating a row sets the end_timestamp of the old
version and inserts a new one with a start_timestamp matching the
end-timestamp of the previous row. Once a record has a non-null [1]
end_timestamp, it is not possible to update that row via SQL. Regular SQL
statements effectively have a "AND end_timestamp IS NULL" filter on them,
so the old rows are not visible without specifically invoking temporal
features to get point-in-time queries. At the implementation level, this
probably means a table with 2 partitions, one for live rows all having null
end_timestamps, and one for archived rows which is effectively append-only.

This strategy is common practice for chain of custody and auditing
purposes, either as a feature of the RDBMS or home-rolled. I have also seen
it used for developing forecasting models (ex "what would this model have
told us to do if we had run it a year ago?").

A few years ago, I personally thought about implementing a hash-chain
feature, but my research at the time concluded that:

* Few customers were interested in going beyond what was required for
regulatory compliance
* Once compliant, any divergence from established procedures, even if it
was an unambiguous improvement, only invited re-examination of it and
adjacent procedures, and they would avoid that
* They could get the same validation by comparing against a secured backup
and out-of-band audit "logs" (most would call them "reports")
* They were of the opinion that if a bad actor got admin access, it was
"game over" anyway

The world may have changed since then, but even if there is now interest, I
wonder if that isn't better implemented at the OS level rather than the
RDBMS level.

[1] some implementations don't use null, they use an end-timestamp set to
a date implausibly far in the future ( 3999-12-31 for example ), but the
concept remains that once the column is set to a real timestamp, the row
isn't visible to update statements.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2023-02-24 21:07:33 Re: BRIN indexes vs. SK_SEARCHARRAY (and preprocessing scan keys)
Previous Message Imseih (AWS), Sami 2023-02-24 20:54:00 Doc update for pg_stat_statements normalization