Re: BUG #13846: INSERT ON CONFLICT consumes sequencers onconflicts

From: Paul <paul(at)salesintel(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #13846: INSERT ON CONFLICT consumes sequencers onconflicts
Date: 2016-01-06 21:00:14
Message-ID: 568d8060.291cb60a.e8a06.ffff9d00@mx.google.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

To be clear, the column defaulted to a sequencer is not at all involved in determining conflicts, and therefore it should not matter if it is used in a unique index, nor would there be any problems with a BEFORE trigger.

Gaps aren’t any kind of issue, and it’s not about how popular a related table is to the table upon which the UPSERT would be applied.

For example, I might be receiving event data from a shop floor where everything is instrumented; tank levels, step counters, etc. There could be millions of events over a relatively short period of time. I might have a field in the event data which indicates the ‘class’ of device that generated the event; ex: ‘pump’, ‘actuator’, ‘regulator’, ‘thermometer’. There may only be less than a few hundred classes of devices that rarely change. So I have a table, with a surrogate SMALLINT key defaulted to a sequence, and a ‘name’ column which is a device class name.

As the raw event data is processed, I want to do an UPSERT into the device class table, using only the class name to resolve conflicts, to get it’s surrogate key, which is then used while inserting the related event record linking that event record to the particular device class. Since device classes are rarely added, it will be the common case that an actual INSERT is almost never performed.

Using INSERT ON CONFLICT in this case, however, would mean the device class table, which would only ever hold a few hundred records and would only require a SMALLINT for its surrogate key, would have to have a BIGINT for no good reason, which really isn’t a big deal. What IS the big deal is that the event table would also require a BIGINT for the device class column to relate the event to the device class. In most cases, there are many more ‘organizing’ columns, similar to the device class in this example, meaning BIGINTs would have to be used for no good reason within event tables, which can hold vast volumes of data. This can have a real and measurable impact on memory, network, and storage requirements, which eventually map to dollars.

This basic scenario is very common, and one that greatly benefits from UPSERT capability; i.e. linking event data to organizational data. It’s much less common to have to UPSERT into some event table; i.e. the INSERT part of UPSERTs is usually the path less taken. I used the example of a shop floor with device classes, but its the same issue for looking at orders by product hierarchies, commissions by territories, stars by classification, web-page-hits by marketing campaigns…. I could list hundreds of processes within many problem domains that are of this same basic pattern.

My only point is, there’s another great capability in Postgres, for doing efficient concurrent UPSERTS, but that one of the most common and prime use cases for it is going to cost more money from having to use 64bit numbers everywhere for no good reason.

From: Andres Freund
Sent: Wednesday, January 6, 2016 11:19 AM
To: Paul Hester
Cc: Peter Geoghegan; pgsql-bugs
Subject: Re: [BUGS] BUG #13846: INSERT ON CONFLICT consumes sequencers onconflicts

On 2016-01-05 22:04:41 -0700, Paul Hester wrote:
> No peek-ahead necessary, when the sequenced column would not be used in the
> conflict_target. UPSERTS to dimension tables in a star schema as part of an
> OLAP system are very handy, but typically the records in the dimension
> tables use a surrogate key based on sequences for maintenance & performance
> reasons, and the 'natural' key/value is the only column that would be used
> to detect a conflict. In this case, the 'normal' path would be the
> conflict_action nearly all the time, and not the insert. If the typical
> path was the INSERT, I'd bet the data would be from some event source, and
> I guess in that case if there were a key column defaulted with a sequence,
> like an order#, it would have to be used to resolve conflicts. But it's
> hard to imagine a case in a process flow where you wouldn't already know if
> the event existed or not, limiting the need for an UPSERT on event data.

It'd be very fragile. You could only do that if the value isn't involved
in any unique indexes, if there are no BEFORE triggers (since they need
to see the value).

> This limitation, of consuming sequencers used for defaults on columns not
> necessary to resolve conflict, diminishes the viability for using Postges
> UPSERTS for large data warehouses, is all I'm saying (or requires surrogate
> keys being 64 bit). Just caught me by surprise in comparison to other
> RDBMSs offering some form of UPSERTs that would not consume a sequencer if
> its values weren't required to resolve conflicts.

If you use normal sequences you already need to cope with gaps in
sequences. And ids needing to be 64bits if you're a longer lived
business and it's for a a halfway "popular" table, isn't something new.

So I don't think addressing this is worth adding code for, especially
given the fragility of the situations where it'd be usable.

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Peter Geoghegan 2016-01-06 21:13:14 Re: BUG #13846: INSERT ON CONFLICT consumes sequencers onconflicts
Previous Message eugeneymail 2016-01-06 20:42:50 BUG #13852: SQL Select Slow Issues