| From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
|---|---|
| To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
| Cc: | shveta malik <shveta(dot)malik(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Proposal: Conflict log history table for Logical Replication |
| Date: | 2025-12-02 07:08:01 |
| Message-ID: | CAFiTN-u3+zRGPESP5kUUfa6NxaWh1HL-gd1225KJ0Uvzi1urow@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Tue, Dec 2, 2025 at 12:06 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Tue, Dec 2, 2025 at 11:38 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Mon, Dec 1, 2025 at 10:11 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
> > > The specific scenario we are discussing is when a single row from the
> > > publisher attempts to apply an operation that causes a conflict across
> > > multiple unique keys, with each of those unique key violations
> > > conflicting with a different local row on the subscriber, is very
> > > rare. IMHO this low-frequency scenario does not justify
> > > overcomplicating the design with an array field or a multi-level
> > > table.
> > >
> >
> > I did some analysis and search on the internet to answer your
> > following two questions.
> >
> > > Consider the infrequency of the root causes:
> > > - How often does a table have more than 3 to 4 unique keys?
> >
> > It is extremely common—in fact, it is considered the industry "best
> > practice" for modern database design.
> >
> > One can find this pattern in almost every enterprise system (e.g.
> > banking apps, CRMs). It relies on distinguishing between Technical
> > Identity (for the database) and Business Identity (for the real
> > world).
> >
> > 1. The Design Pattern: Surrogate vs. Natural Keys
> > Primary Key (Surrogate Key): Usually a meaningless number (e.g.,
> > 10452) or a UUID. It is used strictly for the database to join tables
> > efficiently. It never changes.
> > Unique Key (Natural Key): A real-world value (e.g., john(at)email(dot)com or
> > SSN-123). This is how humans or external systems identify the row. It
> > can change (e.g., someone updates their email).
> >
> > 2. Common Real-World Use Cases
> > A. User Management (The most classic example)
> > Primary Key: user_id (Integer). Used for foreign keys in the ORDERS table.
> > Unique Key 1: email (Varchar). Prevents two people from registering
> > with the same email.
> > Unique Key 2: username (Varchar). Ensures unique display names.
> > Why? If a user changes their email address, you only update one field
> > in one table. If you used email as the Primary Key, you would have to
> > update millions of rows in the ORDERS table that reference that email.
> >
> > B. Inventory / E-Commerce
> > Primary Key: product_id (Integer). Used internally by the code.
> > Unique Key: SKU (Stock Keeping Unit) or Barcode (EAN/UPC).
> > Why? Companies often re-organize their SKU formats. If the SKU was the
> > Primary Key, a format change would require a massive database
> > migration.
> >
> > C. Government / HR Systems
> > Primary Key: employee_id (Integer).
> > Unique Key: National_ID (SSN, Aadhaar, Passport Number).
> > Why? Privacy and security. You do not want to expose a National ID in
> > every URL or API call (e.g., api/employee/552 is safer than
> > api/employee/SSN-123).
> >
> > > - How frequently would each of these keys conflict with a unique row
> > > on the subscriber side?
> > >
> >
> > It can occur with medium-to-high probability in following cases. (a)
> > In Bi-Directional replication systems; for example, If two users
> > create the same "User Profile" on two different servers at the same
> > time, the row will conflict on every unique field (ID, Email, SSN)
> > simultaneously. (b) The chances of bloat are high, on retrying to fix
> > the error as mentioned by Shveta. Say, if Ops team fixes errors by
> > just "trying again" without checking the full row, you will hit the ID
> > error, fix it, then immediately hit the Email error. (c) The chances
> > are medium during initial data-load; If a user is loading data from a
> > legacy system with "dirty" data, rows often violate multiple rules
> > (e.g., a duplicate user with both a reused ID and a reused Email).
> >
> > > If resolving this occasional, synthetic conflict requires inserting
> > > two or three rows instead of a single one, this is an acceptable
> > > trade-off considering how rare it can occur.
> > >
> >
> > As per above analysis and the re-try point Shveta raises, I don't
> > think we can ignore the possibility of data-bloat especially for this
> > multiple_unique_key conflict. We can consider logging multiple local
> > conflicting rows as JSON Array.
>
> Okay, I will try to make multiple local rows as JSON Array in the next version.
>
Just to clarify so that we are on the same page, along with the local
tuple the other local fields like local_xid, local_commit_ts,
local_origin will also be converted into the array. Hope that makes
sense?
So we will change the table like this, not sure if this makes sense to
keep all local array fields nearby in the table, or let it be near the
respective remote field, like we are doing now remote_xid and local
xid together etc.
Column | Type | Collation | Nullable | Default
-------------------+--------------------------+-----------+----------+---------
relid | oid | | |
schemaname | text | | |
relname | text | | |
conflict_type | text | | |
local_xid | xid[] | | |
remote_xid | xid | | |
remote_commit_lsn | pg_lsn | | |
local_commit_ts | timestamp with time zone[] | | |
remote_commit_ts | timestamp with time zone | | |
local_origin | text[] | | |
remote_origin | text | | |
key_tuple | json | | |
local_tuple | json[] | | |
remote_tuple | json | | |
--
Regards,
Dilip Kumar
Google
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Michael Paquier | 2025-12-02 07:09:58 | Re: Refactor how we form HeapTuples for CatalogTuple(Insert|Update) |
| Previous Message | Ashutosh Bapat | 2025-12-02 06:56:15 | Re: SQL Property Graph Queries (SQL/PGQ) |