| From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
|---|---|
| To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
| Cc: | shveta malik <shveta(dot)malik(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Proposal: Conflict log history table for Logical Replication |
| Date: | 2025-12-02 06:36:37 |
| Message-ID: | CAFiTN-vq50N3QP9p3_SH+tJ8Pn=uRDb0X4qEcQZYcGW9AX88rQ@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Tue, Dec 2, 2025 at 11:38 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, Dec 1, 2025 at 10:11 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > The specific scenario we are discussing is when a single row from the
> > publisher attempts to apply an operation that causes a conflict across
> > multiple unique keys, with each of those unique key violations
> > conflicting with a different local row on the subscriber, is very
> > rare. IMHO this low-frequency scenario does not justify
> > overcomplicating the design with an array field or a multi-level
> > table.
> >
>
> I did some analysis and search on the internet to answer your
> following two questions.
>
> > Consider the infrequency of the root causes:
> > - How often does a table have more than 3 to 4 unique keys?
>
> It is extremely common—in fact, it is considered the industry "best
> practice" for modern database design.
>
> One can find this pattern in almost every enterprise system (e.g.
> banking apps, CRMs). It relies on distinguishing between Technical
> Identity (for the database) and Business Identity (for the real
> world).
>
> 1. The Design Pattern: Surrogate vs. Natural Keys
> Primary Key (Surrogate Key): Usually a meaningless number (e.g.,
> 10452) or a UUID. It is used strictly for the database to join tables
> efficiently. It never changes.
> Unique Key (Natural Key): A real-world value (e.g., john(at)email(dot)com or
> SSN-123). This is how humans or external systems identify the row. It
> can change (e.g., someone updates their email).
>
> 2. Common Real-World Use Cases
> A. User Management (The most classic example)
> Primary Key: user_id (Integer). Used for foreign keys in the ORDERS table.
> Unique Key 1: email (Varchar). Prevents two people from registering
> with the same email.
> Unique Key 2: username (Varchar). Ensures unique display names.
> Why? If a user changes their email address, you only update one field
> in one table. If you used email as the Primary Key, you would have to
> update millions of rows in the ORDERS table that reference that email.
>
> B. Inventory / E-Commerce
> Primary Key: product_id (Integer). Used internally by the code.
> Unique Key: SKU (Stock Keeping Unit) or Barcode (EAN/UPC).
> Why? Companies often re-organize their SKU formats. If the SKU was the
> Primary Key, a format change would require a massive database
> migration.
>
> C. Government / HR Systems
> Primary Key: employee_id (Integer).
> Unique Key: National_ID (SSN, Aadhaar, Passport Number).
> Why? Privacy and security. You do not want to expose a National ID in
> every URL or API call (e.g., api/employee/552 is safer than
> api/employee/SSN-123).
>
> > - How frequently would each of these keys conflict with a unique row
> > on the subscriber side?
> >
>
> It can occur with medium-to-high probability in following cases. (a)
> In Bi-Directional replication systems; for example, If two users
> create the same "User Profile" on two different servers at the same
> time, the row will conflict on every unique field (ID, Email, SSN)
> simultaneously. (b) The chances of bloat are high, on retrying to fix
> the error as mentioned by Shveta. Say, if Ops team fixes errors by
> just "trying again" without checking the full row, you will hit the ID
> error, fix it, then immediately hit the Email error. (c) The chances
> are medium during initial data-load; If a user is loading data from a
> legacy system with "dirty" data, rows often violate multiple rules
> (e.g., a duplicate user with both a reused ID and a reused Email).
>
> > If resolving this occasional, synthetic conflict requires inserting
> > two or three rows instead of a single one, this is an acceptable
> > trade-off considering how rare it can occur.
> >
>
> As per above analysis and the re-try point Shveta raises, I don't
> think we can ignore the possibility of data-bloat especially for this
> multiple_unique_key conflict. We can consider logging multiple local
> conflicting rows as JSON Array.
Okay, I will try to make multiple local rows as JSON Array in the next version.
--
Regards,
Dilip Kumar
Google
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Kirill Reshke | 2025-12-02 06:37:08 | Re: Allow GUC settings in CREATE SUBSCRIPTION CONNECTION to take effect |
| Previous Message | Dilip Kumar | 2025-12-02 06:31:59 | Re: Parallel Apply |