Bulk Insert/Update Scenario

From: Mana M <manan(dot)gcs(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Bulk Insert/Update Scenario
Date: 2018-01-04 19:39:41
Message-ID: CAJAiPv7CPhjmtExiv0e6F_DBAbOUBGmM=ViGsZdSeJE7c+fc=A@mail.gmail.com
Lists: pgsql-general

I am setting up a data processing pipeline whose end result gets stored in
Postgres. I have not been a heavy database user in general and had a question
about how best to handle a bulk insert/update scenario with Postgres.

Here is my use case:
* I periodically get a file with thousands of entries (lines).
* I process each entry (line) from the file, and the data is split and stored
across different Postgres tables. Some tables have foreign keys referencing
other tables. There is "no" straight mapping from an entry in the file to the
Postgres tables.
* Some data could be updates to existing rows in the Postgres tables, while
other data could be inserts.
* I would like to ensure atomicity (either all rows get stored in all tables,
or nothing gets stored if Postgres reports a failure).
* I would also like to make sure there are no concurrency issues in case two
different processes try to do the above at the same time.
* Ideally, I would want to avoid issuing an individual upsert after processing
every single entry (line) from the file (a rough sketch of what I mean follows
below).
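
For concreteness, this is roughly the kind of approach I have in mind: parse
the whole file first, then send the rows as one batched INSERT ... ON CONFLICT
inside a single transaction via psycopg2. This is only a minimal sketch; the
table name "items", its columns, and the connection string are placeholders,
not my actual schema.

    # Minimal sketch: batched upsert in one transaction (psycopg2).
    # "items" and its columns are placeholder names, not the real schema.
    import psycopg2
    import psycopg2.extras

    # Rows produced by parsing one input file: (item_id, name, quantity).
    rows = [
        (1, "widget", 10),
        (2, "gadget", 5),
    ]

    conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
    try:
        with conn:  # one transaction: commit on success, rollback on error
            with conn.cursor() as cur:
                # Batch the upsert instead of one round trip per line.
                psycopg2.extras.execute_values(
                    cur,
                    """
                    INSERT INTO items (item_id, name, quantity)
                    VALUES %s
                    ON CONFLICT (item_id)
                    DO UPDATE SET name = EXCLUDED.name,
                                  quantity = EXCLUDED.quantity
                    """,
                    rows,
                )
    finally:
        conn.close()

Is this the right general direction, or is there a better pattern (e.g. COPY
into a staging table first) for this kind of workload?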

I thought this would be a fairly common use case. What is the best way to
handle the above? What performance issues should I keep in mind, and what are
the pitfalls? I tried looking around for articles on such a use case; any
pointers would be greatly appreciated.

By the way, the application is written in Python, runs in Apache Spark, and
can use any Python libraries that help simplify the above.

Thanks in advance.
