Re: Loading 500m json files to database

From: Christopher Browne <cbbrowne(at)gmail(dot)com>
To: pinker <pinker(at)onet(dot)eu>
Cc: PostgreSQL Mailing Lists <pgsql-general(at)postgresql(dot)org>
Subject: Re: Loading 500m json files to database
Date: 2020-03-23 14:12:08
Message-ID: CAFNqd5UWCrnh6o4R1-sQ=Kr-HLoK0M5d16HJ=y-9vM+Je-7oqQ@mail.gmail.com
Lists: pgsql-general

On Mon, 23 Mar 2020 at 06:24, pinker <pinker(at)onet(dot)eu> wrote:

> Hi, do you have maybe idea how to make loading process faster?
>
> I have 500 millions of json files (1 json per file) that I need to load to
> db.
> My test set is "only" 1 million files.
>
> What I came up with now is:
>
> time for i in datafiles/*; do
> psql -c "\copy json_parts(json_data) FROM $i"&
> done
>
> which is the fastest so far. But it's not what i expect. Loading 1m of data
> takes me ~3h so loading 500 times more is just unacceptable.
>
> some facts:
> * the target db is on cloud so there is no option to do tricks like turning
> fsync off
> * version postgres 11
> * i can spin up huge postgres instance if necessary in terms of cpu/ram
> * i tried already hash partitioning (to write to 10 different tables
> instead
> of 1)
>
>
> Any ideas?
>

Well, you're paying for a lot of overhead in that: for each file you're
launching a psql command, connecting to the database, spawning a backend
process, starting a transaction, committing the transaction, closing the
backend process, disconnecting from the database, and cleaning up after the
psql command. And you're doing that 500 million times.

The one thing I left off that list was the loading of a single tuple into
json_parts.
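
To get a sense of how much of that ~3 hours is pure per-invocation overhead,
one rough check (my suggestion, not something you've measured yet) is to time
a psql round trip that does no real work:

  # Roughly measures the fixed cost of one psql invocation: process launch,
  # connection setup, backend start, one trivial transaction, and teardown.
  # A one-row \copy is unlikely to take much longer than this.
  time psql -c "SELECT 1" > /dev/null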

What you could do to improve things quite a lot would be to group some number
of those files together, so that each time you pay for the overhead, you at
least get the benefit of loading several entries into json_parts.

So, loosely, I'd recommend using /bin/cat (or similar) to assemble several
files into one, and then \copy that one file in.
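
Something like this rough sketch (untested against your data; it assumes each
file holds exactly one JSON document on a single newline-terminated line, with
no tab or backslash characters that COPY's text format would need escaped):

  #!/bin/sh
  # Load files in batches of 1000 rather than one per connection.
  BATCH=1000
  TMP=$(mktemp)
  set --                        # positional parameters hold the current batch
  for f in datafiles/*; do
      set -- "$@" "$f"
      if [ "$#" -ge "$BATCH" ]; then
          cat "$@" > "$TMP"     # assemble one batch file...
          psql -c "\copy json_parts(json_data) FROM '$TMP'"   # ...and load it in one go
          set --
      fi
  done
  if [ "$#" -gt 0 ]; then       # flush the final, partial batch
      cat "$@" > "$TMP"
      psql -c "\copy json_parts(json_data) FROM '$TMP'"
  fi
  rm -f "$TMP"

The batch size of 1000 is just a placeholder, per the numbers below.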

Having 2 tuples loaded at once drops overhead by 50%
Having 10 tuples loaded at once drops overhead by 90%
Having 100 tuples loaded at once drops overhead by 99%
Having 1000 tuples loaded at once drops overhead by 99.9%

There probably isn't too much real value in going past 1000 tuples per batch;
by that point, the overhead is getting pretty immaterial.

Reducing that overhead is the single most important thing you can do.

It is also quite likely that you could run such streams in parallel, although
it would require quite a bit more information about the I/O capabilities of
your hardware to know if that would do any good.
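
For what it's worth, one way to sketch that parallelism (my own formulation,
using xargs -P, with the same one-JSON-per-line assumptions as above):

  # Run up to 4 batch loads at a time, 1000 files per \copy; pstdin tells
  # \copy to read from psql's standard input, i.e. the pipe from cat.
  find datafiles -type f -print0 \
    | xargs -0 -n 1000 -P 4 sh -c \
        'cat "$@" | psql -c "\copy json_parts(json_data) FROM pstdin"' _

The -P 4 is just a placeholder; the right degree of parallelism depends on
those I/O numbers.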
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"
