Re: Loading 500m json files to database

From: Rob Sargent <robjsargent(at)gmail(dot)com>
To: Andrei Zhidenkov <andrei(dot)zhidenkov(at)n26(dot)com>
Cc: Ertan Küçükoğlu <ertan(dot)kucukoglu(at)1nar(dot)com(dot)tr>, pinker <pinker(at)onet(dot)eu>, Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Loading 500m json files to database
Date: 2020-03-23 13:31:00
Message-ID: 5944D9A7-C0AA-4F7E-992A-B6A78C7B9D9E@gmail.com
Lists: pgsql-general

> On Mar 23, 2020, at 5:59 AM, Andrei Zhidenkov <andrei(dot)zhidenkov(at)n26(dot)com> wrote:
>
> Try writing a stored procedure (probably PL/Python) that accepts an array of JSON objects, so the data can be loaded in chunks (100-1000 files at a time), which should be faster.
>
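
A minimal sketch of that idea, assuming the json_parts(json_data) table from the original post and a jsonb column (function and parameter names are made up, untested):

CREATE OR REPLACE FUNCTION load_json_chunk(chunk jsonb[])
RETURNS void
LANGUAGE sql
AS $$
    -- one INSERT per chunk of 100-1000 documents instead of one psql call per file
    INSERT INTO json_parts (json_data)
    SELECT unnest(chunk);
$$;

-- the client builds an array per batch and calls, e.g.:
SELECT load_json_chunk(ARRAY['{"n": 1}'::jsonb, '{"n": 2}'::jsonb]);
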
>>> On 23. Mar 2020, at 12:49, Ertan Küçükoğlu <ertan(dot)kucukoglu(at)1nar(dot)com(dot)tr> wrote:
>>>
>>>
>>>> On 23 Mar 2020, at 13:20, pinker <pinker(at)onet(dot)eu> wrote:
>>>
>>> Hi, do you maybe have an idea how to make the loading process faster?
>>>
>>> I have 500 million JSON files (one JSON document per file) that I need to
>>> load into the db.
>>> My test set is "only" 1 million files.
>>>
>>> What I have come up with so far is:
>>>
>>> time for i in datafiles/*; do
>>> psql -c "\copy json_parts(json_data) FROM $i"&
>>> done
>>>
>>> which is the fastest approach so far, but it's not what I expected. Loading 1 million
>>> files takes me ~3h, so loading 500 times more is just unacceptable.
>>>
>>> some facts:
>>> * the target db is in the cloud, so there is no option for tricks like turning
>>> fsync off
>>> * the version is PostgreSQL 11
>>> * I can spin up a huge Postgres instance if necessary in terms of CPU/RAM
>>> * I already tried hash partitioning (writing to 10 different tables instead
>>> of 1)
>>>
>>>
>>> Any ideas?
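
Most of those 3 hours is likely connection and transaction overhead from starting one psql per file. If each file really holds a single line of JSON, one thing to try is streaming many files into a single COPY -- a rough, untested sketch using the same json_parts(json_data) table:

# assumes every file is exactly one line of JSON with no raw tabs or
# backslashes (COPY's text format is line-oriented)
find datafiles -type f -print0 | xargs -0 cat \
    | psql -c "\copy json_parts(json_data) from pstdin"

A single COPY stream won't parallelize by itself, but you could run one such pipeline per hash partition, which fits what you already tried.
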
>> Hello,
>>
>> I may not be knowledgeable enough to answer your question.
>>
>> However, if possible, you could consider using a local physical computer to do all the loading and then do a backup/restore onto the cloud system.
>>
>> A compressed backup will be far less internet traffic than direct data inserts.
>>
>> Moreover, you can apply the additional tricks you mentioned.
>>
>> Thanks & regards,
>> Ertan
>>
>>
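
If you go that route, the transfer can be a compressed dump restored in parallel -- roughly like this (database names and the connection string are placeholders):

pg_dump -Fc -t json_parts -d local_db -f json_parts.dump
pg_restore -d "host=cloud.example.com dbname=target" -j 8 json_parts.dump
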

Drop any and all indices on the target table before the load and recreate them afterwards.
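
For example (assuming json_data is jsonb; the index name and definition here are made up):

DROP INDEX IF EXISTS json_parts_json_data_idx;
-- ... bulk load ...
CREATE INDEX json_parts_json_data_idx ON json_parts USING gin (json_data);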

