Re: Loading 500m json files to database

From: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
To: pinker <pinker(at)onet(dot)eu>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Loading 500m json files to database
Date: 2020-03-24 01:11:28
Message-ID: CAKFQuwbhc02XWAJ9DNyao5M1EUf=ECxMQX0U+RftgAeDJNmKTw@mail.gmail.com
Lists: pgsql-general

On Mon, Mar 23, 2020 at 3:24 AM pinker <pinker(at)onet(dot)eu> wrote:

> time for i in datafiles/*; do
> psql -c "\copy json_parts(json_data) FROM $i"&
> done
>

Don't know whether this is faster, but it does avoid spinning up a new
connection for every file.

#bash, linux
function append_each_split_file_to_etl_load_script() {
    for filetoload in ./*; do
        ronumber="$(basename "$filetoload")"
        # only process files since subdirs can be present
        if [[ -f "$filetoload" ]]; then
            echo ""
            # emits:  \set invoice `cat '<file>'`
            # psql's backtick evaluation reads the file's contents into :invoice
            echo "\set invoice"' `cat '"'""$filetoload""'"'`'
            # emits one row of the multi-row VALUES list:  , ('<file name>',:'invoice')
            echo ", ('$ronumber',:'invoice')"
        fi >> "$PSQLSCRIPT"
    done

    # terminate the multi-row INSERT
    echo "" >> "$PSQLSCRIPT"
    echo ";" >> "$PSQLSCRIPT"
    echo "" >> "$PSQLSCRIPT"
}
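
For two hypothetical files named 10001.json and 10002.json, the lines the
function appends to $PSQLSCRIPT would look like this (the INSERT header
comes from elsewhere; the trailing ";" is written by the function):

\set invoice `cat './10001.json'`
, ('10001.json',:'invoice')

\set invoice `cat './10002.json'`
, ('10002.json',:'invoice')

;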

There is a bit of other related code needed (for my specific usage) but this
is the core of it. Use psql variables to capture the contents of each file
into a variable and then just perform a normal insert (specifically, a
multi-row VALUES (...), (...) variant). Since you can intermix psql
meta-commands and SQL you basically output one bloody long script - which has
memory issues at scale, but you can divide and conquer - and then run "psql
--file bloody_long_script_part_1_of_100000.psql".
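
A rough sketch of that surrounding glue (the table and column names, the seed
row, and the script name below are placeholders for illustration, not the
exact code I use):

#bash, linux
PSQLSCRIPT="bloody_long_script_part_1_of_100000.psql"

# Each row the function appends starts with a comma, so the header supplies
# the INSERT ... VALUES line plus a first row for those commas to attach to;
# here that first row is a throwaway seed row.
cat > "$PSQLSCRIPT" <<'EOF'
INSERT INTO json_parts (ronumber, json_data) VALUES
('seed', '{}')
EOF

# Append one \set plus one VALUES row per file; the function (defined above)
# also writes the closing ";".
append_each_split_file_to_etl_load_script

# Load everything over a single connection; connection settings come from the
# usual PG* environment variables / .pgpass.
psql --file "$PSQLSCRIPT"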

David J.
