Re: Import large data set into a table and resolve duplicates?

From: Francisco Olarte <folarte(at)peoplecall(dot)com>
To: Eugene Dzhurinsky <jdevelop(at)gmail(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Import large data set into a table and resolve duplicates?
Date: 2015-02-16 13:01:51
Message-ID: CA+bJJbz5oxoq2GO62--ZpmrQc0cLaGQZgxnqUs7Wz4MydkgO0g@mail.gmail.com
Lists: pgsql-general

On Sun, Feb 15, 2015 at 7:11 PM, Eugene Dzhurinsky <jdevelop(at)gmail(dot)com>
wrote:

> On Sun, Feb 15, 2015 at 06:48:36PM +0100, Francisco Olarte wrote:
> > One thing that strikes me is you are either at the beginning of your
> > usage of this or you have A LOT of already present lines in the patch
> > ( I mean, the patch has one fourth of the lines of the dictionary ).
> > Which is the case?
>
> Right now I'm at the very beginning stage, yes, and I expect a "cache
> miss" ratio for the dictionary of at least 70%, which will eventually
> drop down to 5-10%.
>

Well, that gives one important data point ( the other being the expected
median/average size of your series entries ).

And this means the size of the problem is completely different, so the
solutions should be too. I mean, at a 70% miss rate with a 50k-line patch
you are expecting 35k insertions, and you still expect 2.5k / 5k
insertions once it drops. Not knowing any other number, I'll urge you to
DISCARD any of my proposed solutions ( e.g., if you have 20 runs at 35k
followed by 100 runs at 5k you will end up with a
200 + 20*35 + 100*5 = 200 + 700 + 500 = 1400k row table receiving 5k
updates from a 50k file, which is a completely different problem than the
original ). Not having any expected problem size, I will not propose
anything more.

> > When doing bulk operations ( like this, with a record count of 25% of
> > the whole file ) indexing is nearly always slower than sorting and
> > merging. As
> .....
>

> So you suggest taking this off Postgres?

As stated above, I do not presently suggest anything, as I'm unable to do
so without further knowledge of the problem ( which, apart from the great
unknown of the problem size, would also require knowing how the dictionary
is going to be used, among other things ).

> That's interesting. Simply put, I'll do a dump of the dictionary, sorted
> by series, to some file. Then sort the patch file by series. Then merge
> the dictionary (left) and the patch (right) files, and during the merge,
> if a (right) line doesn't have a corresponding (left) line, put a nextval
> expression for the sequence as its ID. Then truncate the existing
> dictionary table and COPY the data from the merged file into it.
>

I would not truncate the dictionary, just put INSERT ... RETURNING
expressions in the data file. Anyway, I would not presently suggest doing
that; with the data I have, its performance is a great unknown. With the
original problem size it can be made highly performant, and I know batch
processes ( sort, merge, etc. ) easily beat "interactive" ones ( index
lookup, etc. ) throughput-wise ( and so total-time-wise ) at some sizes,
but you need to know the problem.
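
For illustration only, each line of such a generated file could look like
this ( guessing a dictionary(id serial, series text) schema and a series
value, neither of which is spelled out in the thread ):

    -- one generated statement per series the file merge found missing;
    -- dictionary(id serial, series text) is only my guess at the schema
    INSERT INTO dictionary (series)
        VALUES ('some-new-series') RETURNING id;

Feeding a file of those through psql and capturing the RETURNING output
gives you the new ids without touching the rows that are already there.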

Also, when doing this, it depends on who updates the table and how. If
only the patches do it, you can keep the table on the server plus a sorted
copy of it outside, and use the sorted copy plus the patch to prepare a
set of insert-returnings, which you then process and use ( their return
values ) to generate a new sorted copy. This is normally very fast, it can
be made robust and, most important, it does not hit the server too much:
you use the sorted copy plus the patch to generate a 'to_insert' file,
then go to the server, COPY that into a to_insert temp table, insert it
into dict with RETURNING, putting the result ( with ids ) into an
'inserted' temp table, then copy that table out, and with the copy out
( which you made already sorted, or just sort again ) you patch the
dictionary copy. If you ever think you've lost sync, you just copy the
dictionary out again and sort it. This is batch-processing 101, but you
need knowledge of the problem; with the current data, I cannot recommend
anything.
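
Just to make those steps concrete, a rough psql sketch; the table and
column names ( dictionary(id, series), to_insert, inserted ) and the file
names are guesses of mine, and the client-side sort/merge is assumed to
have already produced a file of the missing series, one per line:

    -- 1. load the pre-merged list of missing series into a temp table
    CREATE TEMP TABLE to_insert (series text);
    \copy to_insert (series) from 'to_insert.txt'

    -- 2. insert them into the dictionary and keep the generated ids
    CREATE TEMP TABLE inserted (LIKE dictionary);
    WITH ins AS (
        INSERT INTO dictionary (series)
        SELECT series FROM to_insert
        RETURNING id, series
    )
    INSERT INTO inserted (id, series) SELECT id, series FROM ins;

    -- 3. copy the new rows back out, sorted, to patch the local copy
    \copy (SELECT id, series FROM inserted ORDER BY series) to 'new_rows.txt'

Outside the database you then merge new_rows.txt into the sorted
dictionary copy, and you are ready for the next patch.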

I've been doing that sort of thing for decades, and it works quite well,
but I normally only bother doing it for "big" tasks, meaning multi-hour at
least. The thing is, indexes seem quite fast because they give you each
row very fast, while the sort spends hours preparing; but after that the
sorted rows go in really fast, much faster than with the index, and the
sort finishes first. As I say, it depends on the problem. If you have a
5-minute runtime doing straight updates/inserts ( not unreasonable for
200k/50k ) to be done daily, it does not make sense to make it more
complex just to cut the runtime to 15 seconds. OTOH, if you have a weekly
20-hour process which you can cut to 6 hours to fit it into a nightly
window, it does ( and I'm not exaggerating: in the early 90s, changing
from indexes to sorting and batching allowed me to cut a 4.5 DAY process
to 8 hours; it did not seem to do anything for the first 7 hours, but then
it went really fast ).

Anyway, design it according to your knowledge of the problem.

Regards.
Francisco Olarte.

> Is that what you meant?
>
> Thank you!
>
> --
> Eugene Dzhurinsky
>
