Re: Improving performance of merging data between tables

From: Andy Colson <andy(at)squeakycode(dot)net>
To: Pawel Veselov <pawel(dot)veselov(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Improving performance of merging data between tables
Date: 2014-12-30 14:12:02
Message-ID: 54A2B2B2.9050706@squeakycode.net
Lists: pgsql-general

On 12/29/2014 11:29 PM, Pawel Veselov wrote:
>
> Andy,
>
> thanks for looking into this.
>
> On Mon, Dec 29, 2014 at 9:00 AM, Andy Colson <andy(at)squeakycode(dot)net> wrote:
>
> On 12/28/2014 3:49 PM, Pawel Veselov wrote:
>
> Hi.
>
> I was wondering if anybody would have any ideas on how to improve
> certain operations that we are having.
>
> <SNIP>
>
>
> Besides "can somebody please look at this and let me know if I'm doing
> something utterly stupid", here are my questions.
>
> 1) How do I find out what exactly is consuming the CPU in a PL/pgSQL
> function? All I see is that the calls to merge_all() function take long
> time, and the CPU is high while this is going on.
>
>
>
> First, I'll admit I didn't read your entire post.
>
> I can think of a couple methods:
>
> 1) try each of the statements in merge_all by hand with an "explain analyze" in front to see which is slow. Look for things that hit big tables without an index. Check that foreign key lookups use indexes.
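
For example (table and column names made up), something like:

  explain analyze
  update main_table set total = total + 5 where rule_id = 42;

That prints the actual plan with per-node timings, so a sequential scan on a big table stands out right away.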
>
>
> If I didn't miss anything, that seems to be OK, even on function-based queries.
>
> 2) try pg_stat_statements, setting "pg_stat_statements.track = all". see:
> http://www.postgresql.org/docs/9.4/static/pgstatstatements.html
>
> I have used this to profile some functions, and it worked pretty well. Mostly I use it on a test box, but I once ran it on the live server, which was scary, but it worked great.
>
>
> That looks promising. Turned it on, waiting until I can restart the server at the next "quiet time".
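
Once it's loaded (it has to go in shared_preload_libraries, hence the restart), something like this shows the worst offenders:

  create extension pg_stat_statements;

  select query, calls, total_time, rows
  from pg_stat_statements
  order by total_time desc
  limit 10;

With track = all, the statements run inside your functions show up individually, not just the top-level call.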
>
> 3) try auto-explain:
> http://www.postgresql.org/docs/9.4/static/auto-explain.html
>
> I've never used it, so I don't know if it'll show each statement inside a function. It dumps stuff to the log AFAIK, so you'll have to dig the info out by hand.
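
If you do try it, the settings go in postgresql.conf (or per session with SET); roughly:

  # load the module and log plans for anything slower than 100ms
  session_preload_libraries = 'auto_explain'
  auto_explain.log_min_duration = '100ms'
  auto_explain.log_nested_statements = on

log_nested_statements is the one that matters here -- without it you'd only see the top-level call to merge_all(), not the statements inside it.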
>
> > 2) Is there a better way to merge individual rows, except doing
> > UPDATE/INSERT in a loop, and would that be CPU expensive?
> >
>
> Not that I know of. I use pretty much the same thing. Soon we will have merge/upsert support! Hopefully it'll be fast.
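
The pattern I use is basically the example from the docs -- loop until either the update or the insert wins (table and column names here are made up):

  create function merge_row(k integer, d integer) returns void as $$
  begin
      loop
          update main set data = data + d where key = k;
          if found then return; end if;
          -- not there yet; try to insert, another process may beat us to it
          begin
              insert into main (key, data) values (k, d);
              return;
          exception when unique_violation then
              -- lost the race, loop back and do the update instead
          end;
      end loop;
  end;
  $$ language plpgsql;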
>
>
> Well, any time I cancelled the backend that was executing this whole mess, it would always stop at UPDATE ... SET ... WHERE on the main table. Which does make me believe that a bulk update would really help.
>
> > 3) Is there a better way to merge whole tables? However, note that I
> > need to translate primary keys from node main table into the common main
> > table, as they are used as foreign keys, hence the loops. I suspect the
> > looping is CPU intensive.
>
> Avoiding loops and doing things as sets is the best way. If possible. The only loop I saw was looping over the merge_xxx tables, which is probably the only way.
>
>
> There is an endless loop that is just a device for merging, but then there are loops going over each record in all the tables that are being merged, feeding them into the function that actually does the merge. That table iteration is what I want to eliminate (especially if I knew it would help :) )
>
> If possible (if you haven't already) you could add an extra column to your secondary table that you can set to the main table's key.
>
> bulk insert into second;
> update second set magic = (select key from main where ... );
>
> Then, maybe, you can do two ops in batch:
>
> update main (where key exists in main)
> insert into main (where key not exists in main)
>
>
> I was thinking along the same lines. I can't really do a bulk insert at any point, because any key can be inserted by another process at any time, with good probability. However, there will be a lot fewer inserts than updates. So, in general, I'm making it do this:
>

What about transactions? I assume you do something like:

begin;
merge_all;
commit;

Depending on your transaction isolation level, you could ensure that nobody could insert while you are inserting. I've never used the different isolation levels, so I'm not 100% sure. Even then, maybe a lock on the table itself, like:

begin;
lock table main in share row exclusive mode;
merge_all;
commit;  -- there's no explicit unlock; the lock is released at commit

This way only one process at a time can do the insert/update, but you can do it in batch and not in a loop. It might be faster that way. Other processes might wait for the lock a little bit, but if merge_all was faster in general, the time spent waiting for the lock would be less than the entire merge process takes now.
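
Something like this is what I mean (table and column names made up):

  begin;
  lock table main in share row exclusive mode;

  -- update the rows that already exist in main
  update main m
     set total = m.total + s.total
    from second s
   where m.key = s.key;

  -- insert the ones that don't
  insert into main (key, total)
  select s.key, s.total
    from second s
   where not exists (select 1 from main m where m.key = s.key);

  commit;  -- the lock is released here

Two set-based statements ought to beat a per-row loop, if the lock contention doesn't eat the gain.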

I'm totally guessing here.

-Andy
