Re: Improving performance of merging data between tables

From: Pawel Veselov <pawel(dot)veselov(at)gmail(dot)com>
To: Andy Colson <andy(at)squeakycode(dot)net>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Improving performance of merging data between tables
Date: 2014-12-30 05:29:59
Message-ID: CAMnJ+BfQOBzv8L_VzdKeWWXgJW1Y2pjXq6EHu2DpvqPpZpfFNg@mail.gmail.com
Lists: pgsql-general

Andy,

thanks for looking into this.

On Mon, Dec 29, 2014 at 9:00 AM, Andy Colson <andy(at)squeakycode(dot)net> wrote:

> On 12/28/2014 3:49 PM, Pawel Veselov wrote:
>
>> Hi.
>>
>> I was wondering if anybody would have any ideas on how to improve
>> certain operations that we are having.
>>
>> <SNIP>
>
>>
>> Besides "can somebody please look at this and let me know if I'm doing
>> something utterly stupid", here are my questions.
>>
>> 1) How do I find out what exactly is consuming the CPU in a PL/pgSQL
>> function? All I see is that the calls to merge_all() function take long
>> time, and the CPU is high while this is going on.
>>
>>
>>
> First, I'll admit I didn't read your entire post.
>
> I can think of a couple methods:
>
> 1) try each of the statements in merge_all by hand with an "explain
> analyze" in front to see which is slow. Look for things that hit big
> tables without an index. Check that fk lookups are indexed.
>

If I didn't miss anything, that seems to be OK, even on function-based
queries.
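
(For reference, this is roughly how I was checking the individual statements; the statement itself is just a stand-in for the real ones:)

begin;
explain (analyze, buffers)
    update main_table
       set val = val + 1
     where key_field = 'some-key';
rollback;  -- explain analyze actually executes the statement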

> 2) try pg_stat_statements, setting "pg_stat_statements.track = all". see:
> http://www.postgresql.org/docs/9.4/static/pgstatstatements.html
>
> I have used this to profile some functions, and it worked pretty well.
> Mostly I use it on a test box, but once ran it on the live, which was
> scary, but worked great.
>

That looks promising. I turned it on, and I'm waiting for the next "quiet
time" when I can restart the server.
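
Once it has collected some data, I'm planning to pull the worst offenders
with something like this (columns as in the 9.4 docs):

select query, calls, total_time, rows
  from pg_stat_statements
 order by total_time desc
 limit 20;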

> 3) try auto-explain:
> http://www.postgresql.org/docs/9.4/static/auto-explain.html
>
> I've never used it, so don't know if it'll show each statement inside a
> function. Dumps stuff to the log AFAIK, so you'll have to dig out the info
> by hand.
>
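
I haven't tried that one yet. If I end up needing it, I'd expect enabling it
for a single session to look roughly like this (option names as in the docs;
untested on my side):

load 'auto_explain';
set auto_explain.log_min_duration = '250ms';
set auto_explain.log_analyze = on;
-- log_nested_statements is what should make the statements inside the function show up
set auto_explain.log_nested_statements = on;
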
> > 2) Is there a better way to merge individual rows, except doing
> > UPDATE/INSERT in a loop, and would that be CPU expensive?
> >
>
> Not that I know of. I use pretty much the same thing. Soon! we will
> have merge/upsert support. Hopefully it'll be fast.

Well, any time I cancelled the PID that was executing this whole mess, it
would always stop at an UPDATE ... SET ... WHERE on the main table, which does
make me believe that a bulk update would really help.

> > 3) Is there a better way to merge whole tables? However, note that I
> > need to translate primary keys from node main table into the common main
> > table, as they are used as foreign keys, hence the loops. I suspect the
> > looping is CPU intensive.
>
> Avoiding loops and doing things as sets is the best way. If possible. The
> only loop I saw was looping over the merge_xxx tables, which is probably
> the only way.
>

There is an endless loop that is just a device for merging, but then there
are loops going over each record in all the tables that are being merged,
feeding them into the function that actually does the merge. That per-record
iteration is what I want to eliminate (especially if I knew it would help :) )
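
Roughly, the current shape of it is this (very much simplified, the names are
just placeholders):

for r in select * from node_table loop
    -- one call per row; this per-row iteration is what I want to get rid of
    perform merge_function(r.key_fields, r.val);
end loop;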

> If possible (if you haven't already) you could add an extra column to
> your secondary table that you can set as the main table's key.
>
> bulk insert into second;
> update second set magic = (select key from main where ... );
>
> Then, maybe, you can do two ops in batch:
>
> update main (where key exists in main)
> insert into main (where key not exists in main)
>
>
I was thinking along the same lines. I can't really do a bulk insert at any
point, because any key can be inserted by another process at any time, with a
good probability. However, there will be a lot fewer inserts than updates. So,
in general, I'm making it do this:

with pivot as ( select main_table.id as main_id, node_table.id as node_id
  from node_table left join main_table using (key fields) )
update node_table set translate_id = pivot.main_id
  from pivot
  where node_table.id = pivot.node_id;

(missing is a cursor: select * from node_table where translate_id is null)

for row in missing loop
    -- merge_function will return the PK of either the updated or the inserted record.
    -- Pass zero data values, so that if the merge results in an update it's an
    -- identity update, and if it results in an insert, the data is "empty".
    select merge_function(row.key_fields, 0) into use_id;
    update node_table set translate_id = use_id where current of missing;
end loop

At this point, I have a guarantee that I can update all records, and there
is nothing to insert.
So,

with new as ( select * from node_table )
update main_table old
  set val = f(old.val, new.val)
  from new
  where new.translate_id = old.id;

So I don't need full key matching anymore; I can use PKs instead.
