Re: Insert performance for large transaction with multiple COPY FROM

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Horst Dehmer <horst(dot)dehmer(at)gmail(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Insert performance for large transaction with multiple COPY FROM
Date: 2013-01-17 18:12:06
Message-ID: CAMkU=1ykVF4Ry6guyhFDsHr-JJhtwC35+fsPfxUQMri-7bw_+g@mail.gmail.com
Lists: pgsql-performance

On Tue, Jan 15, 2013 at 3:44 PM, Horst Dehmer <horst(dot)dehmer(at)gmail(dot)com> wrote:

> idx_4 together with a simple select in the table's on-insert trigger is
> slowing things down considerably.

So the theory is that the presence of idx_4 is causing the trigger to
pick a poor plan (i.e. one using idx_4) while its absence removes that
temptation?

> pg_statio comes up with the same big numbers (reads = bad, hits = not so bad?):

True disk reads are much more expensive, but given how few reads you
have relative to hits, I now think that in aggregate the hits are more
of a concern than the reads are. In other words, you seem to be CPU
bound, not IO bound.

Even more so I think (but not with much confidence) that most of your
"reads" are actually coming from the OS cache and not from the disk.
PG cannot distinguish true disk reads from OS cache reads.
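
For reference, a query along these lines shows the split per table (the
table name is just a placeholder for yours):

    SELECT relname,
           heap_blks_read, heap_blks_hit,
           idx_blks_read, idx_blks_hit
    FROM pg_statio_user_tables
    WHERE relname = 'your_table';   -- substitute the real table name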

When was the last time you reset the stats? That is, are your
reported numbers accumulated over several loads, with some having
idx_4 and some not?
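
If you want a clean baseline for the next run, you can zero the
counters first; pg_stat_reset() clears the statistics for the current
database:

    SELECT pg_stat_reset();   -- resets all per-database statistics counters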

...
>
> Now I have some (more) questions:
>
> 1. How do I know which index (if any) is chosen for a select statement
> inside a trigger during a bulk load transaction? (or for that matter: a
> series of recursive plpgsql functions)

Informally, reset your database to the state it was in before the
load, analyze it, and do the explain again before you do the load.

More formally, use auto_explain and set
auto_explain.log_nested_statements to true. I haven't verified that
this works with triggers, but going by the description I think it should.
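
Roughly, in the session doing the load (or in postgresql.conf if you
want it for every session), something like:

    LOAD 'auto_explain';
    SET auto_explain.log_min_duration = 0;          -- log a plan for every statement
    SET auto_explain.log_nested_statements = true;  -- include statements fired from triggers/functions
    SET auto_explain.log_analyze = true;            -- optional: actual rows/timings, adds some overhead

The plans then show up in the server log, so you can see which index
the trigger's select actually used.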

> 2. The query planner depends on stats collected by auto-vacuum/vacuum
> analyze, right? Does stats collecting also happen during a lengthy
> transaction?

My understanding is that a transaction will not dump its stats until
the commit, so the auto-analyze will not occur *due to* the lengthy
transaction until after it is over. But if the table was already due
for analyze anyway because of previous or concurrent shorter
transactions, the analyze will happen. However, the lengthy transaction
might not see the results of the analyze (I'm not clear on the
transaction snapshot semantics of the statistics tables), and even if
it did see them, it might just be using cached plans and so would not
change the plan in the middle.
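
One way to check whether an (auto)analyze has actually happened on a
given table, from a separate session, is something like (table name is
a placeholder):

    SELECT relname, last_analyze, last_autoanalyze
    FROM pg_stat_user_tables
    WHERE relname = 'your_table';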

> 3. Is it possible (or even advisable) to trigger vacuum analyze inside an
> ongoing transaction. Let's say load 10,000 rows of table A, analyze table A,
> insert the next 10,000 rows, analyze again, ...

You can't run VACUUM inside a transaction block. You can run ANALYZE,
but I don't know whether it would be advisable.
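
If you do want to experiment with it, the interleaving would look
roughly like this (table and file names are only placeholders):

    BEGIN;
    COPY your_table FROM '/path/to/chunk_0001.csv' WITH (FORMAT csv);
    ANALYZE your_table;   -- allowed inside a transaction block, unlike VACUUM
    COPY your_table FROM '/path/to/chunk_0002.csv' WITH (FORMAT csv);
    ANALYZE your_table;
    -- ... and so on for the remaining chunks
    COMMIT;

Whether the trigger's cached plan picks up the new statistics
mid-transaction is a separate question, as above.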

Your use case is a little unusual. If you are bulk loading into an
initially empty table, usually you would remove the trigger and add it
after the load (with some kind of bulk operation to make up for
whatever it was the trigger would have been doing). On the other
hand, if you are bulk loading into a "live" table and so can't drop
the trigger, then the live table should have good-enough preexisting
statistics to make the trigger choose a good plan.
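
For the empty-table case, the shape would be roughly this (shown with
DISABLE/ENABLE TRIGGER rather than actually dropping and recreating the
trigger, which comes to the same thing here; all names are placeholders):

    ALTER TABLE your_table DISABLE TRIGGER your_insert_trigger;
    COPY your_table FROM '/path/to/bulk_data.csv' WITH (FORMAT csv);
    -- one set-based statement here to do whatever the per-row trigger
    -- would have done, e.g. a single INSERT ... SELECT or UPDATE
    ALTER TABLE your_table ENABLE TRIGGER your_insert_trigger;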

Cheers,

Jeff
