Re: [noob] How to optimize this double pivot query?

From: Samuel Gendler <sgendler(at)ideasculptor(dot)com>
To: Robert Buck <buck(dot)robert(dot)j(at)gmail(dot)com>
Cc: Thomas Kellerer <spam_eater(at)gmx(dot)net>, "pgsql-sql(at)postgresql(dot)org" <pgsql-sql(at)postgresql(dot)org>
Subject: Re: [noob] How to optimize this double pivot query?
Date: 2012-10-03 20:05:50
Message-ID: CAEV0TzCHcA_oA2KOKFMn3y_sWYoS56a+4sPHRMqHdgpxOdoMCw@mail.gmail.com
Lists: pgsql-sql

On Wed, Oct 3, 2012 at 9:31 AM, Robert Buck <buck(dot)robert(dot)j(at)gmail(dot)com> wrote:

> Samuel,
>
> You asked:
>
> Some questions before I provide crosstab samples - your example query has
> a hardcoded set of keys that it is looking for, but maybe that was
> code-generated. There are multiple forms of the crosstab function, some of
> which rely on a constant column count and others which can generate the set
> of columns based on another query. There are complications in that second
> form relating to race conditions with regard to new keys showing up between
> the query for the columns and the query for the data, so it is important to
> understand that and either structure queries accordingly or make sure you
> execute in a transaction with sufficient transaction isolation to prevent
> the race condition from appearing.
>
> The query was hand-written from examples I found on the web; the list
> here was hand-written too. This will run in Ruby, so I can template as
> much as needed on that side. I can also run this in a defined
> transaction boundary rather than using auto-commit. Right now I am
> thumbing through your email and trying it out.
>

Actually, it isn't sufficient just to run within a defined transaction,
since the default isolation level in PostgreSQL doesn't guarantee that
you'll see the same rows in two queries during the same transaction. If
another transaction commits between your queries, you'll see the new rows
in the second query. The default isolation level is 'read committed',
which only promises that you'll see rows that have been committed; it
makes no guarantee that two queries will get the same results if another
transaction commits a change between them. To prevent such phantom reads,
you need your transaction isolation set to at least 'repeatable read'
(which in PostgreSQL takes a single snapshot for the whole transaction),
or to 'serializable'.

For details, see this:

http://www.postgresql.org/docs/9.1/static/transaction-iso.html
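
To illustrate (the table and column names here are made up), running
both queries against one snapshot looks like:

  BEGIN ISOLATION LEVEL REPEATABLE READ;

  -- 1) discover the column set
  SELECT DISTINCT metric_key FROM test_results ORDER BY 1;

  -- 2) fetch the data; this runs against the same snapshot as (1),
  --    so a key committed in between cannot show up here
  SELECT test_id, metric_key, metric_value
  FROM test_results
  ORDER BY 1, 2;

  COMMIT;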

*In general, I find that it is easier to set up my crosstab processing to
generate consistent column sets by including a where clause in all but the
first query, such that those queries always match the column set returned
by the first.* That won't work if column categories can be deleted, since
you'd still get a different column set from one query to the next, but the
vast majority of my apps don't allow actual row deletion, or it is at
least incredibly rare and only done by administrative users, not end-users
of the app. Generally, rows representing domain entities get marked as
disabled, so rows going away isn't usually a factor for me. It also means
I don't have to worry that someone who comes along and re-uses my code
will forget to set transaction isolation correctly when running the
crosstab query and get wacky results.
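
As a sketch, with a made-up schema test_results(test_id, metric_key,
metric_value) and the tablefunc extension installed: say the first
query, run from application code to build the column definition list,
was

  SELECT DISTINCT metric_key FROM test_results ORDER BY 1;

and returned 'cpu', 'io', 'mem'. The crosstab call then pins both of
its queries to exactly that set, so a key committed in the meantime
cannot produce an unexpected column:

  SELECT *
  FROM crosstab(
    $$SELECT test_id, metric_key, metric_value
      FROM test_results
      WHERE metric_key IN ('cpu', 'io', 'mem')
      ORDER BY 1, 2$$,
    $$SELECT DISTINCT metric_key
      FROM test_results
      WHERE metric_key IN ('cpu', 'io', 'mem')
      ORDER BY 1$$
  ) AS ct(test_id int, cpu numeric, io numeric, mem numeric);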

Note that if you do use serializable transactions to do this, you should
use a read-only transaction to cut down on locking overhead and the
possibility that other transactions will fail due to serialization errors
while your crosstab is running.
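
That's just:

  BEGIN ISOLATION LEVEL SERIALIZABLE READ ONLY;
  -- column query and crosstab query go here
  COMMIT;

On 9.1 and later you can also tack on DEFERRABLE (BEGIN ISOLATION LEVEL
SERIALIZABLE READ ONLY DEFERRABLE); that may wait a moment for a safe
snapshot, but then runs with no risk of causing or being canceled by a
serialization failure.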

This all seems terribly complex when describing it, but it really is pretty
easy once you've done it once. The column-count/serializable thing applies
to using array_agg to build a crosstab, too, since you still need to ensure
that your arrays all have the same number of entries in the same order.
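
For example, against the same made-up test_results table, a LEFT JOIN
from the full test/key grid pads missing metrics with NULLs, so every
array comes back with the same length and ordering:

  SELECT t.test_id,
         array_agg(r.metric_value ORDER BY k.metric_key) AS metric_values
  FROM (SELECT DISTINCT test_id FROM test_results) t
  CROSS JOIN (SELECT DISTINCT metric_key FROM test_results) k
  LEFT JOIN test_results r
         ON r.test_id = t.test_id
        AND r.metric_key = k.metric_key
  GROUP BY t.test_id
  ORDER BY t.test_id;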

--sam
