Re: [noob] How to optimize this double pivot query?

From: Samuel Gendler <sgendler(at)ideasculptor(dot)com>
To: Robert Buck <buck(dot)robert(dot)j(at)gmail(dot)com>
Cc: Thomas Kellerer <spam_eater(at)gmx(dot)net>, "pgsql-sql(at)postgresql(dot)org" <pgsql-sql(at)postgresql(dot)org>
Subject: Re: [noob] How to optimize this double pivot query?
Date: 2012-10-02 19:57:08
Message-ID: CAEV0TzDuvC=f=iGkEvSyEUBsfR0xAgUfoHxt6sXrM+avBYeihA@mail.gmail.com
Lists: pgsql-sql

On Tue, Oct 2, 2012 at 2:45 AM, Robert Buck <buck(dot)robert(dot)j(at)gmail(dot)com> wrote:

> Hi Samuel
>
> Thank you. This may be a bit of a stretch for you, but would it be
> possible for me to peek at a sanitized version of your cross tab query, for
> a good example on how to do this for this noob?
>
> This will be pretty common in my case. The biggest tables will get much
> larger as they are raw metrics feeds, which at some point need to be fed
> through reporting engines to analyze and spot regressions.
>
> Lastly, am I simply using the wrong tech for data feeds and analytics? The
> first cut of this used flat files and R, and though it scoured thousands of
> files it was much faster than the SQL I wrote here. The big goal was to get
> this off disk and into a database, but as it's highly variable, very sparse,
> metric data, this is why I chose k-v. SQL databases are internally more
> politically acceptable, though I am personally agnostic on the matter. In
> the end it would be nice to directly report off a database, but so long as
> I can transform to csv I can always perform reporting and analytics in R,
> and optionally map and reduce natively in Ruby. Sane? Ideas? This is early
> on, and willing to adjust course and find a better way if suggestions
> indicate such. I've heard a couple options so far.
>

OK, you owe me. I think I just spent more than an hour writing this up ;-)

Given the numbers of rows you are talking about, I can't think of any good
reason why the database shouldn't do what you need it to do pretty
effectively/quickly.

Some questions before I provide crosstab samples - your example query has a
hardcoded set of keys that it is looking for, but maybe that was
code-generated. There are multiple forms of the crosstab function, some of
which rely on a constant column count and others which can generate the set
of columns based on another query. There are complications in that second
form relating to race conditions with regard to new keys showing up between
the query for the columns and the query for the data, so it is important to
understand that and either structure queries accordingly or make sure you
execute in a transaction with sufficient transaction isolation to prevent
the race condition from appearing.

For crosstab queries, you generally want a query that returns results in
the following form:

row_name | key | value

where row_name would be a date or some other label that all values that
should share a row will have in common. Key is the field that defines a
column in the final row. And value is the value to go in that column. If
it is the case that you always want ALL keys, then you can simply do a left
join to ensure that you get a row for every key, regardless of whether
there is an actual value. You can use COALESCE to turn nulls resulting
from left joins that don't match into '0' values. If there is the
potential for multiple rows with the same key and row_name, you need to use
an aggregate function (like max or avg) to compress all rows into a single
row.
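
For instance, a minimal sketch of a query in that shape against your
metadata_key / test_variables tables might look like the following (column
names are guessed from your example query, and the cross join is only there
so that every key shows up for every test even when there is no value):

select t.id_name as row_name,
       k.id_name as key,
       coalesce(max(v.value_end), 0) as value
from test t
cross join metadata_key k
left join test_variables v
  on v.test_id = t.id and v.metadata_key_id = k.id
group by 1, 2
order by 1, 2;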

The crosstab(text sql) form of the function requires that it gets a value
for every possible column in every single row. It makes some naive
assumptions about the presence of a value for every spot in the matrix. It
also cannot include extra columns as plain payload in the final cross tab
result. For greater flexibility, you really want to use the crosstab(text
sql, text sql) form of the function. This allows you to specify the set of
columns that should appear via the 2nd query. When processing the results,
it makes fewer naive assumptions about the presence of data for every
column, so it correctly handles missing data (and, I assume, data for
columns that didn't appear in the 2nd query). This eliminates the need for
left joins to get empty rows, though you do still need to use aggregates to
combine multiple rows into one row.

Your 2nd query would be a query that simply returns the distinct set of key
names from both of your key/value tables. You want them in a consistent
and repeatable order, so order by id_name:

select id_name from (
  select distinct m.id_name from metadata_key m
  union
  select distinct d.id_name from metric_def d
) q order by id_name

The nested union query is required based on what I can glean of your schema
from your examples, since you are getting keys from multiple tables and we
need them all in a single column in a consistent order.

Now you just need a query that returns

row_name | extra_col1 | extra_col2 | key | max(value)

extra_col1 and extra_col2 are going to be the begin_time and end_time from
your test table, but you are going to have to do the union thing again in
order to extract results from multiple tables into a single column, and
you'll have to do the subquery thing in order to get the rows in consistent
order again:

select row_name, begin_time, end_time, category, value from
(
  select t.id_name as row_name, max(t.begin_time) as begin_time,
         max(t.end_time) as end_time,
         m.id_name as category, max(v.value_end) as value
  from test t, metadata_key m, test_variables v
  where v.test_id = t.id and m.id = v.metadata_key_id
  group by 1, 4

  union all  -- use 'union all' to prevent the database from trying to de-dupe rows

  select t.id_name as row_name, max(t.begin_time) as begin_time,
         max(t.end_time) as end_time,
         d.id_name as category, max(r.value_end) as value
  from test t, test_results r, metric_def d
  where r.test_id = t.id and d.id = r.metric_def_id
  group by 1, 4
) q
order by row_name, category

order by is important, as the crosstab processor is naive - it processes a
row until a new row_name is found, then moves on. If it encounters the
same row_name later on, it will create a new row with the same name rather
than going back and adding columns to the existing row. I don't think
category order is important in this case, since it is doing category lookup
by name rather than position, but all of my example code does order
categories, too, probably just to be safe.

Now there is one final bit of weirdness related to the crosstab functions.
You must specify the structure of the returned results when you call it,
because it returns a setof record.

So you either call it like this:

select * from crosstab(sql1, sql2) as myresult(rowname text,
begin_time timestamp, end_time timestamp, col1 int, col2 int, col3 int,
col4 int, col5 int, ...)

or if you know that you will have exactly 35 columns every single time, you
can declare an alias to the function which specifies the structure of the
returned records, which will allow you to call it simply as:

select * from crosstab35(sql, sql);
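
Declaring such an alias is just a matter of creating a composite type and a
function that points at the tablefunc library, along the lines of the example
in the docs. A rough sketch (type name, column names, and column count here
are made up - extend it out to however many keys you really have):

create type metrics_crosstab_row as (
  row_name text,
  begin_time timestamp,
  end_time timestamp,
  a1 numeric,
  a2 numeric,
  a3 numeric
  -- ...and so on, one column per key
);

create or replace function crosstab35(text, text)
  returns setof metrics_crosstab_row
  as '$libdir/tablefunc', 'crosstab_hash'
  language c stable strict;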

The documentation for the tablefunc module (
http://www.postgresql.org/docs/9.1/static/tablefunc.html ) explains how to
set up the aliases you might want if you have a consistent column set. If
you don't have a consistent column set, you have to first query for the
names of all possible columns. Then construct your two queries for the
crosstab function, along with the 'as' clause, using the names that were
returned from your initial call to limit the results in the eventual
crosstab execution. In other words, you need to make sure that the 2nd sql
statement in the crosstab call cannot possibly return more columns than the
set of columns you define in the 'as' clause, so you must either execute
both queries in a transaction with at least REPEATABLE READ isolation
semantics, or else include a where clause in your column query that limits
the columns returned to the same set of columns you got in the first query
- select column_name from table where column_name in (...).
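
As a sketch, the transaction route looks something like this (the crosstab
call itself is left commented out, since its 'as' clause depends on whatever
the first query returns):

begin isolation level repeatable read;

-- 1. Fetch the column names. Repeatable read guarantees the crosstab call
--    below will see exactly the same set of keys.
select id_name from (
  select distinct m.id_name from metadata_key m
  union
  select distinct d.id_name from metric_def d
) q order by id_name;

-- 2. Build the 'as' clause from those names in your client code, then run
--    the crosstab call inside this same transaction, e.g.:
-- select * from crosstab('<data query>', '<column query>')
--   as ct(row_name text, begin_time timestamp, end_time timestamp,
--         a1 numeric, a2 numeric, a3 numeric);

commit;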

Basically, the flow looks like this:

select all possible column names from database.

build data query for crosstab func.
build column query for crosstab func (should be the same or similar to the
earlier query for column names).
build 'as' clause for crosstab func - I just name my columns a1, a2, a3,
a4, etc, knowing that they will be in the same order as my column names,
since my column names are often not strings that would be legal column
names in a query.

execute query "select * from crosstab(" + data_query + "," + column_query +
")" + as_clause
process crosstab result

I'll admit, it's a little difficult to set up correctly the first time you
do it. But execution is actually pretty efficient, though it is only doing
pretty much exactly what you'd do if you were to pivot the data manually
while iterating over the resultset for the data query. If you're working
in a dynamically typed language that lets you easily write code that has
efficient hash maps and the like, and you only have this one use case, it
might actually be easier just to query for the raw data and pivot it
yourself, where you'll have much more flexibility in dealing with column
names and such. But if you have a wide variety of uses, it's probably
worthwhile to become familiar with the crosstab functionality, since it is
pretty simple to use once you get the hang of it. I've got fairly
boilerplate code that deals with things like generating the record
structure so I can usually add a new bit of crosstab functionality to an
app pretty darn quickly these days.

One thing that's a drag is that the queries are just strings as far as the
database is concerned, so depending upon your client library, you may not
be able to use variable escaping best practices. Basically, jdbc won't
allow you to use prepared statement variables within a string, so you
have to manually escape any user-provided input while constructing the
query as a raw string, rather than using '?' in your query. The same goes
for trying to slip a crosstab query through hibernate's SQLQuery interface
(I thought that might be a workaround, but it wasn't), so you are forced to
construct the entire query string manually.

--sam
