Re: [noob] How to optimize this double pivot query?

From: Robert Buck <buck(dot)robert(dot)j(at)gmail(dot)com>
To: Samuel Gendler <sgendler(at)ideasculptor(dot)com>
Cc: Thomas Kellerer <spam_eater(at)gmx(dot)net>, "pgsql-sql(at)postgresql(dot)org" <pgsql-sql(at)postgresql(dot)org>
Subject: Re: [noob] How to optimize this double pivot query?
Date: 2012-10-03 16:23:36
Message-ID: CADf7wwW9eYnS443QDFPnfQgZCz0KbFa0Dc+2UKKuhmhVTDSyfg@mail.gmail.com
Lists: pgsql-sql

Thank you, Samuel.

I am trying some of this out right now...

This is great information.

Thanks so much. This is a huge help.

Bob

On Tue, Oct 2, 2012 at 4:09 PM, Samuel Gendler <sgendler(at)ideasculptor(dot)com> wrote:

> One last comment - an alternative solution to this is to use array_agg(),
> which will simply add each value that matches a group by clause to an
> array. If you are careful to structure your query with left joins so that
> every grouping will have the same number of rows, then you can get a quick
> and dirty crosstab like this
>
> select row_name, array_agg(value) from test t left join value_table v on
> t.id = v.id group by 1;
>
> Obviously, you'll want to deal with your potential for duplicate rows via
> max() or avg() in a subquery, rather than joining directly to the table,
> but you should get the idea from that example. You can also use coalesce
> to convert nulls to some other value, if required.
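>
> For example, something like this (a rough, untested sketch using the table
> names from your example query - adjust to your schema):
>
> select t.id_name, array_agg(coalesce(v.value, 0)) as vals
> from test t
> left join (
>     -- collapse duplicate test/key combinations before aggregating
>     select test_id, metadata_key_id, max(value_end) as value
>     from test_variables
>     group by test_id, metadata_key_id
> ) v on v.test_id = t.id
> group by t.id_name;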
>
> Since the crosstab functions already require you to do all that work with
> regard to determining column names and building up the record structure,
> using array_agg can be every bit as effective, since it basically requires
> the same process. First query for all possible names, then issue a query
> that will cause the values to be processed by array_agg in column order,
> then iterate over results, getting each array value and associating it with
> a particular name.
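>
> In rough terms, those steps might look like this (again, just a sketch with
> the same guessed table names):
>
> -- 1. get the set of key names, in a fixed order
> select distinct id_name from metadata_key order by id_name;
>
> -- 2. aggregate values in that same order, so the Nth array element lines
> --    up with the Nth name from step 1 (this assumes the left-join trick
> --    above gives you one row per key for every test)
> select t.id_name, max(t.begin_time) as begin_time, max(t.end_time) as end_time,
>        array_agg(v.value_end order by m.id_name) as vals
> from test t
> left join test_variables v on v.test_id = t.id
> left join metadata_key m on m.id = v.metadata_key_id
> group by t.id_name;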
>
> Your result will look like this:
>
> id_name, start_time, end_time, array_of_values
>
> That may or may not be convenient for you, depending upon how you are
> using the resultset you get back. You'll still need to play all the same
> games with regard to unioning multiple queries together to pivot data from
> multiple tables into the same row.
>
>
>
> On Tue, Oct 2, 2012 at 12:57 PM, Samuel Gendler <sgendler(at)ideasculptor(dot)com> wrote:
>
>
>>
>>
>> On Tue, Oct 2, 2012 at 2:45 AM, Robert Buck <buck(dot)robert(dot)j(at)gmail(dot)com> wrote:
>>
>>> Hi Samuel
>>>
>>> Thank you. This may be a bit of a stretch for you, but would it be
>>> possible for me to peek at a sanitized version of your cross tab query, for
>>> a good example on how to do this for this noob?
>>>
>>> This will be pretty common in my case. The biggest tables will get much
>>> larger as they are raw metrics feeds, which at some point need to be fed
>>> through reporting engines to analyze and spot regressions.
>>>
>>> Lastly, am I simply using the wrong tech for data feeds and analytics?
>>> The first cut of this used flat files and R, and though it scoured thousands
>>> of files it was much faster than the SQL I wrote here. The big goal was to get
>>> this off disk and into a database, but as it's highly variable, very sparse
>>> metric data, I chose k-v. SQL databases are internally more
>>> politically acceptable, though I am personally agnostic on the matter. In
>>> the end it would be nice to directly report off a database, but so long as
>>> I can transform to csv I can always perform reporting and analytics in R,
>>> and optionally map and reduce natively in Ruby. Sane? Ideas? This is early
>>> on, and willing to adjust course and find a better way if suggestions
>>> indicate such. I've heard a couple options so far.
>>>
>>
>> OK, you owe me. I think I just spent more than an hour writing this up
>> ;-)
>>
>> Given the numbers of rows you are talking about, I can't think of any
>> good reason why the database shouldn't do what you need it to do pretty
>> effectively/quickly.
>>
>> Some questions before I provide crosstab samples - your example query has
>> a hardcoded set of keys that it is looking for, but maybe that was
>> code-generated. There are multiple forms of the crosstab function, some of
>> which rely on a constant column count and others which can generate the set
>> of columns based on another query. There are complications in that second
>> form relating to race conditions with regard to new keys showing up between
>> the query for the columns and the query for the data, so it is important to
>> understand that and either structure queries accordingly or make sure you
>> execute in a transaction with sufficient transaction isolation to prevent
>> the race condition from appearing.
>>
>> For crosstab queries, you generally want a query that returns results in
>> the following form:
>>
>> row_name | key | value
>>
>> where row_name would be a date or some other label that all values that
>> should share a row will have in common. Key is the field that defines a
>> column in the final row. And value is the value to go in that column. If
>> it is the case that you always want ALL keys, then you can simply do a left
>> join to ensure that you get a row for every key, regardless of whether
>> there is an actual value. You can use COALESCE to turn nulls resulting
>> from left joins that don't match into '0' values. If there is the
>> potential for multiple rows with the same key and row_name, you need to use
>> an aggregate function (like max or avg) to compress all rows into a single
>> row.
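>>
>> For instance, the skeleton of that source query might be something like
>> this (sketch only, using the table names from your example query):
>>
>> select t.id_name as row_name, m.id_name as key,
>>        coalesce(max(v.value_end), 0) as value
>> from test t
>> cross join metadata_key m
>> left join test_variables v
>>        on v.test_id = t.id and v.metadata_key_id = m.id
>> group by t.id_name, m.id_name
>> order by t.id_name, m.id_name;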
>>
>> The crosstab(text sql) form of the function requires that it gets a value
>> for every possible column in every single row. It makes some naive
>> assumptions about the presence of a value for every spot in the matrix. It
>> also cannot include extra columns as plain payload in the final cross tab
>> result. For greater flexibility, you really want to use the crosstab(text
>> sql, text sql) form of the function. This allows you to specify the set of
>> columns that should appear via the 2nd query. When processing the results,
>> it makes fewer naive assumptions about the presence of data for every
>> column, so it correctly handles missing data (and, I assume, data for
>> columns that didn't appear in the 2nd query). This eliminates the need for
>> left joins to get empty rows, though you do still need to use aggregates to
>> combine multiple rows into one row.
>>
>> Your 2nd query would be a query that simply returns the distinct set of
>> key names from both of your key/value tables. You want them in a
>> consistent and repeatable order, so order by id_name:
>>
>> select id_name from (
>> select distinct m.id_name from metadata_key m
>> union
>> select distinct d.id_name from metric_def d
>> ) q order by id_name
>>
>> The nested union query is required based on what I can glean of your
>> schema from your examples, since you are getting keys from multiple tables
>> and we need them all in a single column in a consistent order.
>>
>> Now you just need a query that returns
>>
>> row_name | extra_col1 | extra_col2 | key | max(value)
>>
>> extra_col1 and extra_col2 are going to be the begin_time and end_time
>> from your test table, but you are going to have to do the union thing again
>> in order to extract results from multiple tables into a single column, and
>> you'll have to do the subquery thing in order to get the rows in consistent
>> order again:
>>
>> select row_name, begin_time, end_time, category, value from
>> (
>> select t.id_name as row_name, max(t.begin_time) as begin_time,
>> max(t.end_time) as end_time,
>> m.id_name as category, max(v.value_end) as value
>> from test t, metadata_key m, test_variables v
>> where v.test_id = t.id and m.id = v.metadata_key_id group by 1, 4
>>
>> union all -- use 'union all' to prevent the database from trying to de-dupe rows
>>
>> select t.id_name as row_name, max(t.begin_time) as begin_time,
>> max(t.end_time) as end_time,
>> d.id_name as category, max(r.value_end) as value
>> from test t, test_results r, metric_def d
>> where t.id = r.test_id and d.id = r.metric_def_id group by 1, 4
>> ) q
>> order by row_name, category
>>
>> order by is important, as the crosstab processor is naive - it processes
>> a row until a new row_name is found, then moves on. If it encounters the
>> same row_name later on, it will create a new row with the same name rather
>> than going back and adding columns to the existing row. I don't think
>> category order is important in this case, since it is doing category lookup
>> by name rather than position, but all of my example code does order
>> categories, too, probably just to be safe.
>>
>> Now there is one final bit of weirdness related to the crosstab
>> functions. You must specify the structure of the returned results when you
>> call it, because it returns a setof record.
>>
>> So you either call it like this:
>>
>> select * from crosstab(sql1, sql2) as myresult(rowname text, begin_time
>> timestamp, end_time timestamp, col1 int, col2 int, col3 int, col4 int,
>> col5 int, ...)
>>
>> or if you know that you will have exactly 35 columns every single time,
>> you can declare an alias to the function which specifies the structure of
>> the returned records, which will allow you to call it simply as:
>>
>> select * from crosstab35(sql, sql);
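>>
>> One way to set that up is a plain SQL wrapper around crosstab. A sketch
>> with made-up names and only three value columns, just to show the shape
>> (a real crosstab35 would spell out all 35; the tablefunc docs also show
>> how to wire the row type straight into an alias of the C function):
>>
>> create type my_pivot_row as (
>>     row_name text,
>>     begin_time timestamp,
>>     end_time timestamp,
>>     a1 numeric,
>>     a2 numeric,
>>     a3 numeric
>> );
>>
>> create function my_crosstab3(text, text) returns setof my_pivot_row as $$
>>     select * from crosstab($1, $2)
>>         as ct(row_name text, begin_time timestamp, end_time timestamp,
>>               a1 numeric, a2 numeric, a3 numeric);
>> $$ language sql;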
>>
>> The documentation for the tablefunc module (
>> http://www.postgresql.org/docs/9.1/static/tablefunc.html ) explains how
>> to set up the aliases you might want if you have a consistent column set.
>> If you don't have a consistent column set, you have to first query for the
>> names of all possible columns. Then construct your two queries for the
>> crosstab function, along with the 'as' clause, using the names that were
>> returned from your initial call to limit the results in the eventual
>> crosstab execution. In other words, you need to make sure that the 2nd sql
>> statement in the crosstab call cannot possibly return more columns than the
>> set of columns you define in the 'as' clause, so you must either execute
>> both queries in a transaction with at least REPEATABLE READ isolation
>> semantics, or else include a where clause in your column query that limits
>> the columns returned to the same set of columns you got in the first query
>> - select column_name from table where column_name in (...).
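>>
>> In practice that can be as simple as something like this (a sketch; the
>> key names are just placeholders):
>>
>> begin isolation level repeatable read;
>> -- 1. select the distinct key names
>> -- 2. build and run the crosstab(sql, sql) call with a matching 'as' clause
>> commit;
>>
>> or, if you would rather not depend on isolation, pin the category query to
>> the names you already fetched:
>>
>> select id_name from metadata_key
>> where id_name in ('key_a', 'key_b', 'key_c')
>> order by id_name;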
>>
>> Basically, the flow looks like this:
>>
>> select all possible column names from database.
>>
>> build data query for crosstab func.
>> build column query for crosstab func (should be the same or similar to
>> the earlier query for column names).
>> build 'as' clause for crosstab func - I just name my columns a1, a2, a3,
>> a4, etc, knowing that they will be in the same order as my column names,
>> since my column names are often not strings that would be legal column
>> names in a query.
>>
>> execute query "select * from crosstab(" + data_query + "," + column_query
>> + ")" + as_clause
>> process crosstab result
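>>
>> Put together, the statement you end up executing looks something like this
>> (sketch only - trimmed to the metadata_key branch and three value columns,
>> with the same guessed names as above):
>>
>> select * from crosstab(
>>     $$
>>     select t.id_name as row_name, max(t.begin_time) as begin_time,
>>            max(t.end_time) as end_time,
>>            m.id_name as category, max(v.value_end) as value
>>     from test t, metadata_key m, test_variables v
>>     where v.test_id = t.id and m.id = v.metadata_key_id
>>     group by 1, 4
>>     order by 1, 4
>>     $$,
>>     $$ select distinct id_name from metadata_key order by id_name $$
>> ) as ct(row_name text, begin_time timestamp, end_time timestamp,
>>         a1 numeric, a2 numeric, a3 numeric);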
>>
>> I'll admit, it's a little difficult to set up correctly the first time
>> you do it. But execution is actually pretty efficient, though it is only
>> doing pretty much exactly what you'd do if you were to pivot the data
>> manually while iterating over the resultset for the data query. If you're
>> working in a dynamically typed language that lets you easily write code
>> that has efficient hash maps and the like, and you only have this one use
>> case, it might actually be easier just to query for the raw data and pivot
>> it yourself, where you'll have much more flexibility in dealing with column
>> names and such. But if you have a wide variety of uses, it's probably
>> worthwhile to become familiar with the crosstab functionality, since it is
>> pretty simple to use once you get the hang of it. I've got fairly
>> boilerplate code that deals with things like generating the record
>> structure so I can usually add a new bit of crosstab functionality to an
>> app pretty darn quickly these days.
>>
>> One thing that's a drag is that the queries are just strings as far as
>> the database is concerned, so depending upon your client library, you may
>> not be able to use variable escaping best practices. Basically, jdbc won't
>> allow you to use prepared statement variables within a string, so you
>> have to manually escape any user-provided input while constructing the
>> query as a raw string, rather than using '?' in your query. The same goes
>> for trying to slip a crosstab query through hibernate's SQLQuery interface
>> (I thought that might be a workaround, but it wasn't), so you are forced to
>> construct the entire query string manually.
>>
>> --sam
>>
>>
>>
>
