Re: [noob] How to optimize this double pivot query?

From: Robert Buck <buck(dot)robert(dot)j(at)gmail(dot)com>
To: David Johnston <polobo(at)yahoo(dot)com>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: [noob] How to optimize this double pivot query?
Date: 2012-10-02 01:13:21
Message-ID: CADf7wwXqQYEgfCdgAuz974EaHhE0-XjeUE2RaTPGskxo8Yfi6Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

So as you can probably glean, the tables store performance metric data. The
reason I chose to use k-v is simply to avoid having to create an additional
column every time a new metric type come along. So those were the two
options I thought of, straight k-v and column for every value type.

Are there other better options worth considering that you could point me
towards that supports storing metrics viz. with an unbounded number of
metric types in my case?

Bob

On Mon, Oct 1, 2012 at 9:07 PM, David Johnston <polobo(at)yahoo(dot)com> wrote:

> *From:* pgsql-sql-owner(at)postgresql(dot)org [mailto:
> pgsql-sql-owner(at)postgresql(dot)org] *On Behalf Of *Robert Buck
> *Sent:* Monday, October 01, 2012 8:47 PM
> *To:* pgsql-sql(at)postgresql(dot)org
> *Subject:* [SQL] [noob] How to optimize this double pivot query?****
>
> ** **
>
> I have two tables that contain key-value data that I want to combine in
> pivoted form into a single result set. They are related to two separate
> tables.
>
> The tables are: test_results, test_variables, metric_def, metadata_key.
> The latter two tables are enum-like tables, basic descriptors of data
> stored in other tables. The former two tables are basically key-value
> tables (with ids as well); these k-v tables are related to the latter two
> tables via foreign keys.
>
> The following SQL takes about 11 seconds to run on a high-end laptop. The
> largest table is about 54k records, pretty puny.
>
> Can someone provide a hint as to why this is so slow? Again, I am a noob
> to SQL, so the SQL is probably poorly written.
>
> ****
>
> ** **
>
> Your query, while maybe not great, isn’t the cause of your problem. It is
> the table schema, specifically the “key-value” aspect, that is killing you.
> ****
>
> ** **
>
> You may want to try:****
>
> ** **
>
> SELECT *****
>
> FROM (SELECT id FROM …) id_master****
>
> NATURAL LEFT JOIN (SELECT id, field_value AS … FROM … WHERE fieldtype =
> ‘’) f1****
>
> NATURAL LEFT JOIN (SELECT id, field_value AS … FROM … WHERE fieldtype =
> ‘’) f2****
>
> [repeat one left join for every field; though you will then need to decide
> if/how to deal with NULL – not that you are currently doing anything
> special anyway…]****
>
> ** **
>
> Mainly the above avoids the use of “max()” and instead uses direct joins
> between the relevant tables. I have no clue whether that will improve
> things but if you are going to lie in this bed you should at least try
> different positions.****
>
> ** **
>
> The better option is to educate yourself on better ways of constructing
> the tables so that you do not have to write this kind of god-awful query.
> In some cases key-value has merit but usually only when done in
> moderation. Not for the entire database. You likely should simply have a
> table that looks like the result of the query below.****
>
> ** **
>
> As a second (not necessarily mutually exclusive) alternative: install and
> use the hstore extension.****
>
> ** **
>
> David J.****
>
> ** **
>
>
> Thanks in advance,
>
> Bob
>
> select
>
> t.id_name,
> max(t.begin_time) as begin_time,
> max(t.end_time) as end_time,
>
> max(case when (m.id_name = 'package-version') then v.value end) as
> package_version,
> max(case when (m.id_name = 'database-vendor') then v.value end) as
> database_vendor,
> max(case when (m.id_name = 'bean-name') then v.value end) as bean_name,
> max(case when (m.id_name = 'request-distribution') then v.value end)
> as request_distribution,
> max(case when (m.id_name = 'ycsb-workload') then v.value end) as
> ycsb_workload,
> max(case when (m.id_name = 'record-count') then v.value end) as
> record_count,
> max(case when (m.id_name = 'transaction-engine-count') then v.value
> end) as transaction_engine_count,
> max(case when (m.id_name = 'transaction-engine-maxmem') then v.value
> end) as transaction_engine_maxmem,
> max(case when (m.id_name = 'storage-manager-count') then v.value end)
> as storage_manager_count,
> max(case when (m.id_name = 'test-instance-count') then v.value end) as
> test_instance_count,
> max(case when (m.id_name = 'operation-count') then v.value end) as
> operation_count,
> max(case when (m.id_name = 'update-percent') then v.value end) as
> update_percent,
> max(case when (m.id_name = 'thread-count') then v.value end) as
> thread_count,
>
> max(case when (d.id_name = 'tps') then r.value end) as tps,
> max(case when (d.id_name = 'Memory') then r.value end) as memory,
> max(case when (d.id_name = 'DiskWritten') then r.value end) as
> disk_written,
> max(case when (d.id_name = 'PercentUserTime') then r.value end) as
> percent_user,
> max(case when (d.id_name = 'PercentCpuTime') then r.value end) as
> percent_cpu,
> max(case when (d.id_name = 'UserMilliseconds') then r.value end) as
> user_milliseconds,
> max(case when (d.id_name = 'YcsbUpdateLatencyMicrosecs') then r.value
> end) as update_latency,
> max(case when (d.id_name = 'YcsbReadLatencyMicrosecs') then r.value
> end) as read_latency,
> max(case when (d.id_name = 'Updates') then r.value end) as updates,
> max(case when (d.id_name = 'Deletes') then r.value end) as deletes,
> max(case when (d.id_name = 'Inserts') then r.value end) as inserts,
> max(case when (d.id_name = 'Commits') then r.value end) as commits,
> max(case when (d.id_name = 'Rollbacks') then r.value end) as rollbacks,
> max(case when (d.id_name = 'Objects') then r.value end) as objects,
> max(case when (d.id_name = 'ObjectsCreated') then r.value end) as
> objects_created,
> max(case when (d.id_name = 'FlowStalls') then r.value end) as
> flow_stalls,
> max(case when (d.id_name = 'NodeApplyPingTime') then r.value end) as
> node_apply_ping_time,
> max(case when (d.id_name = 'NodePingTime') then r.value end) as
> node_ping_time,
> max(case when (d.id_name = 'ClientCncts') then r.value end) as
> client_connections,
> max(case when (d.id_name = 'YcsbSuccessCount') then r.value end) as
> success_count,
> max(case when (d.id_name = 'YcsbWarnCount') then r.value end) as
> warn_count,
> max(case when (d.id_name = 'YcsbFailCount') then r.value end) as
> fail_count
>
> from test as t
>
> left join test_results as r on r.test_id = t.id
> left join test_variables as v on v.test_id = t.id
> left join metric_def as d on d.id = r.metric_def_id
> left join metadata_key as m on m.id = v.metadata_key_id
>
> group by t.id_name
>
> ;
>
> "GroupAggregate (cost=5.87..225516.43 rows=926 width=81)"
> " -> Nested Loop Left Join (cost=5.87..53781.24 rows=940964 width=81)"
> " -> Nested Loop Left Join (cost=1.65..1619.61 rows=17235
> width=61)"
> " -> Index Scan using test_uc on test t (cost=0.00..90.06
> rows=926 width=36)"
> " -> Hash Right Join (cost=1.65..3.11 rows=19 width=29)"
> " Hash Cond: (m.id = v.metadata_key_id)"
> " -> Seq Scan on metadata_key m (cost=0.00..1.24
> rows=24 width=21)"
> " -> Hash (cost=1.41..1.41 rows=19 width=16)"
> " -> Index Scan using test_variables_test_id_idx
> on test_variables v (cost=0.00..1.41 rows=19 width=16)"
> " Index Cond: (test_id = t.id)"
> " -> Hash Right Join (cost=4.22..6.69 rows=55 width=28)"
> " Hash Cond: (d.id = r.metric_def_id)"
> " -> Seq Scan on metric_def d (cost=0.00..1.71 rows=71
> width=20)"
> " -> Hash (cost=3.53..3.53 rows=55 width=16)"
> " -> Index Scan using test_results_test_id_idx on
> test_results r (cost=0.00..3.53 rows=55 width=16)"
> " Index Cond: (test_id = t.id)"****
>

In response to

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message David Johnston 2012-10-02 01:28:48 Re: [noob] How to optimize this double pivot query?
Previous Message David Johnston 2012-10-02 01:07:26 Re: [noob] How to optimize this double pivot query?