Re: Approach to Data Summary and Analysis

From: Roxanne Reid-Bennett <rox(at)tara-lu(dot)com>
To: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Approach to Data Summary and Analysis
Date: 2014-04-16 15:42:02
Message-ID: 534EA4CA.5070401@tara-lu.com
Lists: pgsql-general pgsql-performance

On 4/15/2014 9:10 PM, Robert DiFalco wrote:
> 1. >500K rows per day into the calls table.
> 2. Very rarely. The only common query is gathering users that have not
> been called "today" (along with some other qualifying criteria). More
> analytical queries/reports are done for internal use and it is not
> essential that they be lickity-split.
> a. Usually just one connection at a time executes read queries.
> b. the users not called today query will be done once a day.
> c. Daily
> d. All users for the last year (if you are asking about retention). We
> will also rarely have to run for all time.
> e. Not that I'm aware of (or seen) today.
> f. For the simple queries we cannot afford latency between calls and
> querying who was already called.
>
> While I don't seem to be getting much support for it here :D my write
> performance (which is most essential) has been much better since I
> further normalized the tables and made it so that NULL is never used
> and data is never updated (i.e. it is immutable once it is written).

Based on the above, you are primarily capturing data and feeding back
essentially one easy-to-find result set [who has NOT been successfully
called] on an ongoing, single-threaded basis [once per day?]. So you are
absolutely correct - tune for write speed.
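
Purely as a sketch of what I mean (I'm guessing at your table and column
names - calls, users, user_id, called_at - so adjust to your actual
schema), an append-only shape plus the one daily query could look like:

    -- hypothetical structure; immutable rows, no NULLs, no UPDATEs
    CREATE TABLE calls (
        call_id    bigserial   PRIMARY KEY,
        user_id    bigint      NOT NULL REFERENCES users(user_id),
        called_at  timestamptz NOT NULL DEFAULT now()
    );
    CREATE INDEX calls_user_called_idx ON calls (user_id, called_at);

    -- "who has NOT been called today", run once a day
    SELECT u.user_id
    FROM users u
    WHERE NOT EXISTS (
        SELECT 1
        FROM calls c
        WHERE c.user_id = u.user_id
          AND c.called_at >= date_trunc('day', now())
    );

One narrow index on the read path should keep the per-insert cost modest
while still making that daily query an index probe per user.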

> The summary table was really a separate point from whether or not
> people liked my schema or not -- I mean whether I de-normalize as
> people are asking or not, there would still be the question of a
> summary table for MAX and COUNT queries or to not have a summary table
> for those. I probably made the original question too open ended.
>
Do you know your answer?
You said: "Occasionally I will want to know things like "
On query frequency you answered "the users not called today query will
be done once a day." - as was (c) [I'm assuming once?] - and (d) appears
to be ad-hoc, and you said your users can deal with latency in the
responses for those.

So finding Min/Max/Count quickly really *doesn't* matter for tuning.

So the only reason I can see to add a summary table is to ... simplify
maintenance [note I did NOT say "development"] and then only IF it
doesn't impact the write speeds beyond an acceptable level. Proper
internal / external documentation can mitigate maintenance nightmares.
If your developer(s) can't figure out how to get the data they need from
the schema - then give them the queries to run. [you are likely better
at tuning those anyway]
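
For example (using the same hypothetical column names as above), a plain
view over the operational table hands them a ready-made answer without
adding anything to the write path:

    -- ad-hoc aggregates straight off the calls table;
    -- nothing extra to maintain on insert
    CREATE VIEW call_stats_per_user AS
    SELECT user_id,
           count(*)       AS total_calls,
           max(called_at) AS last_called_at
    FROM calls
    GROUP BY user_id;

It will scan at query time, but per your answers above that latency is
acceptable for the ad-hoc/reporting cases.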

Last consideration - business consumption of data does change over
time. Disk space is cheap [but getting and keeping speed sometimes
isn't]. You might consider including ongoing partial archival of the
operational data during slow usage (write) periods.
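
A sketch of what I mean (again with made-up names; assumes an archive
table created once, e.g. CREATE TABLE calls_archive (LIKE calls
INCLUDING ALL)):

    -- run from cron during the quiet window; moves rows older than
    -- the window you still report on, in a single statement
    WITH moved AS (
        DELETE FROM calls
        WHERE called_at < now() - interval '1 year'
        RETURNING *
    )
    INSERT INTO calls_archive
    SELECT * FROM moved;

That keeps the hot table small for the daily query while the history
stays queryable for the occasional all-time report.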

Roxanne
