Re: Approach to Data Summary and Analysis

From: Robert DiFalco <robert(dot)difalco(at)gmail(dot)com>
To: Roxanne Reid-Bennett <rox(at)tara-lu(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Approach to Data Summary and Analysis
Date: 2014-04-16 18:40:06
Message-ID: CAAXGW-w+T890fuf+J3O8h4DyXf6UZnSdVv0XXuWPd_eW3jO=4A@mail.gmail.com
Lists: pgsql-general pgsql-performance

Thanks Roxanne, I suppose when it comes down to it -- for the current use
cases and data size -- my only concern is the "calling" query that needs to
determine whether a user has already had a call today. For a large user
set, I would either have to take the MAX of each user's answered timestamps
and compare it against today, or do an EXISTS query to see whether any
timestamp for that user is greater than or equal to "today".
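As a rough sketch of the two query shapes being compared (the table and
column names -- `users`, `calls`, `answered_at` -- are assumptions; the
actual schema isn't shown in this message):

```sql
-- Users not yet called today, via NOT EXISTS: the planner can stop
-- at the first matching call per user.
SELECT u.id
FROM   users u
WHERE  NOT EXISTS (
         SELECT 1
         FROM   calls c
         WHERE  c.user_id     = u.id
         AND    c.answered_at >= date_trunc('day', now())
       );

-- Equivalent MAX formulation: aggregates over every call a user has
-- ever answered, which is the "# of users X 1000" cost described below.
SELECT u.id
FROM   users u
LEFT JOIN calls c ON c.user_id = u.id
GROUP  BY u.id
HAVING MAX(c.answered_at) < date_trunc('day', now())
    OR MAX(c.answered_at) IS NULL;
```

With an index on `(user_id, answered_at)`, the NOT EXISTS form typically
only has to probe the newest call per user, which is why it tends to win
as call history grows.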

But I suppose I just need to construct a huge dataset and see. I was
thinking that by keeping a summary -- so I always know the last answer or
call time for each user -- I could keep this from becoming an issue. Over
time a single user could have answered calls thousands of times, so the
summary would make a "<=" timestamp query scan just # of users rows
instead of # of users X 1000 (or however many calls they have answered
over the non-archived time period).
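The summary-table idea might be sketched like this (again with assumed
names; note that maintaining it via upsert trades away the insert-only
property mentioned below, and `ON CONFLICT` itself did not land until
PostgreSQL 9.5 -- in 2014 a trigger or writable CTE would have been the
equivalent):

```sql
-- Hypothetical one-row-per-user summary, maintained at write time.
CREATE TABLE user_call_summary (
    user_id        bigint PRIMARY KEY,
    last_answer_at timestamptz NOT NULL
);

-- On each answered call, keep the newest timestamp per user.
INSERT INTO user_call_summary (user_id, last_answer_at)
VALUES ($1, $2)
ON CONFLICT (user_id) DO UPDATE
SET last_answer_at = GREATEST(user_call_summary.last_answer_at,
                              EXCLUDED.last_answer_at);

-- "Who has not been called today" then scans one row per user
-- instead of the full call history.
SELECT user_id
FROM   user_call_summary
WHERE  last_answer_at < date_trunc('day', now());
```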

On Wed, Apr 16, 2014 at 8:42 AM, Roxanne Reid-Bennett <rox(at)tara-lu(dot)com> wrote:

> On 4/15/2014 9:10 PM, Robert DiFalco wrote:
>
>> 1. >500K rows per day into the calls table.
>> 2. Very rarely. The only common query is gathering users that have not
>> been called "today" (along with some other qualifying criteria). More
>> analytical queries/reports are done for internal use and it is not
>> essential that they be lickity-split.
>> a. Usually just one connection at a time executes read queries.
>> b. the users not called today query will be done once a day.
>> c. Daily
>> d. All users for the last year (if you are asking about retention). We
>> will also rarely have to run for all time.
>> e. Not that I'm aware of (or seen) today.
>> f. For the simple queries we cannot afford latency between calls and
>> querying who was already called.
>>
>> While I don't seem to be getting much support for it here :D my write
>> performance (which is most essential) has been much better since I further
>> normalized the tables and made it so that NULL is never used and data is
>> never updated (i.e. it is immutable once it is written).
>>
>
> Based on the above you are primarily capturing data and feeding back
> essentially one easy to find result set [who has NOT been successfully
> called] on an ongoing single threaded basis [once per day?]. So you are
> absolutely correct - tune for writing speed.
>
>
> The summary table was really a separate point from whether or not people
>> liked my schema or not -- I mean whether I de-normalize as people are
>> asking or not, there would still be the question of a summary table for MAX
>> and COUNT queries or to not have a summary table for those. I probably made
>> the original question too open ended.
>>
> Do you know your answer?
> you said : "Occasionally I will want to know things like "
> you answered to frequency on queries as "the users not called today query
> will be done once a day." as was c) [I'm assuming once?]
> and d) appears to be "ad-hoc" and you said your users can deal with
> latency in response for those.
>
> So finding Min/Max/Count quickly really *doesn't* matter for tuning.
>
> So the only reason I can see to add a summary table is to ... simplify
> maintenance [note I did NOT say "development"] and then only IF it doesn't
> impact the write speeds beyond an acceptable level. Proper internal /
> external documentation can mitigate maintenance nightmares. If your
> developer(s) can't figure out how to get the data they need from the schema
> - then give them the queries to run. [you are likely better at tuning those
> anyway]
>
> Last consideration - business consumption of data does change over time.
> Disk space is cheap [but getting and keeping speed sometimes isn't]. You
> might consider including ongoing partial archival of the operational data
> during slow usage (write) periods.
>
>
> Roxanne
>
>
