Re: Approach to Data Summary and Analysis

From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Robert DiFalco <robert(dot)difalco(at)gmail(dot)com>, Roxanne Reid-Bennett <rox(at)tara-lu(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Approach to Data Summary and Analysis
Date: 2014-04-16 01:27:25
Message-ID: 534DDC7D.5050702@archidevsys.co.nz
Lists: pgsql-general pgsql-performance

On 16/04/14 13:10, Robert DiFalco wrote:
> 1. >500K rows per day into the calls table.
> 2. Very rarely. The only common query is gathering users that have not
> been called "today" (along with some other qualifying criteria). More
> analytical queries/reports are done for internal use, and it is not
> essential that they be lickety-split.
> a. Usually just one connection at a time executes read queries.
> b. The "users not called today" query will be run once a day.
> c. Daily.
> d. All users for the last year (if you are asking about retention). We
> will also, rarely, have to run it over all time.
> e. Not that I'm aware of (or have seen) today.
> f. For the simple queries we cannot afford latency between calls and
> querying who was already called.
>
> While I don't seem to be getting much support for it here :D, my write
> performance (which is most essential) has been much better since I
> further normalized the tables and made it so that NULL is never used
> and data is never updated (i.e. it is immutable once it is written).
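>
> For concreteness, the shape I am describing is roughly this (table and
> column names simplified; not my actual DDL):
>
>     CREATE TABLE calls (
>         call_id    bigserial   PRIMARY KEY,
>         user_id    bigint      NOT NULL REFERENCES users (user_id),
>         created_at timestamptz NOT NULL DEFAULT now()
>     );
>
>     -- one narrow table per call state; rows are inserted once and
>     -- never updated, so no column is ever NULL
>     CREATE TABLE calls_answered (
>         call_id     bigint      PRIMARY KEY REFERENCES calls (call_id),
>         answered_at timestamptz NOT NULL DEFAULT now()
>     );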
>
> As for wanting to avoid NULLs, I don't really know what to say.
> Obviously NULLs are sometimes required. For this design I don't
> really need them, and they make the data harder to reason about
> (because they are somewhat open to interpretation). They can also give
> you results you don't expect (for example, when looking for a
> non-matching key, you start having to inject OR IS NULL clauses and
> such). The absence of NULLs can also make a lot of queries more
> efficient. That said, I understand where you are all coming from
> with de-normalization. It's definitely the path of least
> resistance. Our instinct is to want to see all related data in a
> single table when possible.
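>
> For instance, the "users not called today" check can be written as an
> anti-join that never has to reason about NULLs at all (a sketch, using
> the illustrative tables above):
>
>     SELECT u.user_id
>       FROM users u
>      WHERE NOT EXISTS (SELECT 1
>                          FROM calls c
>                         WHERE c.user_id = u.user_id
>                           AND c.created_at >= date_trunc('day', now()));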
>
> The summary table was really a separate point from whether or not
> people liked my schema -- whether I de-normalize as people are asking
> or not, there would still be the question of whether to have a summary
> table for MAX and COUNT queries. I probably made the original question
> too open-ended.
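>
> For reference, the trigger-maintained variant might look roughly like
> this (a sketch only -- it ignores the race of two sessions inserting a
> user's first call at the same time):
>
>     CREATE TABLE call_summary (
>         user_id          bigint PRIMARY KEY REFERENCES users (user_id),
>         call_count       bigint NOT NULL DEFAULT 0,
>         last_answered_at timestamptz  -- NULL until the first answer
>     );
>
>     CREATE OR REPLACE FUNCTION bump_call_summary() RETURNS trigger AS $$
>     BEGIN
>         UPDATE call_summary
>            SET call_count = call_count + 1
>          WHERE user_id = NEW.user_id;
>         IF NOT FOUND THEN
>             INSERT INTO call_summary (user_id, call_count)
>             VALUES (NEW.user_id, 1);
>         END IF;
>         RETURN NEW;
>     END;
>     $$ LANGUAGE plpgsql;
>
>     CREATE TRIGGER calls_summary_trg
>         AFTER INSERT ON calls
>         FOR EACH ROW EXECUTE PROCEDURE bump_call_summary();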
>
>
> On Tue, Apr 15, 2014 at 3:26 PM, Roxanne Reid-Bennett <rox(at)tara-lu(dot)com
> <mailto:rox(at)tara-lu(dot)com>> wrote:
>
> On 4/14/2014 12:27 PM, Robert DiFalco wrote:
>
> And so on for calls_connected, calls_completed, call_errors, etc.
>
> Occasionally I will want to know things like "When was the
> last time a user answered a call" or "How many times has a
> user been called".
> ...
>
> Sometimes I might want to get this data for a whole bunch of
> users.
> ...
>
> So the other option is to create a call_summary table that is
> updated with triggers.
> ...
>
> My only issue with a summary table is that I don't want a
> bunch of null fields.
> ...
>
> But over time all fields would eventually become non-NULL.
>
> So that leads me to a summary table for EACH call state. This
> approach has the down side that it creates a lot of tables and
> triggers. It has the upside of being pretty efficient without
> having to deal with NULL values. It's also pretty easy to
> reason about.
> ...
>
> So for my question -- is the choice between these a personal
> preference sort of thing or is there a right or wrong
> approach? Am I missing another approach that would be better?
> I'm okay with SQL but I'm not expert so I'm not sure if there
> is an accepted DESIGN PATTERN for this that I am missing.
>
> There is no right or wrong - there is better, worse, best, and
> worst for any specific scenario. In my experience, most people
> have time/money to get to an 80% "better" design than all the
> other answers during design and then it gets refined over time.
> And yes, personal experience does play a part in how people
> interpret better/worse [aka religion] ;)
>
> I didn't see anybody ask these questions - and to identify
> "better" - they have to be asked.
> 1. How much data are you feeding into your system, and how fast?
> this directly affects your choices on distribution,
> parallel processing... writes vs updates vs triggers for copying
> vs all reads
> [and if on bare metal - potentially where you place your
> logs, indexes, core lookup tables, etc]
> 2. How much data are you reading out of your system - how fast?
> you have given "simple" use cases (how many calls completed
> within a time frame or to a number)
> you have given very slightly more complex use cases (when
> was the last time John answered a call)
> you have given a slightly more bulky processing question of
> (how many times have these users been called)
> So...
> a) How many users executing read queries do you have?
> b) What is the expected load for simple queries (per
> week/day/hour/minute - depending upon your resolution on speed)?
> c) What is the expected load for your mid-line complex queries?
> d) What is the "maximum" volume you expect a bulk query to
> go after (like all users in the last 18 years, or this city's
> users in the last day?), and how frequently will that kind of
> query be executed? How much tolerance for delay do your users have?
> e) Do you have any known really complex queries that might
> bog the system down?
> f) How much lag time can you afford between capture and
> reporting?
>
> Answers to the above define your performance requirements - which
> in turn define the style of schema you need. Queries can be written to
> pull data from any schema design - but how fast they can perform
> or how easily they can be created...
>
> Chris and Vincent both targeted a balance between writes and reads
> - which adequately answers 80-85% of the usages out there. But
> you didn't give us any of the above - so their recommendation
> (while very likely valid) may not actually fit your case at all.
>
> As to design patterns -
> "Generally" a database schema is more normalized for an
> operational system because normalization results in fewer
> writes/updates and lowers the risk of corruption if a failure
> takes place. It also isolates updates for any specific value to
> one location, minimizing internally caused data corruption.
> Reporting systems are generally less normalized because writes are
> mostly one-time and reads are where the load occurs.
> Sometimes you have to use data replication to have a system that
> appropriately supports both.
>
> you have shown you are already aware of normalization.
> If you weren't aware of approaches to Data Warehousing... you can
> review information about how it is accomplished
> - see the blogs on kimballgroup DOT com; they cover a lot of
> high(er)-level concepts with enough specificity to be of some
> direct use.
> [that website also covers some ideas for "Big Data" which aren't
> necessarily limited to RDBMSs]
>
> Specify your performance requirements, then figure out your schema
> design.
>
> FWIW I don't understand your (or any other person's) hesitancy about
> "lots of" NULL values. They provide meaning in a number of
> different ways... not the least of which is that you don't know
> (yet) - which is knowledge in and of itself.
>
> Roxanne
>
Have you considered partial indexes, using the WHERE predicate option
of CREATE INDEX?

This can be useful if you often look for rows that form only a small
subset of the keys. For example, a partial index on sex would be useful
for a table of nurses, indexing only those who are male, as they are in
a very small minority.
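
Something like this, for example (illustrative only; the table and
column names are made up):

    -- index only the rare rows
    CREATE INDEX nurses_male_idx
        ON nurses (nurse_id)
        WHERE sex = 'M';

Queries that filter on sex = 'M' can then use an index that is a small
fraction of the size of a full one.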

Cheers,
Gavin
