Re: The "many nulls" problem

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Kynn Jones <kynnjo(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: The "many nulls" problem
Date: 2008-03-14 18:59:05
Message-ID: Pine.LNX.4.64.0803142158140.27666@sn.sai.msu.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Kynn,

have you seen contrib/hstore ? You can have one table with common attributes
and hide others in hstore

Oleg
On Fri, 14 Mar 2008, Kynn Jones wrote:

> It often happens that a particular pieces of information is non-null for a
> small minority of cases. A superficially different manifestation of this is
> when two pieces of information are identical in all but a small minority of
> cases. This can be easily mapped to the previous description by defining a
> null in one column to mean that its contents should be obtained from those
> of another column. A further variant of this is when one piece of
> information is a simple function of another one in all but a small minority
> of cases.
>
> (BTW, I vaguely recall that RDb theorists have a technical term for this
> particular design issue, but I don't remember it.)
>
> In all these cases, the design choice, at least according to RDb's 101, is
> between including a column in the table that will be NULL most of the time,
> or defining a second auxiliary column that references the first one and
> holds the non-redundant information for the minority of cases for which this
> is necessary (and maybe define a VIEW that includes all the columns).
>
> But for me it is a frequent occurrence that my quaint and simple RDb's 101
> reasoning doesn't really apply for PostgreSQL. Basically, Pg is too smart
> for it! For example, does a large proportion of NULLs really imply a lot of
> wasted space? Maybe this is true for fixed-length data types, but what
> about for type TEXT or VARCHAR?
>
> Just to be concrete, consider the case of a customers database for some home
> shopping website. Suppose that, as it happens, for the majority of this
> site's customers, the shipping and billing addresses are identical. Or
> consider the scenario of a company in which, for most employees, the email
> address can be readily computed from the first and last name using the rule
> First M. Last => first_last(at)acme(dot)com, but the company allows some
> flexibility for special cases (e.g. for people like Yasuhiro Tanaka who's
> known to everyone by his nickname, Yaz, the email is
> yaz_tanaka(at)acme(dot)combecause hardly anyone remembers or even knows his
> full name.)
>
> What's your schema design approach for such situations? How would you go
> about deciding whether the number of exceptional cases is small enough to
> warrant a second table? Of course, one could do a systematic profiling of
> various possible scenarios, but as a first approximation what's your
> rule-of-thumb?
>
> TIA!
>
> Kynn
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Pascal Cohen 2008-03-14 19:24:25 Re: Hardware question for a DB server
Previous Message Kynn Jones 2008-03-14 18:05:50 The "many nulls" problem