From: | Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> |
---|---|
To: | Kynn Jones <kynnjo(at)gmail(dot)com> |
Cc: | pgsql-performance(at)postgresql(dot)org |
Subject: | Re: The "many nulls" problem |
Date: | 2008-03-14 18:59:05 |
Message-ID: | Pine.LNX.4.64.0803142158140.27666@sn.sai.msu.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
Kynn,
have you seen contrib/hstore ? You can have one table with common attributes
and hide others in hstore
Oleg
On Fri, 14 Mar 2008, Kynn Jones wrote:
> It often happens that a particular pieces of information is non-null for a
> small minority of cases. A superficially different manifestation of this is
> when two pieces of information are identical in all but a small minority of
> cases. This can be easily mapped to the previous description by defining a
> null in one column to mean that its contents should be obtained from those
> of another column. A further variant of this is when one piece of
> information is a simple function of another one in all but a small minority
> of cases.
>
> (BTW, I vaguely recall that RDb theorists have a technical term for this
> particular design issue, but I don't remember it.)
>
> In all these cases, the design choice, at least according to RDb's 101, is
> between including a column in the table that will be NULL most of the time,
> or defining a second auxiliary column that references the first one and
> holds the non-redundant information for the minority of cases for which this
> is necessary (and maybe define a VIEW that includes all the columns).
>
> But for me it is a frequent occurrence that my quaint and simple RDb's 101
> reasoning doesn't really apply for PostgreSQL. Basically, Pg is too smart
> for it! For example, does a large proportion of NULLs really imply a lot of
> wasted space? Maybe this is true for fixed-length data types, but what
> about for type TEXT or VARCHAR?
>
> Just to be concrete, consider the case of a customers database for some home
> shopping website. Suppose that, as it happens, for the majority of this
> site's customers, the shipping and billing addresses are identical. Or
> consider the scenario of a company in which, for most employees, the email
> address can be readily computed from the first and last name using the rule
> First M. Last => first_last(at)acme(dot)com, but the company allows some
> flexibility for special cases (e.g. for people like Yasuhiro Tanaka who's
> known to everyone by his nickname, Yaz, the email is
> yaz_tanaka(at)acme(dot)combecause hardly anyone remembers or even knows his
> full name.)
>
> What's your schema design approach for such situations? How would you go
> about deciding whether the number of exceptional cases is small enough to
> warrant a second table? Of course, one could do a systematic profiling of
> various possible scenarios, but as a first approximation what's your
> rule-of-thumb?
>
> TIA!
>
> Kynn
>
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru)
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
From | Date | Subject | |
---|---|---|---|
Next Message | Pascal Cohen | 2008-03-14 19:24:25 | Re: Hardware question for a DB server |
Previous Message | Kynn Jones | 2008-03-14 18:05:50 | The "many nulls" problem |