Re: My honours project - databases using dynamically attached entity-properties

From: David Fetter <david(at)fetter(dot)org>
To: Edward Stanley <Edward(dot)Stanley(at)mcs(dot)vuw(dot)ac(dot)nz>
Cc: Richard Huxton <dev(at)archonet(dot)com>, Sean Utt <sean(at)strateja(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Andrew Hammond <andrew(dot)george(dot)hammond(at)gmail(dot)com>, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: My honours project - databases using dynamically attached entity-properties
Date: 2007-03-14 14:24:43
Message-ID: 20070314142443.GA8454@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Mar 14, 2007 at 12:07:45PM +1300, Edward Stanley wrote:
> On Wed, 14 Mar 2007, David Fetter wrote:
> > On Tue, Mar 13, 2007 at 05:54:34PM +0000, Richard Huxton wrote:
> > > David Fetter wrote:
> > > >On Tue, Mar 13, 2007 at 02:21:37PM +0000, Richard Huxton wrote:
> > > >>David Fetter wrote:
> > > >>>On Tue, Mar 13, 2007 at 09:31:45AM +0000, Richard Huxton wrote:
> > > >>>>* Another good example is the "questionnaire".
> > > >>>
> > > >>>With all due respect, this is a solved problem *without EAV or
> > > >>>run-time DDL*. The URL below has one excellent approach to this.
> > > >>>
> > > >>><http://www.varlena.com/GeneralBits/110.php>
> > > >>
> > > >>Which broadly speaking was the solution I used for my
> > > >>questionnaire, except I had a restricted set of types so basically
> > > >>just coerced them to text and side-stepped the inheritance issue.
> > > >>To the extent that it's dynamic, it's still just EAV though.
> > > >
> > > >That's precisely the difference between the above solution and
> > > >yours, and it's the difference between a good design and one that
> > > >will come up and bit you on the as^Hnkle.
> > >
> > > It's still basically EAV (either approach). The key fault with EAV
> > > is that the tables have no semantic meaning - answer_int contains
> > > number of oranges, days since birth and the price of a tube ticket
> > > in pennies.
> >
> > Stuffing all of those into an answer_int is *precisely* what the end
> > user must not do. That's pilot error.
> >
> > > Now, with a questionnaire that might not matter because everything
> > > is an "answer" and you're not necessarily going to do much more than
> > > count/aggregate it.
> >
> > See above.
> >
> > > >>It doesn't remove the need for run-time DDL if you allow users to
> > > >>add their own questions.
> > > >
> > > >Sure it does. When a user, who should be talking with you, wants
> > > >to ask a new kind of question, that's the start of a discussion
> > > >about what new kind(s) of questions would be generally applicable
> > > >in the questionnaire schema. Then, when you come to an agreement,
> > > >you roll it into the new schema, and the whole system gets an
> > > >improvement.
> > >
> > > Fine, but if you're not letting the user extend the system, then
> > > it's not really addressing Edward's original posting, is it?
> >
> > It's my contention that Edward's original idea is ill-posed. SQL is
> > just fine for doing this kind of thing, and it's *not that hard*.
> >
> > > If the user's talking to me, I might as well just write the DDL
> > > myself - it's the talk that'll take the time, not writing a dozen
> > > lines of SQL.
> >
> > It's the talk that's the important part. Machines are really bad at
> > seeing the broader picture. In the attempt to "save" a few minutes'
> > discussion, he's trying to borrow that time from a system asked to do
> > things that computers are inherently bad at doing, and every end user
> > will pay that time back at a very high rate of interest. This is
> > precisely the kind of false economy that so plagues software
> > development and maintenance these days.
> >
> > > The interesting part of the problem (from a Comp-Sci point of view)
> > > is precisely in automating part of that discussion. It's providing
> > > an abstraction so that you don't end up with a mass of attributes
> > > while still providing freedom to the user.
> >
> > This freedom and efficiency you're talking about is better supplied,
> > IMHO, by putting a standard DDL for questionnaires up on a pgfoundry
> > or an SF.net. That way, improvements to the DDL get spread all over
> > the world, and a very large amount of wheel reinvention gets avoided.
> > Reusable components are a big chunk of both freedom and efficiency. :)
> >
> > Cheers,
> > D
>
> Maybe I should rethink the problem a bit - from the very brief
> initial research I've done, it seems EAV schemas have two common
> uses:

> 1) When new attributes have to be created on-the-fly
> 2) When the number of possible properties for an entity greatly (orders of
> magnitude) exceeds the number of properties any one entity is likely to have.

Um, no. The first use case is bad coding practice, and the second is
a classic case for a join table, which is the standard way to handle
M:N relationships.

> I'm not sure about solving the first problem - there seems to be a lot of
> debate around this. I can see reasons for and against allowing this. However
> I think the second is a very real problem. One such example is a patient
> record system.
>
> For each patient we have a table of common data (dob, sex, height, weight etc)
> but as well as this a patient can present with many symptoms. This might be a
> table of 40,000 possible symptoms.

Here's how I'd do that:

CREATE TABLE patient (
patient_id SERIAL PRIMARY KEY, /* for simplicity. Some
combination of columns in the
table would also have a UNIQUE
NOT NULL constraint on it.
*/
...
);

CREATE TABLE symptom (
symptom_id SERIAL PRIMARY KEY, /* See above. */
...
);

CREATE TABLE patient_presents_with (
patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
symptom_id INTEGER NOT NULL REFERENCES symptom(symptom_id),
UNIQUE(patient_id, symptom_id)
);

> Lets say we want to run a query on these symptoms (using a boolean expression)

I'd use something like the following:

SELECT
p.patient_id,
p.f_name,
p.l_name,
s.symptom_name,
s.symptom_desc
FROM
patient p
JOIN
patient_presents_with ppw
USING (patient_id)
JOIN
symptom s
USING (symptom_id)
WHERE
s.symptom_name = ALL('foo','bar','baz')
AND
s.symptom_name = ANY('quux','fleeg');

> to return the patient records which match the query string on the symptoms.
>
> (This turns out to be a very similar problem to the 'tags' example I first
> presented) - assume a similar schema. With more than a couple of symptoms and
> a complex tree, the resulting SQL can span pages.

Not really. See above :)

> When I first started thinking about this project I believed the two problems
> essentially to be the same class of problem, but this may not be the case.

EAV will bite you. It's not *that* much work to keep its from biting
you. :)

Cheers,
D
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
phone: +1 415 235 3778 AIM: dfetter666
Skype: davidfetter

Remember to vote!
Consider donating to PostgreSQL: http://www.postgresql.org/about/donate

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Gregory Stark 2007-03-14 14:28:03 Re: My honours project - databases using dynamically attached entity-properties
Previous Message Teodor Sigaev 2007-03-14 14:07:34 Re: need help in understanding gist function