Re: How best to represent relationships in a database generically?

From: Ted Byers <r(dot)ted(dot)byers(at)rogers(dot)com>
To: Lincoln Yeoh <lyeoh(at)pop(dot)jaring(dot)my>, pgsql-general(at)postgresql(dot)org
Subject: Re: How best to represent relationships in a database generically?
Date: 2007-07-27 19:46:48
Message-ID: 225786.77509.qm@web88304.mail.re4.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Here are just a few observations.

What you have described is a classic taxonomy problem. The simplest approach is to create a table to contain the basic attributes of the obtjects in the database. Using your example, such attributes may be anatomical (two eyes, two ears, four limbs, one backbone, &c.), ecological (categories of food items, such as plant, animal, fungal, &c., temperature tolerances, &c.), conventional taxonomic data (such as genus, family, class, phylum, &c.). You might even get creative and choose to use additional novel or unconventional kinds of attributes. And you could go so far as to add a dynamic classification scheme based on conventional or novel numeric taxonomic algorithms, so the data itself provides a "natural" classification hierarchy. Additional tables may provide more useful predictive capability, such as a table representing empiric observations (based either on field observation or analysis of gut contents) of what eats what (e.g. on date x animal y was observed
eating z by Fred at location Q). Analysis of such a table could provide bases for inferences such as "animal x eats mammals of the family" or "animal y eats only insects, and all insects that will fit in its mouth". This avoids a need for predefined relationships, and a suite of probable relationships can be constructed directly from the basic data. It also provides support for a measure of predictive capability. For example, if animal x has been observed eating mule deer, you might predict that it may also eat white tail deer where they coexist even if the animal has not been observed eating white tail deer.

I'd suggest avoiding the route involving defining very high dimensional spaces. That gets unweldy very quickly, and you'll find that most of such high dimensional spaces is empty.

The probability of observing grass eating a cow is identically 0. But the probability of observing pitcher plants, sun dews, or venus fly traps eating insects is identically 1, on the condition you make an effort to study carnivorous plants. :-) The number of possible relationships between objects increases as n! where n is the number of objects, but the number of kinds of relationships will be small.

No one can tell you what the right approach is or what a better way to do what you want to do might be, at least from what you have said so far. You haven't said precisely what you want to accomplish. Is it simple information management, to tell your users what they already know? Or perhaps it is a research tool. Or maybe it is simple record keeping, and may rarely if ever be used except for audit purposes.

The only way to provide the kind of answers you want, regarding a comparison of options for managing and using your data, is to do a formal analysis of your problem and how your users may want to use your product. You might want to begin with a problem definition, from the perspective of a client or end user, and then work up a few use case scenarios. And then, from that, you can work up a prototype data model with an entity-relationship diagram (NB: these relationships may or may not be related to the relationships you want to capture from your data - there are different kinds of relationships just as there are different kinds of models - don't you just love the fact natural language is multivocal?). Until this homework is done, the best you'll likely get are a few general observations of the sort I have offered (and I do so from the perspective of an ecologist who happens to develop software). You will want to ensure that any example you offer to illustrate what you
want to do is directly related to the actual problem you are trying to solve. Otherwise the advice you get may be inappropriate. The data model I'd use for a decision support tool for government officials responsible for environmental protection (such as at the EPA in the US) will be quite different from what I'd use for a university researcher, and both will be different from what I'd use for a financial consultant or a human resources office. What you need to do will be as dependant on what real world problem you're trying to solve as it is on what data is available. A tool to support research in the life sciences will probably not want to store relationships among living things explicitly, preferring to infer them from data analysis, but a tool for a government environmental regulator will instead likely want to store them explicitly (e.g. as input to simulation models used to assess dose and risk issues).

Why wait for quantum computers? Don't sell the computers available today short. You can run mesoscale weather models on a modern desktop (the ones that used to run on mainframes only a few years ago, not the highly detailed ones used on mainframes now). I have done ecosystem scale modeling using desktop machines, even those available ten years ago. Yes, I'd love to have more computing power at my disposal, but that doesn't stop me from doing useful stuff now! Modern ecological models are, at present, much more limited by the availability of good data, both for parameterizing models and for validating models, than they are by available computing resources.

HTH

Ted

Lincoln Yeoh <lyeoh(at)pop(dot)jaring(dot)my> wrote:
Hi,

Sorry, this really isn't postgresql specific, but I figure there are
lots of smarter people around here.

Say I have lots of different objects (thousands or even millions?).
Example: cow, grass, tiger, goat, fish, penguin.

BUT I'm not so interested in defining things by linking them to
categories or giving them names, I'm trying to figure out a way to
define things by their relationships with other things, and more
importantly do searches and other processing by those relationships.

So, what would be the best way to store them so that a search for the
relationship like "grass is to cow", will also turn up cow is to
tiger, and goat is to tiger, and fish is to penguin (and penguin is
to bigger fish ;) ), and electricity is to computer. And a search for
cow is to goat, could turn up tiger is to lion, and goat is to cow.

Is the only way to store all the links explicitly? e.g. have a huge
link table storing stuff like obj => cow, subj => grass, type =>
consumes, probability=90% ( => means points/links to). Or even just
have one table (links are objects too).

Or is it possible to somehow put the objects in a multidimensional
space (1000 dimensions?) and keep trying to arrange the objects so
that their relationships/vectors with/from each other are fairly
consistent/reasonable based on "current knowledge"? Trouble is in
some cases the grass eventually eats the cow, so maybe that doesn't
work at all ;).

Or even do both? Maybe use the first as a cache, and the second for
"deeper" stuff ("flash of insight" or "got the punchline" = figure
out better arrangement/ joining of disparate maps).

My worry about the first approach is that the number of links might
go up very much faster as you add more objects. But perhaps this
won't be true in practice. The worry about the second approach is
that it might get "stuck", or run out of dimensions.

Is there a better way to do this? There must be right?

Wait for huge quantum computers and use qubits for each
multidimensional coordinate? ;).

Regards,
Link.

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Richard Broersma Jr 2007-07-27 20:59:27 Re: [GENERAL] PostgreSQL, PGDAY, PGParty and OSCON 2007 Rocked!
Previous Message NetComrade 2007-07-27 18:51:23 Re: Question about Postgres