query structure for selecting row by tags

From: Samuel Gendler <sgendler(at)ideasculptor(dot)com>
To: pgsql-sql(at)postgresql(dot)org
Subject: query structure for selecting row by tags
Date: 2012-08-01 19:40:09
Message-ID: CAEV0TzDNMbukP+QikJ8PH8tmti3OJCG2Crx5wF+W1prQbD65-g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

I need to tag entities in my db with arbitrary strings. That part is
simple enough. The difficulty is that I need to select entities based on
whether they match an expression built via boolean combinations of tags:

(tag1 || tag2 || tag3) && tag4

I can roll my own tagging mechanism, in which case it'd probably look like
this:

create table entity (
id bigserial not null primary key,
...
)

create table entity_tags (
entity_fk bigint not null,
tag varchar
)

Or I can use a generic tagging plugin in my webapp framework (grails,
taggable plugin, hibernate under the hood), in which case the table
structure looks like this:

create table entity (
id bigserial not null primary key,
....
)

create table tag (
id bigserial,
tag varchar
)

create table taglink (
entity_ref bigint,
entity_type varchar, -- stores the type of entity since this table is
used for all entities in the system
tag_ref bigint
)

In the grails-native version, there is no explicit foreign key to a single
entity table, since entity_ref is used as a foreign key into as many entity
tables as there are taggable entities, but within any given query, its role
is to act as a foreign key to the table specified in entity_type. tag_ref
is always a foreign key to the tags table, which does eliminate the
replication of the tag string in every row, but given that most tag names
are likely to be pretty short, it probably doesn't save much.

I don't see a very strong reason to support one version over the other, as
the problem of constructing a query that can resolve entity_ids based on
the expression given at the top of this message seems to have similar
difficulty in either context. My question is, is there a sneaky way to
turn my expression into a query other than the following:

For every tag element in the expression, join to the tag table one time
(via taglink, if necessary), and then express a where condition that looks
similar to the expression. If each tag table join is aliased as t1, t2,
t3, ..., the where clause would look like this:

select distinct e.id
from entity e
join entity_tags t1 on t1.entity_fk = e.id
join entity_tags t2 on t2.entity_fk = e.id
join entity_tags t3 on t3.entity_fk = e.id
join entity_tags t4 on t4.entity_fk = e.id
where ((t1.tag = 'tag1' or t2.tag = 'tag2' or t3.tag = 'tag3') and t4.tag =
'tag4') and [rest of where clause goes here]

This has the downside of creating a huge cross multiple of the entity_tags
table with itself, though I could keep it from getting out of hand by
limiting the join to only those tags that are used in the expression -

join entity_tags t1 on t1.entity_fk = e.id and t1.tag in ('tag1', 'tag2',
'tag3', tag4')
join entity_tags t2 on t2.entity_fk = e.id and t2.tag in ('tag1', 'tag2',
'tag3', tag4')
join entity_tags t3 on t3.entity_fk = e.id and t3.tag in ('tag1', 'tag2',
'tag3', tag4')
join entity_tags t4 on t4.entity_fk = e.id and t4.tag in ('tag1', 'tag2',
'tag3', tag4')

There is no reason for a tag expression to ever include more than 10 tag
elements, so I'd have no trouble limiting the number of tags allowed in an
expression to a low number like that in my expression parser.

But maybe there is a better mechanism, one which aggregates all tags into
an array and then builds a where clause that looks for values in a single
array column? In my app, there will be a far higher number of entities
(milions) than tags (hundreds), with many tags shared amongst many
entities, but no entity having more than maybe 5-10 tags (and no entity
having less than 4 tags). An array mechanism would look like this:

select e.id
from entity e
join entity_tags t on t.entity_fk = e.id and t1.tag in ('tag1', 'tag2',
'tag3', 'tag4')
group by e.id
having ((array_agg(t.tag) && ARRAY['tag1','tag2','tag3']::varchar[]) and
array_agg(t.tag) @> ARRAY['tag4'])

though simplifying the having clause that way would be difficult, so it
would probably be more like this:
having ((array_agg(t.tag) @> ARRAY['tag1']::varchar[] or
array_agg(t.tag) @> ARRAY['tag2']::varchar[] or
array_agg(t.tag) @> ARRAY['tag3']::varchar[]) and
array_agg(t.tag) @> ARRAY['tag4']::varchar[])

Hopefully, the query planer would simplify that?

Additionally, there are actually 4 columns of the entity that I want to
treat as tags in the context of a query that selects based on tag
expression. My intent was to just create a tag that contains the same
value as each of the 4 columns and deal with the maintenance of ensuring
that whenever the columns are updated, the tags assigned to the entity
reflect the change. This spares me from having to write a where condition
that checks tag column and 4 separate entity columns for each expression
element. However, maybe it is possible to roll the 4 columns up into the
array of tag values somehow, eliminating the requirement of storing the
string values both in the tags table and in the columns of the entity
table. That would be a big normalization/maintenance win.

I'm open to using hstore or array as a column type in the entity table for
tags, but not for the 4 properties of the entity object that are separate
from tags, because doing so would be really difficult to implement in
GORM/Hibernate. I don't need hibernate for dealing with tag expression
queries (though it'd be nice if I could). I suspect I could just build my
array as array_agg(t.tags) || e.col1 || e.col2 || e.col3

Basically, I'm looking for feedback as to what would be the best strategy
for solving this particular problem. I'm about 99% certain that both
strategies outlined above would function correctly, but I don't have a
sense of which is best (probably the array solution) or if there is an even
better solution that I haven't considered.

Thanks

Browse pgsql-sql by date

  From Date Subject
Next Message Wayne Cuddy 2012-08-02 23:10:43 can this be done with a check expression?
Previous Message David Johnston 2012-07-28 02:31:46 Re: join against a function-result fails