Re: Indexing queries with bit masks

From: Mike Christensen <mike(at)kitchenpc(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Peter Hunsberger <peter(dot)hunsberger(at)gmail(dot)com>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Indexing queries with bit masks
Date: 2010-04-30 21:27:25
Message-ID: q2z7aa638e01004301427u99030385t866aefe4a1efeee3@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Ok I've been blatantly lying, err, purposely simplifying the problem for the
sake of the original email :)

I've read over the responses, and am actually now considering just not using
any index at all. Here's why:

First, this actually isn't the only thing on the WHERE clause. It will only
query for users who are "friends" with you so it can notify them of your
activities. That's done via a weird JOIN on a table that holds all the
friend relationships. So in reality, it will only load maybe a hundred
rows, or maybe a thousand every once in a while if you're way popular. If
I'm not mistaken, it should use the index to narrow it down to the list of
friends, and then use a sequential scan to weed out the ones who subscribe
to that type of notification.

Second, the only thing /ever/ that will do this query is the queue service
whose job it is to process notifications (which are files dropped on the
file system) and email people all day long. This service handles one job at
a time, and could potentially run on its own machine with its own read-only
copy of the database. Thus, even if it was a fairly slow query, it's not
gonna bring down the rest of the site.

Regarding the idea of putting an index on each bit, I thought about this
earlier as well as just kinda cringed. The users table gets updated quite a
bit (last logon, session id, any time they change their profile info,
etc).. Too many indexes is bad. I could just put the data in another table
of course, which lead me to another idea. Have a table called Subscriptions
and have each row hold a user id and a notification type. I could index
both, and join on (Subscriptions.UserId = Users.UserId AND
Subscriptions.Type = 8). This would be pretty dang fast, however updates
are kinda a royal pain. When the user changes which types of subscriptions
they want (via a list of checkboxes), I'd have to figure out which rows to
delete and which new ones to insert. However, I think I have an idea in
mind for a PgSQL function you pass in the bitmask to and then it
"translates" it to conditional deletes and inserts.

A third idea I'm tossing around is just not worry about it. Put the bitmask
in the DB, but not filter on it. Every "friend" would be loaded into the
dataset, but the queue processor would just "skip" rows if they didn't
subscribe to that event. In other words, move the problem down to the
business layer. The drawback is potentially large number of rows are
loaded, serialized, etc into memory that will just be ignored. But of
course the DB is probably a read-only copy and it's not even close to the
bottle neck of the email queue under heavy load, so it's probably a
non-issue. If mailing is slow, I just add more queue services..

I'm exploring all these ideas. I predict using the bitwise AND on the where
clause isn't gonna be the worst design ever, and it's sure easier to
implement than a table of subscriptions. What do you guys think?

Mike

On Fri, Apr 30, 2010 at 9:29 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Peter Hunsberger <peter(dot)hunsberger(at)gmail(dot)com> writes:
> > If all subscriptions are roughly equal in popularity then any single
> > select should give ~ 10% of the data. That would seem to be selective
> > enough that you'd really want an index?
>
> My personal rule of thumb is that 10% is around the threshold where
> indexes stop being very helpful. At that selectivity, you're going
> to be having to read every page of the table anyway, and it's not
> clear that the extra I/O to read the index is going to get repaid in
> CPU savings. (Now if the table+index are fully cached in RAM, the
> threshold's probably a bit higher, but there still is not reason to
> think that an index is going to make for a huge improvement.)
>
> > If so, any answers to the OP's main question; what would be the most
> > efficient way to handle this type of thing?
>
> Well, btree's right out for indexing bit selections. In principle you
> could maybe do something with a GIN index, but I don't think we ship
> any prefab GIN opclasses for this.
>
> [ thinks for a bit ]
>
> The best idea that comes to mind offhand is to not use an integer, but a
> boolean array, such that the queries look like
>
> select ... where subscriptions[4];
>
> This already gives you one big advantage, which is that you're not
> hard-wiring an assumption about how many notification types there can
> ever be. What I would then do is build a separate partial index for
> each subscription column, ie,
>
> create index ... where subscriptions[1];
> create index ... where subscriptions[2];
> .. etc ..
>
> Now this only works as long as the queries are referencing explicit
> constant subscription numbers, else the planner won't be able to
> match the WHERE clause to any of the partial indexes. But if that
> is a reasonable restriction for your app then it seems like it
> should work.
>
> The main disadvantage of this is that you need N indexes, which could
> get a bit expensive if the table is updated heavily. But you don't need
> to bother maintaining indexes corresponding to subscriptions that are
> too popular to be worth indexing, so some of that could be bought back
> by careful index selection.
>
> Another point is that the partial indexes could be created on some other
> column(s) and thereby serve double duty. This depends on the details of
> your typical queries though. Is the subscriptions[] clause usually used
> by itself, or together with additional WHERE conditions?
>
> regards, tom lane
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Alex Hunsaker 2010-05-01 02:02:52 Re: Inheritance efficiency
Previous Message Scott Marlowe 2010-04-30 20:52:39 Re: Native DB replication for PG