Re: how to optimize my c-extension functions

From: TJ O'Donnell <tjo(at)acm(dot)org>
To: pgsql-general(at)postgresql(dot)org
Cc: Pierre-Frédéric Caillaud <lists(at)boutiquenumerique(dot)com>
Subject: Re: how to optimize my c-extension functions
Date: 2005-01-10 01:26:16
Message-ID: 41E1D9B8.3030306@acm.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Let me first say that I will routinely be dealing with
one million+ rows, so I want to take care to optimize
my work as much as possible, and to consider carefully
my design decisions.

The only type of search will be of the type:

Select smiles,id from structure where oe_matches(smiles,'c1ccccc1C(=O)N');

or joins with other tables e.g.

Select id,smiles,clogp from structure,properties where
oe_matches(smiles,'c1ccccc1C(=O)N') and id = prop_id;
with id being a sequence (hence unique) and foreign
key prop_id column of properties.

There may be other useful functions of smiles, e.g.
int oe_count_matches(smiles,'CCC'),
and these would also prefer to use the pre-parsed smiles
c++ object.

After I parse the smiles,
the character string smiles is really of no use anymore.
It might be output, for example to an external program such as
smiles_to_jpeg which re-parses the smiles and need not be fast.
So, there is no real use for indexing smiles. So I want to
borrow the internal tables normally used for indexing to store
my parsed smiles and use the parsed smiles in oe_matches and other
functions.
If I do this, maybe I would have to use operators (=,<,>,LIKE?)
to do the matching. A c-function is simply called with data and
would have no access to indexes, correct?

TJ

Pierre-Frédéric Caillaud wrote:
>
> Well, first and easy thing you can do is create a column to store
> the parsed representation and update it via a trigger when the
> original, unparsed column is updated or inserted.
> Is this sufficiently "hidden from the user" for you ? I know it's
> not really hidden, but the fact that updating is automatic could be
> good enough.
> You could also store this column in another table and join with the
> main table.
>
> What are the kinds of searches you do more often ? Can you give a
> few examples ?
>
>
>> Yes, my c function and it's sql counterpart, oe_matches(smiles)
>> uses two steps (1) parse smiles (2) search parsed smiles.
>> Parsing is expensive. The smiles has an external string representation,
>> which is stored in a smiles column, but only the parsed form is
>> actually searchable.
>> The smiles representation is never searched in a traditional string
>> manner, except perhaps for a direct lookup (string equals).
>> LIKE has no meaning for smiles strings, similarly < or > are
>> meaningless.
>>
>> Smiles is parsed into atom and bond representations using
>> 3rd party c++ code/methods. I simply call their methods
>> to parse and search. A binary string can be got from them
>> for persistent storage in a postgres column. It can then be
>> restored into a c++ object for searching, thus avoiding the
>> parsing stage, except for the initial creation of a row with
>> a smiles column.
>>
>> My goal here is to optimize the search by storing the parsed smiles,
>> YET HIDE THIS FROM THE USER. I thought I might be able to store
>> the parsed smiles in an index for me to use while searching, even
>> though it would not be used for indexing in the traditional manner.
>> This would mean creating a new indexing method. I've read up on this
>> and it seems a daunting task. Am I perverting the index method if
>> I try to do this?
>>
>> So, aside from having the user be responsible for maintaining a
>> column of parsed smiles (and possibly other related columns which
>> would speed up the search), is there a way I can create and maintain
>> a table related to the table containing the smiles - and all
>> behind the scenes so the sql user is unaware of this.
>> My thought was that an index is like that and I might borrow some
>> of the internal uses of indexing for my purposes.
>>
>> TJ O'Donnell
>> tjo(at)acm(dot)org
>>
>> Pierre-Frédéric Caillaud wrote:
>>
>>> I gather your program uses two steps, let's call them :
>>> - parse( smiles ) -> data
>>> - search( data ) -> result
>>> You can create a functional index on your smiles column, but I
>>> don't know if this will help you ; you can do things like CREATE
>>> INDEX ... ON mytable( lower( myfield )), then SELECT ... FROM
>>> mytable WHERE lower(myfield) = something, BUT in your case I gather
>>> your search function which processes the parsed data does a lot
>>> more than just dome character match, so creating a functional index
>>> on parse(smile) would be useless for selecting on
>>> search(parse(smile))...
>>> So, in any case, if the parsing phase is slow, you can store
>>> the preparsed data in a text or binary field and search it
>>> directly, but this will not be indexed.
>>> If you can map a subset of your searchable properties to
>>> simple datatypes, you could do a first search for related matches,
>>> as you said.
>>> You say nothing about how your system works internally, whta
>>> kind of representation is it and what kind of searches do you
>>> actually do ?
>>> On Sat, 08 Jan 2005 15:50:06 -0800, TJ O'Donnell <tjo(at)acm(dot)org>
>>> wrote:
>>>
>>>> I've written a c-language extension to postgresql to implement a
>>>> chemical search of a varchar column (named smiles, typically).
>>>> It might be called as:
>>>> oe_matches(smiles,'COCC') where 'COCC' is a typical search string.
>>>> This uses 3rd party functions to parse the varchar inputs into c++
>>>> objects. In order to optimize this search, I'd like to parse the
>>>> whole table containing smiles just once, store it and use the
>>>> parsed data instead of the varchar smiles.
>>>>
>>>> So, I could create another column, say smiles_ob and have the
>>>> oe_matches function take that column. I'd have to be sure the
>>>> smiles_ob column was related (by a trigger?) to the smiles column.
>>>> But I was thinking I might be able to hide these parsed objects
>>>> from the user by somehow incoporating the parsed objects into a
>>>> type of index.
>>>> I'd like also to use additional columns (like molecular formula) in
>>>> the match function to "triage" the table to quickly rule out impossible
>>>> matches, before doing a full search/match.
>>>>
>>>> Am I way off the track here? Is it a bad idea to clutter the index
>>>> with things like this? Is it possible? Is there another
>>>> approach that could hide some of these details from the user - meaning
>>>> they would not have to create/update these additional columns?
>>>>
>>>> Thanks,
>>>> TJ O'Donnell
>>>>
>>>> ---------------------------(end of
>>>> broadcast)---------------------------
>>>> TIP 9: the planner will ignore your desire to choose an index scan
>>>> if your
>>>> joining column's datatypes do not match
>>>>
>>>
>>
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Keith C. Perry 2005-01-10 02:05:51 Re: PostgreSQL users on webhosting
Previous Message Pierre-Frédéric Caillaud 2005-01-09 22:59:35 Re: how to optimize my c-extension functions