You are correct. "similar" means within a small range.
Below is a sample query:
select count(*) from speedtest where (p1 between 209 and 309) and (p2
between 241 and 341) and (p3 between 172 and 272) and (p4 between 150
and 250) and (p5 between 242 and 342) and (p6 between 222 and 322) and
(p7 between 158 and 258) and (p8 between 249 and 349) and (p9 between
162 and 262) and (p10 between 189 and 289) and (p11 between 201 and
301) and (p12 between 167 and 267) and (p13 between 167 and 267) and
(p14 between 229 and 329) and (p15 between 235 and 335) and (p16
between 190 and 290) and (p17 between 240 and 340) and (p18 between 156
and 256) and (p19 between 150 and 250) and (p20 between 171 and 271)
and (p21 between 241 and 341) and (p22 between 244 and 344) and (p23
between 219 and 319) and (p24 between 198 and 298) and (p25 between 196
and 296) and (p26 between 243 and 343) and (p27 between 160 and 260)
and (p28 betw een 151 and 251) and (p29 between 226 and 326) and (p30
between 168 and 268) and (p31 between 153 and 253) and (p32 between
218 and 318)
Currently, on an un-tuned installation, this query takes about 1
second. Much too slow for our needs. We need to be able to execute
about 30-50 per second.
I'm not a database expert. There is probably a better way to do this,
but I have no idea how.
The general use of this table is as an index for document storage.
When we come across a new document, we have to know if we already have
something close to it. Exact checksums don't work because two
documents with only a few different words are still "the same" for our
intended use. We calculate 32 separate checksums on parts of each
document. By storing all 32, we have a good representation of each
document. A new document can then very quickly be checked against the
table to see if we already have something close to it.
If anybody has any better ideas, I would love to hear it...
On Saturday, December 21, 2002, at 03:02 PM, Manfred Koizar wrote:
> On Sat, 21 Dec 2002 13:46:05 -0500, Noah Silverman
> <noah(at)allresearch(dot)com> wrote:
>> Without divulging too many company
>> secrets, we create a 32 key profile of an object. We then have to be
>> able to search the database to find "similar" objects.
> ... where "similar" means that the value of each attribute lies within
> a small range around the value of the corresponding attribute of the
> reference object?
> I fear a multicolumn b-tree index is not the optimal solution to this
> problem, unless you have some extremely selective attributes you can
> put at the start of the index. But then again I doubt that it makes
> sense to include even the last attribute (or the last few attributes)
> into the index.
>> In reality, we
>> will probably have 20MM to 30MM rows in our table. I need to very
>> quickly find the matching records on a "test" object.
> This seems to be a nice case for utilizing bitmaps for index scans.
> Thus you would scan several single column indices and combine the
> bitmaps before accessing the heap tuples. This has been discussed on
> -hackers and I believe it is a todo item.
> I don't know, whether GiST or R-Tree could help. Is anybody listening
> who knows?
>> If you're really curious as to more details, let me know (I don't want
>> to bore the group with our specifics)
> The group is patient :-)
In response to
pgsql-performance by date
|Next:||From: Tom Lane||Date: 2002-12-21 20:28:22|
|Subject: Re: Speed Question |
|Previous:||From: Manfred Koizar||Date: 2002-12-21 20:02:39|
|Subject: Re: Speed Question|