Re: full text search and hyphens in uuid

From: Martin Norbäck Olivers <martin(at)norpan(dot)org>
To: Steve Midgley <science(at)misuse(dot)org>
Cc: pgsql-sql(at)lists(dot)postgresql(dot)org
Subject: Re: full text search and hyphens in uuid
Date: 2023-10-29 16:31:07
Message-ID: CALoTC6sk1_8-r6e-9-t2oAe79yX94iG=AOkSH2nHRnxgDrHGFg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Hi! Thanks for answering.

My use case for doing this is that I have uuids embedded within the text
data (it's JSON data actually) and I just index to_tsvector('simple',
json_column).

And I want to search for the uuids sometimes, and it's not predetermined
which json keys contain them. But it does seem like it's not possible to
change the to_tsvector lexer, so I guess I will have to extract the uuids
when inserting the data and index them separately.

Regards,
Martin

On Sat, Oct 28, 2023 at 5:48 PM Steve Midgley <science(at)misuse(dot)org> wrote:

> On Fri, Oct 27, 2023 at 4:49 AM Martin Norbäck Olivers <martin(at)norpan(dot)org>
> wrote:
>
>> Hi!
>> I have a problem with full text search and uuids in the text which
>> I index using to_tsvector . I have uuids in my text and most of the time,
>> it works well because they are lexed as words so I can just search for the
>> parts of the uuid.
>>
>> The problem is an uuid like this:
>> select to_tsvector('simple','0232710f-8545-59eb-abcd-47aa57184361')
>>
>> Which gives this result
>> '-59':3 '-8545':2 '0232710f':1 '47aa57184361':7 'abcd':6 'eb':5
>> 'eb-abcd-47aa57184361':4
>>
>> So, I found dict_int and asked it to remove the minus signs
>>
>> create extension dict_int;
>> ALTER TEXT SEARCH DICTIONARY intdict (MAXLEN = 12, absval = true);
>> alter text search configuration simple alter mapping for int, uint with
>> intdict
>>
>> and now I get this result instead:
>> '0232710f':1 '47aa57184361':7 '59':3 '8545':2 'abcd':6 'eb':5
>> 'eb-abcd-47aa57184361':4
>>
>> which is slightly better, but still not good enough because there is no
>> token 59eb. It's being split into 59 and eb.
>>
>> Is there any way to change this behaviour of the tsvector lexer? Do I
>> have to write my own tsvector or is there a way to "turn off" integer
>> handling in the lexer?
>>
>> Regards,
>> Martin
>>
>> I don't understand your use case for doing this, but it seems like you
> could use something other than ts_vector to break apart your uuids, and
> then index them? It seems like ts_vector is primarily used to find things
> that are near to other things via their vector signatures (at least that's
> my understanding). But doing vector component math on segments of a UUID
> seems meaningless since the UUID is mostly random?
>
> So couldn't you break your UUID into separate fields, or barring that into
> a jsonb or array field that contains the components, and then just index
> that computed field? Maybe that could even be achieved in a view, if you
> don't want to alter your core table?
>
> Obviously all this could be insensible, if I'm not following the purpose
> of your use of ts_vector..
> Best,
> Steve
>

--
Martin Norbäck Olivers
IT-konsult, Masara AB
Telefon: +46 703 22 70 12
E-post: martin(at)norpan(dot)org
Kärrhöksvägen 4
656 72 Skattkärr

In response to

Browse pgsql-sql by date

  From Date Subject
Next Message Sanjay Minni 2023-11-08 16:02:16 DEFAULT in update & prepared statements
Previous Message Tom Lane 2023-10-28 02:05:14 Re: full text search and hyphens in uuid