Why can I not get lexemes for Hebrew but can get them for Armenian?

From: Sam Saffron <sam(dot)saffron(at)gmail(dot)com>
To: PGSQL Mailing List <pgsql-general(at)postgresql(dot)org>
Subject: Why can I not get lexemes for Hebrew but can get them for Armenian?
Date: 2019-02-27 10:11:37
Message-ID: CAAtdryM4vrD+XEOho7me4pH7qHN=DpjF6QFe1BJXFgAQkHE3nA@mail.gmail.com
Lists: pgsql-general

(This is a cross-post from Stack Exchange, where it is not getting much traction.)

On my Mac install of PG:

```
=# select to_tsvector('english', 'abcd สวัสดี');
to_tsvector
-------------
'abcd':1
(1 row)

=# select * from ts_debug('hello สวัสดี');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+-------+----------------+--------------+---------
asciiword | Word, all ASCII | hello | {english_stem} | english_stem | {hello}
blank | Space symbols | สวัสดี | {} | |
(2 rows)
```

On my Linux install of PG:

```
=# select to_tsvector('english', 'abcd สวัสดี');
to_tsvector
-------------------
'abcd':1 'สวัสดี':2
(1 row)

=# select * from ts_debug('hello สวัสดี');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-------------------+-------+----------------+--------------+---------
asciiword | Word, all ASCII | hello | {english_stem} | english_stem | {hello}
blank | Space symbols | | {} | |
word | Word, all letters | สวัสดี | {english_stem} | english_stem | {สวัสดี}
(3 rows)
```

So the two installs clearly tokenise this text differently. My
question is: how do I figure out what differs, and how do I make my
Mac install of PG behave like the Linux one?

On both installs:

```
# SHOW default_text_search_config;
default_text_search_config
----------------------------
pg_catalog.english
(1 row)

# show lc_ctype;
lc_ctype
-------------
en_US.UTF-8
(1 row)
```

So somehow this Mac install thinks that Thai letters are spaces. How
do I debug this and fix the "Space symbols" definition here?

Interestingly, this install works with Armenian but falls over when we
reach Hebrew:

```
=# select * from ts_debug('ԵԵԵ');
alias | description | token | dictionaries | dictionary | lexemes
-------+-------------------+-------+----------------+--------------+---------
word | Word, all letters | ԵԵԵ | {english_stem} | english_stem | {եեե}
(1 row)

=# select * from ts_debug('אאא');
alias | description | token | dictionaries | dictionary | lexemes
-------+---------------+-------+--------------+------------+---------
blank | Space symbols | אאא | {} | |
(1 row)
```
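
For anyone trying to reproduce this, here is a rough sketch of checks that might help narrow it down. It asks the built-in `default` parser directly how it classifies each token, so the `english_stem` dictionary is out of the picture, and then prints the database's own locale settings. My assumption (not confirmed) is that the difference lies in how each platform's C library classifies these characters under `en_US.UTF-8`, so comparing this output between the Mac and Linux installs seems like a reasonable first step.

```sql
-- List the token types the built-in "default" parser can emit,
-- so the tokid values returned by ts_parse below can be interpreted.
SELECT tokid, alias, description
FROM ts_token_type('default');

-- Raw parser output for the problem strings; no dictionary is involved here.
SELECT * FROM ts_parse('default', 'hello สวัสดี');
SELECT * FROM ts_parse('default', 'ԵԵԵ');
SELECT * FROM ts_parse('default', 'אאא');

-- Locale and encoding recorded for the current database
-- (these can differ from what SHOW reports at the session level).
SELECT datname,
       datcollate,
       datctype,
       pg_encoding_to_char(encoding) AS encoding
FROM pg_database
WHERE datname = current_database();

-- Server version and build platform, for comparing the two installs.
SELECT version();
```

If `ts_parse` already returns the `blank` token type for the Thai and Hebrew strings on the Mac, the dictionaries are not the problem and the difference is in the parser's character classification.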
