From: | Sam Saffron <sam(dot)saffron(at)gmail(dot)com> |
---|---|
To: | PGSQL Mailing List <pgsql-general(at)postgresql(dot)org> |
Subject: | Why can I not get lexemes for Hebrew but can get them for Armenian? |
Date: | 2019-02-27 10:11:37 |
Message-ID: | CAAtdryM4vrD+XEOho7me4pH7qHN=DpjF6QFe1BJXFgAQkHE3nA@mail.gmail.com |
Lists: | pgsql-general |
(This is a cross-post from Stack Exchange, where it is not getting much traction.)
On my Mac install of PG:
```
=# select to_tsvector('english', 'abcd สวัสดี');
to_tsvector
-------------
'abcd':1
(1 row)
=# select * from ts_debug('hello สวัสดี');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+-------+----------------+--------------+---------
asciiword | Word, all ASCII | hello | {english_stem} | english_stem | {hello}
blank | Space symbols | สวัสดี | {} | |
(2 rows)
```
On my Linux install of PG:
```
=# select to_tsvector('english', 'abcd สวัสดี');
to_tsvector
-------------------
'abcd':1 'สวัสดี':2
(1 row)
=# select * from ts_debug('hello สวัสดี');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-------------------+-------+----------------+--------------+---------
asciiword | Word, all ASCII | hello | {english_stem} | english_stem | {hello}
blank | Space symbols | | {} | |
word | Word, all letters | สวัสดี | {english_stem} | english_stem | {สวัสดี}
(3 rows)
```
So the two installs clearly tokenise the same input differently. My
question is: how do I figure out what is different, and how do I make
my Mac install of PG behave like the Linux one?
On both installs:
```
# SHOW default_text_search_config;
default_text_search_config
----------------------------
pg_catalog.english
(1 row)
# show lc_ctype;
lc_ctype
-------------
en_US.UTF-8
(1 row)
```
So somehow this Mac install treats Thai letters as spaces. How do I
debug this and fix the "Space symbols" classification here?
Interestingly, this install works with Armenian but fails when we
reach Hebrew:
```
=# select * from ts_debug('ԵԵԵ');
alias | description | token | dictionaries | dictionary | lexemes
-------+-------------------+-------+----------------+--------------+---------
word | Word, all letters | ԵԵԵ | {english_stem} | english_stem | {եեե}
(1 row)
=# select * from ts_debug('אאא');
alias | description | token | dictionaries | dictionary | lexemes
-------+---------------+-------+--------------+------------+---------
blank | Space symbols | אאא | {} | |
(1 row)
```
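One way to narrow this down, as a sketch rather than a confirmed diagnosis: run the same checks on both installs and compare the output. The queries below use only built-in functions and catalogs (`version()`, `pg_database`, `ts_parse`, `ts_token_type`); the sample string is just the test words from above. The assumption behind it is that the default parser decides what counts as a letter by asking the operating system's C library under `lc_ctype`, so the same locale name can likely give different results on macOS and Linux.

```sql
-- Server version and build platform, to confirm which OS each server runs on.
SELECT version();

-- Encoding and locale recorded for each database; SHOW lc_ctype only
-- reports the current database.
SELECT datname,
       pg_encoding_to_char(encoding) AS encoding,
       datcollate,
       datctype
FROM pg_database;

-- Raw output of the default parser, before any dictionary is applied.
-- Tokens the parser does not treat as letters show up with alias 'blank'.
SELECT t.alias, p.token
FROM ts_parse('default', 'hello สวัสดี אאא ԵԵԵ') AS p
JOIN ts_token_type('default') AS t ON t.tokid = p.tokid;
```

If the last query already differs between the two machines, the difference comes from the parser's character classification and not from the `english` configuration or its dictionaries.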