From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Cc: | Егор Чиндяскин <kyzevan23(at)mail(dot)ru>, Richard Guo <guofenglinux(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, mahendrakar s <mahendrakarforpg(at)gmail(dot)com> |
Subject: | Re: Stack overflow issue |
Date: | 2022-08-30 22:57:06 |
Message-ID: | 3802215.1661900226@sss.pgh.pa.us |
Lists: | pgsql-hackers |
I wrote:
> The upstream recommendation, which seems pretty sane to me, is to
> simply reject any string exceeding some threshold length as not
> possibly being a word. Apparently it's common to use thresholds
> as small as 64 bytes, but in the attached I used 1000 bytes.

On further thought: that coding treats anything longer than 1000
bytes as a stopword, but maybe we should just accept it unmodified.
The manual says "A Snowball dictionary recognizes everything, whether
or not it is able to simplify the word". While "recognizes" formally
includes the case of "recognizes as a stopword", people might find
this behavior surprising. We could alternatively do it as attached,
which accepts overlength words but does nothing to them except
case-fold. This is closer to the pre-patch behavior, but gives up
the opportunity to avoid useless downstream processing of long words.
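
To make the second alternative concrete, here is a minimal standalone sketch of "accept overlength words but only case-fold them". It is not the attached patch: `lexize_sketch`, `toy_stem`, and `MAX_STEM_INPUT_LEN` are hypothetical names, the stemmer is a toy stand-in that strips one trailing "s", and case folding is ASCII-only for simplicity. Only the 1000-byte threshold is taken from the message.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Threshold from the message; the macro name is hypothetical. */
#define MAX_STEM_INPUT_LEN 1000

/* Hypothetical stand-in for a real stemmer: strips one trailing 's'. */
static void
toy_stem(char *word)
{
    size_t len = strlen(word);

    if (len > 1 && word[len - 1] == 's')
        word[len - 1] = '\0';
}

/*
 * Return a newly malloc'd lexeme for token[0..len-1], or NULL if
 * allocation fails.  The caller is responsible for free()ing it.
 */
char *
lexize_sketch(const char *token, size_t len)
{
    char   *txt;
    size_t  i;

    assert(token != NULL);

    txt = malloc(len + 1);
    if (txt == NULL)
        return NULL;

    /* Case-fold every input, long or short (ASCII-only here). */
    for (i = 0; i < len; i++)
        txt[i] = (char) tolower((unsigned char) token[i]);
    txt[len] = '\0';

    /*
     * Overlength input cannot plausibly be a word: accept it as-is
     * (case-folded) and never hand it to the stemmer.
     */
    if (len > MAX_STEM_INPUT_LEN)
        return txt;

    toy_stem(txt);
    return txt;
}
```

In this sketch a short token such as "Cats" is folded and stemmed, while a 1501-byte token comes back the same length, merely lowercased. The stopword variant described first would instead produce no lexeme at all for the long token, which is where the downstream processing savings come from.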

regards, tom lane

Attachment | Content-Type | Size |
---|---|---|
limit-length-of-strings-passed-to-snowball-2.patch | text/x-diff | 1.2 KB |