Re: [PROPOSAL] Improvements of Hunspell dictionaries support

From: Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PROPOSAL] Improvements of Hunspell dictionaries support
Date: 2016-01-09 17:31:59
Message-ID: 5691440F.2030502@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks for review.

On 09.01.2016 02:04, Alvaro Herrera wrote:
> Artur Zakirov wrote:
>> --- 74,80 ----
>>
>> typedef struct aff_struct
>> {
>> ! uint32 flag:16,
>> type:1,
>> flagflags:7,
>> issimple:1,
> By doing this, you're using 40 bits of a 32-bits-wide field. What does
> this mean? Are the final 8 bits lost? Does the compiler allocate a
> second uint32 member for those additional bits? I don't know, but I
> don't think this is a very clean idea.
No, 8 bits are not lost. This 8 bits are used if a dictionary uses
double extended ASCII character flag type (Conf->flagMode == FM_LONG) or
decimal number flag type (Conf->flagMode == FM_NUM). If a dictionary
uses single extended ASCII character flag type (Conf->flagMode ==
FM_CHAR), then 8 bits lost.

You can see it in decodeFlag function. This function is used in
NIImportOOAffixes function, decode affix flag from string type and store
in flag field (flag:16).
>
>> typedef struct spell_struct
>> {
>> ! struct
>> {
>> ! int affix;
>> ! int len;
>> ! } d;
>> ! /*
>> ! * flag is filled in by NIImportDictionary. After NISortDictionary, d
>> ! * is valid and flag is invalid.
>> ! */
>> ! char *flag;
>> char word[FLEXIBLE_ARRAY_MEMBER];
>> } SPELL;
> Here you removed the union, with no rationale for doing so. Why did you
> do it? Can it be avoided? Because of the comment, I'd imagine that d
> and flag are valid at different times, so at any time we care about only
> one of them; but you haven't updated the comment stating the reason for
> that no longer to be the case. I suspect you need to keep flag valid
> after NISortDictionary has been called, but if so why? If "flag" is
> invalid as the comment says, what's the reason for keeping it?
Union was removed because the flag field need to be dynamically sized.
It had 16 size in the previous version. In this field flag set are
stored. For example, if .dict file has the entry:

mitigate/NDSGny

Then the "NDSGny" is stored in the flag field.

But in some cases a flag set can have size bigger than 16. I added this
changes after this message
http://www.postgresql.org/message-id/CAE2gYzwom3=11U9G8ZxMT5PLkZrwb12BWzxh4dB3HUd89FOSrg@mail.gmail.com
In that Turkish dictionary there are can be large flag set. For example:

özek/2240,852,749,5026,2242,4455,2594,2597,4963,1608,494,2409

This flag set has 56 size.
This "flag" is valid all the time. It is used in NISortDictionary and it
is not used after NISortDictionary function has been called. Maybe you
right and there are no reason for keeping it, and it is necessary to
store all flags in separate variable, that will be deleted after
NISortDictionary has been called.

> The routines decodeFlag and isAffixFlagInUse could do with more
> comments. Your patch adds zero. Actually the whole file has not nearly
> enough comments; adding some more would be very good.
>
> Actually, after some more reading, I think this code is pretty terrible.
> I have a hard time figuring out how the original works, which makes it
> even more difficult to figure out whether your changes make sense. I
> would have to take your patch on faith, which doesn't sound so great an
> idea.
>
> palloc / cpalloc / tmpalloc make the whole mess even more confusing.
> Why does this file have three ways to allocate memory?
>
> Not sure what's a good way to go about this. I am certainly not going
> to commit this as is, because if I do whatever bugs you have are going
> to become my problem; and with the severe lack of documentation and
> given how fiddly this stuff is, I bet there are going to be a bunch of
> bugs. I suspect most committers are going to be in the same position.
> I think you should start by adding a few comments here and there on top
> of the original to explain how it works, then your patch on top. I
> suppose it's going to be a lot of work for you but I don't see any other
> way.
>
> A top-level overview about it would be good, too. The current comment
> at top of file states:
>
> * spell.c
> * Normalizing word with ISpell
>
> which is, err, somewhat laconic.
>
I will provide comments and explain how it works in comments. Maybe I
will add some explanation about dictionaries structure. I will update
the patch soon.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Artur Zakirov 2016-01-09 17:42:05 Re: [PROPOSAL] Improvements of Hunspell dictionaries support
Previous Message Andrew Dunstan 2016-01-09 17:00:58 Re: [COMMITTERS] pgsql: Blind attempt at a Cygwin fix