Re: levenshtein_less_equal (was: multibyte charater set in levenshtein function)

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: levenshtein_less_equal (was: multibyte charater set in levenshtein function)
Date: 2010-09-27 12:42:52
Message-ID: AANLkTi=bzNBFhCVc+D=L5BP+BjHKYqd9PATBA5QMQQHF@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I'll try to illustrate different approaches in examples.

k i t t e n
0 1 2 3 4 5 6
s 1 1 2 3 4 5 6
i 2 2 1 2 3 4 5
t 3 3 2 1 2 3 4
t 4 4 3 2 1 2 3
i 5 5 4 3 2 2 3
n 6 6 5 4 3 3 2
g 7 7 6 5 4 4 3

The approach mentioned in Wikipedia limits filling of the matrix by
diagonals, which are in k-distance from main diagonal (diagonal which
contain left top cell). All other cell is guaranteed to have value greater
than k. This approach can be extended to the case of costs different from 1,
in this case limit diagonals will be closer to main diagonal proportional to
the costs. Here is example of such limited matrix with k = 3.

k i t t e n
0 1 2 3
s 1 1 2 3 4
i 2 2 1 2 3 4
t 3 3 2 1 2 3 4
t 4 3 2 1 2 3
i 4 3 2 2 3
n 4 3 3 2
g 4 4 3

The first idea of my approach is to use actual cell values to limit matrix
filling. For each row the range of columns where cell values are not greater
than k is stored. And in the next row only cells are caclucated, which use
values of cells from previous row in stored range. Here is example of this
approach with k = 3. There are slightly less filled cells but calculation
are more complicated than in previoud approach.

k i t t e n
0 1 2 3
s 1 1 2 3
i 2 2 1 2 3
t 3 3 2 1 2 3
t 3 2 1 2 3
i 3 2 2 3
n 3 3 2
g 3

The second idea is to make values in matrix possible greater. I analyze what
exactly is matrix in this case. It is sum of original matrix, which
represent distances between prefixes of strings, and matrix, which represent
cost of insertions or deletions for moving to diagonal, which containing
bottom right cell. There is an example of second matrix summand:

k i t t e n
1 2 3 4 5 6 7
s 0 1 2 3 4 5 6
i 1 0 1 2 3 4 5
t 2 1 0 1 2 3 4
t 3 2 1 0 1 2 3
i 4 3 2 1 0 1 2
n 5 4 3 2 1 0 1
g 6 5 4 3 2 1 0

And an example of resulting matrix:

k i t t e n
1 3 5 7 9 11 13
s 1 2 4 6 8 10 12
i 3 2 2 4 6 8 10
t 5 4 2 2 4 6 8
t 7 6 4 2 2 4 6
i 9 8 6 4 2 3 5
n 11 10 8 6 4 3 3
g 13 12 10 8 6 5 3

The resulting matrix saves important property of original matrix, that cell
value always greater or equal than values, which are used for it's
calculation. That's why we can use idea about matrix filling limiting for
this matrix, but values in this matrix are greater and it allow to avoid
filling bigger part of matrix. There is an example of matrix filling
limiting for this matrix:

k i t t e n
1 3
s 1 2
i 2 2
t 2 2
t 2 2
i 2 3
n 3 3
g 3

----
With best regards,
Alexander Korotkov.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Devrim GÜNDÜZ 2010-09-27 12:51:40 Re: pg_filedump binary for CentOS
Previous Message Dave Page 2010-09-27 12:34:40 Re: [BUGS] BUG #5305: Postgres service stops when closing Windows session