Re: levenshtein_less_equal (was: multibyte charater set in levenshtein function)

From: Robert Haas Alexander Korotkov Itagaki Takahiro , PostgreSQL Hackers Re: levenshtein_less_equal (was: multibyte charater set in levenshtein function) 2010-09-28 02:52:40 AANLkTinin8HjBeJyMdZCaGRTr8C1O4OawLJnQcrg9-W4@mail.gmail.com (view raw or whole thread) 2010-09-24 20:01:24 from Robert Haas  2010-09-27 12:42:52 from Alexander Korotkov   2010-09-28 02:52:40 from Robert Haas    2010-10-04 20:19:18 from Alexander Korotkov     2010-10-08 01:52:26 from Robert Haas      2010-10-08 16:51:40 from Alexander Korotkov       2010-10-13 05:49:27 from Robert Haas        2010-10-13 13:32:36 from Tom Lane         2010-10-13 14:18:01 from Alvaro Herrera          2010-10-13 14:51:59 from Tom Lane           2010-10-13 15:22:17 from Robert Haas            2010-10-13 15:42:28 from Tom Lane             2010-10-13 15:53:35 from Alexander Korotkov              2010-10-13 15:58:32 from Alexander Korotkov             2010-10-13 16:30:25 from Robert Haas          2010-10-13 16:23:22 from Andres Freund pgsql-hackers
```On Mon, Sep 27, 2010 at 8:42 AM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> The second idea is to make values in matrix possible greater. I analyze what
> exactly is matrix in this case. It is sum of original matrix, which
> represent distances between prefixes of strings, and matrix, which represent
> cost of insertions or deletions for moving to diagonal, which containing
> bottom right cell. There is an example of second matrix summand:
>
>       k  i  t  t  e  n
>    1  2  3  4  5  6  7
> s  0  1  2  3  4  5  6
> i  1  0  1  2  3  4  5
> t  2  1  0  1  2  3  4
> t  3  2  1  0  1  2  3
> i  4  3  2  1  0  1  2
> n  5  4  3  2  1  0  1
> g  6  5  4  3  2  1  0
>
> And an example of resulting matrix:
>
>       k  i  t  t  e  n
>    1  3  5  7  9  11 13
> s  1  2  4  6  8  10 12
> i  3  2  2  4  6  8  10
> t  5  4  2  2  4  6  8
> t  7  6  4  2  2  4  6
> i  9  8  6  4  2  3  5
> n  11 10 8  6  4  3  3
> g  13 12 10 8  6  5  3
>
> The resulting matrix saves important property of original matrix, that cell
> value always greater or equal than values, which are used for it's
> calculation.

Hmm.  So if I understand correctly (and it's possible that I don't),
the idea here is that for each cell in the matrix, we store the sum of
the following two quantities:

1. The cost of transforming an initial substring of s into an initial
substring of t (as before), and
2. The smallest imaginable cost of transforming the remaining portion
of s into the remaining portion of t, based on the difference in the
string lengths.

Thus, any cell whose value is already > max_d can't be relevant to the
final result.

The code seems a bit complex, though.  In particular, I'm wondering if
we couldn't simplify things by leaving the meaning of the cells in the
matrix unchanged and just using this calculation to adjust the lower
and upper bounds for each scan.  Perhaps instead of
curr/prev_left/right we could call them min_i and max_i, and after
each inner loop from min_i to max_i we could try to increment min_i
and decrement max_i based on whether cur[min/max_i] + the smallest
possible residual cost > max_d.  Then you've got to also increment
max_i once for each value of j.  Does that make any sense or am I all
wet?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

```

pgsql-hackers by date

 Next: From: Robert Haas Date: 2010-09-28 03:06:29 Subject: Re: Hot Standby tuning for btree_xlog_vacuum() Previous: From: Itagaki Takahiro Date: 2010-09-28 02:05:07 Subject: Re: I: About "Our CLUSTER implementation is pessimal" patch

Privacy Policy | About PostgreSQL
Copyright © 1996-2016 The PostgreSQL Global Development Group