| From: | Blessy Thomas <blessy456bthomas(at)gmail(dot)com> |
|---|---|
| To: | pgsql-general(at)postgresql(dot)org |
| Subject: | Fwd: Extension - multilingual_fuzzy_match : Multilingual phonetic matching extension for PostgreSQL |
| Date: | 2026-03-23 05:52:06 |
| Message-ID: | CAJyyjtA2=kVgcnZtYYinLqAyOQEmjX=rALM0LNPrh633M-FsvQ@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-general pgsql-hackers |
---------- Forwarded message ---------
From: Blessy Thomas <blessy456bthomas(at)gmail(dot)com>
Date: Mon, 2 Mar 2026 at 12:55
Subject: Extension - multilingual_fuzzy_match : Multilingual phonetic
matching extension for PostgreSQL
Hello PostgreSQL Community,
I would like to introduce a PostgreSQL extension called
multilingual_fuzzy_match. This extension enables multilingual name
normalization, transliteration, and fuzzy phonetic matching directly inside
PostgreSQL at query time.
1. What Problem It Solves:
In multilingual datasets (especially Indian language datasets), the same
name may appear in:
- Different scripts
- Different transliterations
- Slight spelling variations
- Multiple languages
For example:
राम ≈ Raam ≈ رَام ≈ ராம்
Traditional equality or LIKE queries fail in such cases. Even trigram
matching doesn’t fully address cross-script phonetic similarity.
2. What This Extension Does
- Detects the script of the input text
- Performs transliteration and normalization
- Generates a phonetic key
- Uses Levenshtein distance (via python-Levenshtein)
- Returns similarity-scored results
All of this happens inside PostgreSQL using PL/Python (plpython3u).
3. Key Features
- No schema changes required
- Query-level matching
- Supports 11 major Indian scripts:
Devanagari, Tamil, Telugu, Bengali, Urdu, Malayalam, Kannada, Odia,
Gujarati, Punjabi
- Works on existing tables
4. Requirements
- PostgreSQL 17 (compiled with Python support)
- Python 3.12+
- plpython3u
- Python packages:
pip install indic-transliteration python-Levenshtein
5. Example Usage
-----------------------------------------------------------------------------------------------------------------------------
postgres=#
SELECT * FROM fuzzy_match('names_native_dist', 'name', 'Rahul')
WHERE distance <= 1;
id | name | translit | normalized | fuzzy | distance
----+-------+----------+------------+-------+----------
1 | राहुल | rAhula | rahul | rahul | 0
2 | রাহুল | rAhula | rahul | rahul | 0
4 | ರಾಹುಲ್ | rAhul | rahul | rahul | 0
5 | Rahul | Rahul | rahul | rahul | 0
(4 rows)
--------------------------------------------------------------------------------------------------------------------------------
6. Feedback Requested
I would really appreciate feedback from the community on:
- Extension design approach
- Performance considerations
- Suitability for PGXN submission
I would love suggestions, improvements, and any guidance on making this
production-ready. I’m sharing this not just as a project, but as a starting
point for discussion about multilingual data handling inside PostgreSQL.
Looking forward to your thoughts and critiques.
Thank you!
Regards
Blessy Thomas
| Attachment | Content-Type | Size |
|---|---|---|
| Screenshot from 2026-03-02 12-29-45.png | image/png | 73.7 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Aayra Singh | 2026-03-23 10:16:05 | PostGre SQL/ HCL |
| Previous Message | Alexandre Felipe | 2026-03-20 13:43:43 | Re: Index scan with bitmap filter - has this been explored |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Richard Guo | 2026-03-23 06:12:52 | Re: Remove inner joins based on foreign keys |
| Previous Message | Fujii Masao | 2026-03-23 05:50:52 | Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion? |