pgsql: Use radix tree for character encoding conversions.

From: Heikki Linnakangas <heikki(dot)linnakangas(at)iki(dot)fi>
To: pgsql-committers(at)postgresql(dot)org
Subject: pgsql: Use radix tree for character encoding conversions.
Date: 2017-03-13 18:47:23
Message-ID: E1cnV07-0007li-6D@gemulon.postgresql.org
Lists: pgsql-committers

Use radix tree for character encoding conversions.

Replace the mapping tables used to convert between UTF-8 and other
character encodings with new radix tree-based maps. Looking up an entry in
a radix tree is much faster than a binary search in the old maps. As a
bonus, the radix tree representation is also more compact, making the
binaries slightly smaller.
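
To illustrate why a radix tree lookup avoids the repeated comparisons of a
binary search, here is a minimal sketch for two-byte codes. It is not the
layout used by the actual .map files or conv.c; the names (toy_level1,
toy_leaves, toy_radix_lookup) and the table values are hypothetical
placeholders chosen for the example. Each input byte is used directly as an
array index, so a lookup costs one array access per byte regardless of how
many characters the map contains.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical two-level radix tree for 2-byte source codes.
 * This is an illustration only, not PostgreSQL's real map format.
 *
 * Level 1 is indexed by the high byte and yields the start offset of a
 * 256-entry block in toy_leaves. Offset 0 is a shared all-zero block, so
 * high bytes without any mapping need no special case.
 */
static const uint16_t toy_level1[256] = {
    [0xA1] = 256,               /* block for codes 0xA1xx */
    [0xA2] = 512,               /* block for codes 0xA2xx */
};

/*
 * Level 2: three 256-entry blocks, indexed by the low byte.
 * A value of 0 means "no mapping". The nonzero values are placeholder
 * UTF-8 byte sequences packed into integers.
 */
static const uint32_t toy_leaves[3 * 256] = {
    [256 + 0xA1] = 0xE38080u,
    [256 + 0xA2] = 0xE38081u,
    [512 + 0xA1] = 0xE285A0u,
};

/* Return the packed UTF-8 value for a 2-byte code, or 0 if unmapped. */
static uint32_t
toy_radix_lookup(uint16_t code)
{
    size_t      idx = (size_t) toy_level1[code >> 8] + (code & 0xFF);

    /* Defensive check: level-1 offsets must stay inside the leaf array. */
    assert(idx < sizeof(toy_leaves) / sizeof(toy_leaves[0]));
    return toy_leaves[idx];
}
```

With these placeholder tables, toy_radix_lookup(0xA1A2) walks exactly two
array entries, while a binary search over a sorted table of n entries needs
on the order of log2(n) comparisons. Sharing one all-zero block among the
unmapped high bytes is also one way such a layout can stay compact.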

The "combined" maps work the same as before, with binary search. They are
much smaller than the main tables, so lookup speed matters less for them.
However, the "combined" maps are now stored in the same .map files as the
main tables. This seems clearer, since they're always used together and
generated from the same source files.
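
As a rough sketch of what a binary search over a small "combined" table
could look like, the example below maps a pair of UTF-8 sequences to a
single code in the other encoding. The struct, table contents, and function
names (toy_combined, toy_combined_map, toy_combined_lookup) are hypothetical
and the codes are placeholders; the real definitions live in pg_wchar.h and
the generated .map files.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical combined-map entry: two UTF-8 sequences map to one code. */
typedef struct
{
    uint32_t    utf1;           /* first UTF-8 sequence, packed */
    uint32_t    utf2;           /* second (combining) UTF-8 sequence, packed */
    uint32_t    code;           /* placeholder code in the other encoding */
} toy_combined;

/* Must stay sorted by (utf1, utf2) for bsearch() to work. */
static const toy_combined toy_combined_map[] = {
    {0x41, 0xCC80, 0x8001},
    {0x41, 0xCC81, 0x8002},
    {0x45, 0xCC80, 0x8003},
};

/* Order entries by utf1 first, then by utf2. */
static int
toy_combined_cmp(const void *key, const void *elem)
{
    const toy_combined *k = key;
    const toy_combined *e = elem;

    if (k->utf1 != e->utf1)
        return (k->utf1 < e->utf1) ? -1 : 1;
    if (k->utf2 != e->utf2)
        return (k->utf2 < e->utf2) ? -1 : 1;
    return 0;
}

/* Return the mapped code for the pair, or 0 if the pair is not in the map. */
static uint32_t
toy_combined_lookup(uint32_t utf1, uint32_t utf2)
{
    toy_combined key = {utf1, utf2, 0};
    const toy_combined *hit;

    /* A combined lookup only makes sense with two non-empty sequences. */
    assert(utf1 != 0 && utf2 != 0);

    hit = bsearch(&key, toy_combined_map,
                  sizeof(toy_combined_map) / sizeof(toy_combined_map[0]),
                  sizeof(toy_combined_map[0]), toy_combined_cmp);
    return hit ? hit->code : 0;
}
```

Because such a table holds only a handful of entries, the logarithmic cost
of the search is negligible, which fits the reasoning above for leaving the
"combined" maps on binary search.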

Patch by Kyotaro Horiguchi, with a lot of hacking by me at various stages.
Reviewed by Michael Paquier and Daniel Gustafsson.

Discussion: https://www.postgresql.org/message-id/20170306.171609.204324917.horiguchi.kyotaro%40lab.ntt.co.jp

Branch
------
master

Details
-------
http://git.postgresql.org/pg/commitdiff/aeed17d00037950a16cc5ebad5b5592e5fa1ad0f

Modified Files
--------------
src/backend/utils/mb/Unicode/Makefile | 10 +-
src/backend/utils/mb/Unicode/UCS_to_BIG5.pl | 12 +-
src/backend/utils/mb/Unicode/UCS_to_EUC_CN.pl | 10 +-
.../utils/mb/Unicode/UCS_to_EUC_JIS_2004.pl | 22 +-
src/backend/utils/mb/Unicode/UCS_to_EUC_JP.pl | 189 +-
src/backend/utils/mb/Unicode/UCS_to_EUC_KR.pl | 14 +-
src/backend/utils/mb/Unicode/UCS_to_EUC_TW.pl | 10 +-
src/backend/utils/mb/Unicode/UCS_to_GB18030.pl | 10 +-
src/backend/utils/mb/Unicode/UCS_to_JOHAB.pl | 12 +-
.../utils/mb/Unicode/UCS_to_SHIFT_JIS_2004.pl | 21 +-
src/backend/utils/mb/Unicode/UCS_to_SJIS.pl | 32 +-
src/backend/utils/mb/Unicode/UCS_to_UHC.pl | 12 +-
src/backend/utils/mb/Unicode/UCS_to_most.pl | 6 +-
src/backend/utils/mb/Unicode/big5_to_utf8.map | 18321 ++------
src/backend/utils/mb/Unicode/convutils.pm | 806 +-
src/backend/utils/mb/Unicode/euc_cn_to_utf8.map | 9723 +----
.../utils/mb/Unicode/euc_jis_2004_to_utf8.map | 14744 ++-----
.../mb/Unicode/euc_jis_2004_to_utf8_combined.map | 29 -
src/backend/utils/mb/Unicode/euc_jp_to_utf8.map | 17337 ++------
src/backend/utils/mb/Unicode/euc_kr_to_utf8.map | 10723 ++---
src/backend/utils/mb/Unicode/euc_tw_to_utf8.map | 31407 ++++----------
src/backend/utils/mb/Unicode/gb18030_to_utf8.map | 41882 +++++--------------
src/backend/utils/mb/Unicode/gbk_to_utf8.map | 28344 +++----------
.../utils/mb/Unicode/iso8859_10_to_utf8.map | 237 +-
.../utils/mb/Unicode/iso8859_13_to_utf8.map | 237 +-
.../utils/mb/Unicode/iso8859_14_to_utf8.map | 237 +-
.../utils/mb/Unicode/iso8859_15_to_utf8.map | 237 +-
.../utils/mb/Unicode/iso8859_16_to_utf8.map | 237 +-
src/backend/utils/mb/Unicode/iso8859_2_to_utf8.map | 205 +-
src/backend/utils/mb/Unicode/iso8859_3_to_utf8.map | 198 +-
src/backend/utils/mb/Unicode/iso8859_4_to_utf8.map | 205 +-
src/backend/utils/mb/Unicode/iso8859_5_to_utf8.map | 237 +-
src/backend/utils/mb/Unicode/iso8859_6_to_utf8.map | 158 +-
src/backend/utils/mb/Unicode/iso8859_7_to_utf8.map | 234 +-
src/backend/utils/mb/Unicode/iso8859_8_to_utf8.map | 201 +-
src/backend/utils/mb/Unicode/iso8859_9_to_utf8.map | 205 +-
src/backend/utils/mb/Unicode/johab_to_utf8.map | 23327 +++--------
src/backend/utils/mb/Unicode/koi8r_to_utf8.map | 237 +-
src/backend/utils/mb/Unicode/koi8u_to_utf8.map | 237 +-
.../utils/mb/Unicode/shift_jis_2004_to_utf8.map | 14503 ++-----
.../mb/Unicode/shift_jis_2004_to_utf8_combined.map | 29 -
src/backend/utils/mb/Unicode/sjis_to_utf8.map | 10202 ++---
src/backend/utils/mb/Unicode/uhc_to_utf8.map | 23788 +++--------
src/backend/utils/mb/Unicode/utf8_to_big5.map | 17809 ++------
src/backend/utils/mb/Unicode/utf8_to_euc_cn.map | 11487 ++---
.../utils/mb/Unicode/utf8_to_euc_jis_2004.map | 23868 ++++++-----
.../mb/Unicode/utf8_to_euc_jis_2004_combined.map | 29 -
src/backend/utils/mb/Unicode/utf8_to_euc_jp.map | 20314 ++++-----
src/backend/utils/mb/Unicode/utf8_to_euc_kr.map | 14617 +++----
src/backend/utils/mb/Unicode/utf8_to_euc_tw.map | 24574 +++--------
src/backend/utils/mb/Unicode/utf8_to_gb18030.map | 40292 +++++-------------
src/backend/utils/mb/Unicode/utf8_to_gbk.map | 26061 ++----------
.../utils/mb/Unicode/utf8_to_iso8859_10.map | 240 +-
.../utils/mb/Unicode/utf8_to_iso8859_13.map | 239 +-
.../utils/mb/Unicode/utf8_to_iso8859_14.map | 272 +-
.../utils/mb/Unicode/utf8_to_iso8859_15.map | 227 +-
.../utils/mb/Unicode/utf8_to_iso8859_16.map | 257 +-
src/backend/utils/mb/Unicode/utf8_to_iso8859_2.map | 240 +-
src/backend/utils/mb/Unicode/utf8_to_iso8859_3.map | 232 +-
src/backend/utils/mb/Unicode/utf8_to_iso8859_4.map | 240 +-
src/backend/utils/mb/Unicode/utf8_to_iso8859_5.map | 229 +-
src/backend/utils/mb/Unicode/utf8_to_iso8859_6.map | 171 +-
src/backend/utils/mb/Unicode/utf8_to_iso8859_7.map | 248 +-
src/backend/utils/mb/Unicode/utf8_to_iso8859_8.map | 194 +-
src/backend/utils/mb/Unicode/utf8_to_iso8859_9.map | 226 +-
src/backend/utils/mb/Unicode/utf8_to_johab.map | 23380 +++--------
src/backend/utils/mb/Unicode/utf8_to_koi8r.map | 301 +-
src/backend/utils/mb/Unicode/utf8_to_koi8u.map | 312 +-
.../utils/mb/Unicode/utf8_to_shift_jis_2004.map | 18954 ++++-----
.../mb/Unicode/utf8_to_shift_jis_2004_combined.map | 29 -
src/backend/utils/mb/Unicode/utf8_to_sjis.map | 11648 ++----
src/backend/utils/mb/Unicode/utf8_to_uhc.map | 23612 +++--------
src/backend/utils/mb/Unicode/utf8_to_win1250.map | 266 +-
src/backend/utils/mb/Unicode/utf8_to_win1251.map | 259 +-
src/backend/utils/mb/Unicode/utf8_to_win1252.map | 267 +-
src/backend/utils/mb/Unicode/utf8_to_win1253.map | 244 +-
src/backend/utils/mb/Unicode/utf8_to_win1254.map | 276 +-
src/backend/utils/mb/Unicode/utf8_to_win1255.map | 260 +-
src/backend/utils/mb/Unicode/utf8_to_win1256.map | 320 +-
src/backend/utils/mb/Unicode/utf8_to_win1257.map | 259 +-
src/backend/utils/mb/Unicode/utf8_to_win1258.map | 284 +-
src/backend/utils/mb/Unicode/utf8_to_win866.map | 280 +-
src/backend/utils/mb/Unicode/utf8_to_win874.map | 225 +-
src/backend/utils/mb/Unicode/win1250_to_utf8.map | 232 +-
src/backend/utils/mb/Unicode/win1251_to_utf8.map | 236 +-
src/backend/utils/mb/Unicode/win1252_to_utf8.map | 232 +-
src/backend/utils/mb/Unicode/win1253_to_utf8.map | 220 +-
src/backend/utils/mb/Unicode/win1254_to_utf8.map | 230 +-
src/backend/utils/mb/Unicode/win1255_to_utf8.map | 214 +-
src/backend/utils/mb/Unicode/win1256_to_utf8.map | 237 +-
src/backend/utils/mb/Unicode/win1257_to_utf8.map | 225 +-
src/backend/utils/mb/Unicode/win1258_to_utf8.map | 228 +-
src/backend/utils/mb/Unicode/win866_to_utf8.map | 237 +-
src/backend/utils/mb/Unicode/win874_to_utf8.map | 204 +-
src/backend/utils/mb/conv.c | 251 +-
.../conversion_procs/utf8_and_big5/utf8_and_big5.c | 4 +-
.../utf8_and_cyrillic/utf8_and_cyrillic.c | 8 +-
.../utf8_and_euc2004/utf8_and_euc2004.c | 6 +-
.../utf8_and_euc_cn/utf8_and_euc_cn.c | 4 +-
.../utf8_and_euc_jp/utf8_and_euc_jp.c | 4 +-
.../utf8_and_euc_kr/utf8_and_euc_kr.c | 4 +-
.../utf8_and_euc_tw/utf8_and_euc_tw.c | 4 +-
.../utf8_and_gb18030/utf8_and_gb18030.c | 4 +-
.../conversion_procs/utf8_and_gbk/utf8_and_gbk.c | 4 +-
.../utf8_and_iso8859/utf8_and_iso8859.c | 75 +-
.../utf8_and_johab/utf8_and_johab.c | 4 +-
.../conversion_procs/utf8_and_sjis/utf8_and_sjis.c | 4 +-
.../utf8_and_sjis2004/utf8_and_sjis2004.c | 6 +-
.../conversion_procs/utf8_and_uhc/utf8_and_uhc.c | 4 +-
.../conversion_procs/utf8_and_win/utf8_and_win.c | 54 +-
src/include/mb/pg_wchar.h | 84 +-
111 files changed, 147742 insertions(+), 367346 deletions(-)
