Re: speed up verifying UTF-8

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Vladimir Sitnikov <sitnikov(dot)vladimir(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>
Subject: Re: speed up verifying UTF-8
Date: 2021-08-24 16:00:28
Message-ID: CAFBsxsG-ThqJke8UOOR3m8gYYRbDVTi_MO5+xPnr7vE1eO=fmA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Naively, the shift-based DFA requires 64-bit integers to encode the
transitions, but I recently came across an idea from Dougall Johnson of
using the Z3 SMT solver to pack the transitions into 32-bit integers [1].
That halves the size of the transition table for free. I adapted that
effort to the existing conventions in v22 and arrived at the attached
python script. Running the script outputs the following:

$ python dfa-pack-pg.py
offsets: [0, 11, 16, 1, 5, 6, 20, 25, 30]
transitions:
00000000000000000000000000000000 0x0
00000000000000000101100000000000 0x5800
00000000000000001000000000000000 0x8000
00000000000000000000100000000000 0x800
00000000000000000010100000000000 0x2800
00000000000000000011000000000000 0x3000
00000000000000001010000000000000 0xa000
00000000000000001100100000000000 0xc800
00000000000000001111000000000000 0xf000
01000001000010110000000000100000 0x410b0020
00000011000010110000000000100000 0x30b0020
00000010000010110000010000100000 0x20b0420

I'll include something like the attached text file diff in the next patch.
Some comments are now outdated, but this is good enough for demonstration.

[1] https://gist.github.com/dougallj/166e326de6ad4cf2c94be97a204c025f
--
John Naylor
EDB: http://www.enterprisedb.com

Attachment Content-Type Size
v22-addendum-32-bit-transitions.txt text/plain 2.2 KB
dfa-pack-pg.py text/x-python-script 2.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2021-08-24 16:04:00 Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
Previous Message Robert Haas 2021-08-24 15:37:15 Re: Parallel scan with SubTransGetTopmostTransaction assert coredump