Re: Some regular-expression performance hacking

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Joel Jacobson <joel(at)compiler(dot)org>
Cc: Chapman Flack <chap(at)anastigmatix(dot)net>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Some regular-expression performance hacking
Date: 2021-02-26 03:51:47
Message-ID: 221848.1614311507@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
> So I rearranged things to allocate arcs out of a common pool, and for
> good measure made the state allocation code do the same thing. I was
> pretty much blown away by the results: not only is the average-case
> space usage about half what it is on HEAD, but the worst-case drops
> by well more than a factor of ten.

BTW, I was initially a bit baffled by how this could be. Per previous
measurements, the average number of arcs per state is around 4; so
if the code is allocating ten arcs for each state right off the bat,
it's pretty clear how we could have a factor-of-two-or-so bloat
problem. And I think that does explain the average-case results.
But it can't possibly explain bloats of more than 10x.

After further study I think this is what explains it:

* The "average number of arcs" is pretty misleading, because in
a large NFA some of the states have hundreds of out-arcs, while
most have only a couple.

* The NFA is not static; the code moves arcs around all the time.
There's actually a function (moveouts) that deletes all the
out-arcs of a state and creates images of them on another state.
That operation can be invoked a lot of times during NFA optimization.

* Once a given state has acquired N out-arcs, it keeps that pool
of arc storage, even if some or all of those arcs get deleted.
Indeed, the state itself could be dropped and later recycled,
but it still keeps its arc pool. Unfortunately, even if it does
get recycled for re-use, it's likely to be resurrected as a state
with only a couple of out-arcs.

So I think the explanation for 20x or 30x bloat arises from the
optimize pass resulting in having a bunch of states that have large
but largely unused arc pools. Getting rid of the per-state arc pools
in favor of one common pool fixes that nicely.

I realized while looking at this that some cycles could be shaved
from moveouts, because there's no longer a reason why it can't just
scribble on the arcs in-place (cf. now-obsolete comment on
changearctarget()). It's late but I'll see about improving that
tomorrow.

regards, tom lane

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message miyake_kouta 2021-02-26 04:18:20 [PATCH] pgbench: Bug fix for the -d option
Previous Message Thomas Munro 2021-02-26 03:13:54 Re: pgsql: pg_collation_actual_version() -> pg_collation_current_version().