Re: Have we out-grown Flex?

From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Have we out-grown Flex?
Date: 2012-05-02 11:47:01
Message-ID: CAEYLb_XhepbLO3V49ZQF900fvHbPvm-Gga7Ub5DNMNWwO-KOjw@mail.gmail.com
Lists: pgsql-hackers

On 2 May 2012 04:24, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I think Tom's question of whether the parser or lexer is the problem
> is something that ought to be investigated.  Personally, I suspect
> that our tendency to use lists everywhere, for everything, an artifact
> of our proud LISP heritage, may be costing us dearly in terms of parse
> time.  However, there's a difference between suspicion and proof, and
> I certainly have none of the latter.

It's funny that you should say that, because I actually had a
discussion with Greg Stark over drinks about this recently. Jon
Bentley has described an experiment that undermines many traditional
beliefs about the trade-offs of using a linked list rather than an
array. The test is this: on a modern computer, generate N integers at
random and insert each one into a sequence, keeping the sequence in
order. Then, randomly remove integers from the sequence, one at a
time. What is the performance at different sizes of N when the
sequence is a doubly-linked list, and when it is an array? If you
graph the two, the results are rather surprising. I think his graph
went up to 100,000. The initial result shows the line for the array
down near the bottom of the graph, while the list line looks
exponential. Even if you use memory pooling, so that the list doesn't
have to allocate each node on demand, the array still roundly beats
the list, albeit not quite so convincingly, and without the list line
hitting the roof near 100,000. I don't think I need to point out that
this is insertion and deletion - exactly the workload lists are
supposed to win at.
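
To make that concrete, here is a minimal sketch of the experiment as
I understand it - my own reconstruction in C, not Bentley's code, and
with the element count and timing harness chosen purely for
illustration. Both versions use a linear scan to find the insertion
point, as in the original test:

/*
 * Sketch of the list-vs-array experiment described above: insert N
 * random integers in sorted order, then delete elements at random
 * positions until the sequence is empty.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

typedef struct Node {
    int val;
    struct Node *prev;
    struct Node *next;
} Node;

static void array_test(int n)
{
    int *a = malloc(n * sizeof(int));
    int len = 0;

    while (len < n) {                   /* insert in sorted order */
        int v = rand(), pos = 0;

        while (pos < len && a[pos] < v)
            pos++;
        memmove(&a[pos + 1], &a[pos], (len - pos) * sizeof(int));
        a[pos] = v;
        len++;
    }
    while (len > 0) {                   /* delete at random positions */
        int pos = rand() % len;

        memmove(&a[pos], &a[pos + 1], (len - pos - 1) * sizeof(int));
        len--;
    }
    free(a);
}

static void list_test(int n)
{
    Node *head = NULL;

    for (int i = 0; i < n; i++) {       /* insert in sorted order */
        Node *node = malloc(sizeof(Node));
        Node *cur = head, *prev = NULL;

        node->val = rand();
        while (cur && cur->val < node->val) {
            prev = cur;
            cur = cur->next;
        }
        node->prev = prev;
        node->next = cur;
        if (prev)
            prev->next = node;
        else
            head = node;
        if (cur)
            cur->prev = node;
    }
    for (int len = n; len > 0; len--) { /* delete at random positions */
        Node *cur = head;

        for (int skip = rand() % len; skip > 0; skip--)
            cur = cur->next;
        if (cur->prev)
            cur->prev->next = cur->next;
        else
            head = cur->next;
        if (cur->next)
            cur->next->prev = cur->prev;
        free(cur);
    }
}

int main(void)
{
    int n = 100000;
    clock_t t;

    srand(0);
    t = clock();
    array_test(n);
    printf("array: %.2fs\n", (double) (clock() - t) / CLOCKS_PER_SEC);
    t = clock();
    list_test(n);
    printf("list:  %.2fs\n", (double) (clock() - t) / CLOCKS_PER_SEC);
    return 0;
}

(Note that the array shifts elements with memmove(), which is exactly
the kind of bulk, cache-friendly movement I describe below.)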

The point is that on modern architectures, with many layers of cache,
the cost of the linear search to find the insertion point completely
dominates - and this is without the array availing itself of a binary
search, in the interest of fairness. CPU caches happen to do a very
good job of shifting, on average, half of the array when inserting at
a random point. The list is also much larger than the array, because
of the two extra pointers per element (yes, I know that we use singly
linked lists, but ours have other disadvantages compared to typical C
lists), and that matters. Predictable access patterns - that is,
temporal and spatial locality, which make good use of the memory
hierarchy - matter a lot. We're not talking about a small difference,
either; I think the difference in the published test was something
like the array being 50 - 100 times faster. The list results in far
more cache misses than the array. So, I'm right there with you -
using lists everywhere is bad news.
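
To put a rough number on the size point, here's a tiny illustration -
a generic doubly linked node holding an int, not our actual pg_list.h
cells - of the per-element footprint on a typical 64-bit machine:

#include <stdio.h>

/*
 * Two 8-byte pointers plus an int plus padding typically comes to
 * 24 bytes per list element, versus 4 bytes per element in a
 * contiguous int array, so the array packs several times more
 * elements into each cache line.
 */
typedef struct DNode {
    struct DNode *prev;
    struct DNode *next;
    int val;
} DNode;

int main(void)
{
    printf("array element: %zu bytes\n", sizeof(int));   /* typically 4 */
    printf("list node:     %zu bytes\n", sizeof(DNode)); /* typically 24 */
    return 0;
}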

As for the question of Flex/Quex, I'm not in an immediate position to
sink any more time into it, but it remains on my list of things to
pursue for 9.3, though it's only about number 3 on that list right
now.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
