Re: A worst case for qsort

From: John Cochran <j69cochran(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: A worst case for qsort
Date: 2014-08-07 17:00:19
Message-ID: CAGQU3n5YjS-iayG6Ah5_sBV5nhtw+zxptDNrdq8YZgP=7f5fiw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Aug 7, 2014 at 11:07 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Tue, Aug 5, 2014 at 8:15 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> > "The adversarial method works for almost any polymorphic program
> > recognizable as quicksort. The subject quicksort may copy values at
> > will, or work with lists rather than arrays. It may even pick the
> > pivot at random. The quicksort will be vulnerable provided only that
> > it satisfies some mild assumptions that are met by every
> > implementation I have seen".
> >
> > IMHO, while worst case performance is a very useful tool for analyzing
> > algorithms (particularly their worst case time complexity), a worst
> > case should be put in its practical context. For example, if we had
> > reason to be concerned about *adversarial* inputs, I think that there
> > is a good chance that our qsort() actually would be problematic to the
> > point of driving us to prefer some generally slower alternative.
>

If one is concerned about adversarial inputs, then as mentioned earlier,
two main steps need to be taken.
1. Don't permit potential adversaries define the comparison function used
by qsort().
2. Perform random selection of pivot to prevent precomputed lists of data
that invoke quadratic behavior in qsort.

And before the argument that a random pivot will be less efficient than the
median of median that's currently being used, you need to look at what is
currently being used.
1. If a "small" partition is being sorted, the element at n/2 is selected
as the pivot with n being the size of the partition.
2. If a "medium" partition is being sorted, then pivot selected is the
median of the elements at 0, n/2, and n.
3. And finally, if a "large" partition is being sorted, potential pivots
are selected from 0, n/8, n/4, 3n/8, n/2, 5n/8, 3n/4, 7n/8, and n. Of
those 9 candidates, the median of medians is selected as the pivot.

There is nothing in the world that would prevent the following.
1. Short partition - select a pivot at random.
2. Medium partition - select the median of 3 randomly selected candidates.
3. Large partition - select the median of medians of 9 randomly selected
candidates.

Using the above random pivot selection methods, the performance of a
hardened qsort should be close to that of an unhardened qsort. The only
real difference is the cost of generating random numbers to select the
candidates for the median tests. However, if an malicious compare function
is permitted to be defined, it would still be vulnerable to that attack,
but it would be pretty much immune to a precomputed data attack.

Or perhaps it just might be simpler to detect an attack and fall back to a
sort algorithm that doesn't have quadratic behavior.

What I mean by that is the qsort in the "Engineering a Sort Function" has a
big O of approximately 1.1 n ln n comparisons. If upon starting qsort, it
computes an upper bound of comparisons it's willing to perform. Say 3 n ln
n or so (1.386 n log n is the estimate for a randomly selected pivot). Then
if the upper bound is reached, the qsort function backs out leaving the
array in a partially sorted order, and then calls a slower sort that
doesn't exhibit quadratic behavior to actually finish the sort.

The result of doing that would be most, if not all, sorts being handled by
qsort in a timely fashion. But if bad luck or an adversary strikes, the
qsort bails out after things look bad and passes the task to a different
sort. I'll be the first to admit that the hybrid approach would be slower
in those bad cases than it would be if the alternate sort was selected at
the beginning, but it would be faster than letting qsort continue with
quadratic behavior.

--

There are two ways of constructing a software design: One way is to make it
so simple that there are obviously no deficiencies and the other way is to
make it so complicated that there are no obvious deficiencies. — C.A.R.
Hoare

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2014-08-07 17:16:15 Re: A worst case for qsort
Previous Message Abhijit Menon-Sen 2014-08-07 16:07:37 Re: replication commands and log_statements