IN list processing performance (yet again)

From: Dave Tenny <tenny(at)attbi(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: IN list processing performance (yet again)
Date: 2003-05-28 12:51:49
Message-ID: 3ED4B0E5.3050703@attbi.com

Having grepped the web, it's clear that this isn't the first or last
time this issue will be raised.

My application relies heavily on IN lists. The lists are primarily
constant integers, so queries look like:

SELECT val FROM table WHERE id IN (43, 49, 1001, 100002, ...)

Performance is critical, and the size of these lists depends a lot on
how the larger 3-tier application is used,
but it wouldn't be out of the question to retrieve 3000-10000 items.

PostgreSQL 7.3.2 seems to have a lot of trouble with large lists.

I ran an experiment that ran queries on a table of two integers (ID,
VAL), where ID is a primary key and the subject
of IN list predicates. The test used a table with one million rows; ID
is appropriately indexed,
and I VACUUMed/analyzed the database after the table load.

I ran tests on IN lists from about 100 to 100,000 entries.

I also ran tests where I picked the rows out one-by-one using
parameterized statements, e.g.

SELECT val FROM table WHERE id = ?

I'm trying to decide how much I should use parameterized statements and
when to work around buffer size limitations
in JDBC transport, general query processing, etc.
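For concreteness, here is a minimal sketch of the two query shapes being compared. The table/column names ("table", "id", "val") are just the example names from above, not a real schema, and this is illustrative client code, not the attached test program:

```java
// A sketch of the two query shapes under comparison; "table", "id",
// and "val" are the example names from the post, not a real schema.
class InListQueries {
    // Builds the literal IN-list form of the query.
    static String buildInListQuery(int[] ids) {
        StringBuilder sb =
            new StringBuilder("SELECT val FROM table WHERE id IN (");
        for (int i = 0; i < ids.length; i++) {
            if (i > 0) sb.append(", ");
            sb.append(ids[i]);
        }
        return sb.append(')').toString();
    }

    // The parameterized alternative fetches one row per execution,
    // reusing a single plan:
    //   PreparedStatement ps = conn.prepareStatement(
    //       "SELECT val FROM table WHERE id = ?");
    //   for (int id : ids) { ps.setInt(1, id); ps.executeQuery(); ... }
}
```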

So here are my questions as a prelude to the data noted below:

1) What is the acceptable limit of jdbc Statement buffer sizes for
executeQuery()?
(Presumably it's not really a JDBC question, but a PostgreSQL
query/buffering/transport question).

2) What is the expected acceptable limit for the number of items in an
IN list predicate such as
those used here. (List of constants, not subselects).

3) What should I expect for performance capabilities/limitations from
PostgreSQL on this type of problem?
(Set my expectations, perhaps they're unrealistic).

4) What are my alternatives for picking specific rows for thousands of
elements with sub-second response times
if not IN lists? (This is crucial!)
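One alternative I've seen suggested (I haven't benchmarked it myself) is to load the ids into a temporary table and join against it, so the planner sees an ordinary join instead of a huge expression. A hypothetical sketch of the statements a client would issue; "wanted_ids" and the ON COMMIT DROP choice are my assumptions, and table/column names again mirror the example above:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical temp-table workaround: load ids, then join.
// None of this is the attached test program's actual code.
class TempTablePlan {
    // Returns the statements a client would issue, in order.
    static List<String> statements() {
        List<String> stmts = new ArrayList<>();
        stmts.add("CREATE TEMP TABLE wanted_ids (id integer) ON COMMIT DROP");
        // The ids themselves would go in via a batched
        // PreparedStatement: "INSERT INTO wanted_ids VALUES (?)".
        stmts.add("INSERT INTO wanted_ids VALUES (?)");
        stmts.add("SELECT t.val FROM table t JOIN wanted_ids w ON t.id = w.id");
        return stmts;
    }
}
```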

---------------------------------------------

Here is a summary of my observations of query times, and I've attached
the test program (more on that below).

1) PostgreSQL exhibits worse-than-linear performance behavior with
respect to IN list size.
This is bad.

2) Parameterized statements exhibit the expected linear performance
characteristics.

3) The break-even point for using IN lists vs. parameterized statements
in my environment

(RedHat Linux 9.0, PostgreSQL 7.3.2, 512MB memory, 7200RPM 100UDMA
IDE disks, AMD1600Mhz)

is about 700 items in the IN list. Beyond that, the IN list
scalability curve makes it impractical.
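Given that break-even, one stopgap is to split a big id list into chunks that each stay below it and issue one IN-list query per chunk. A sketch; the chunk size of 500 is my assumption based on the ~700-item break-even above, not a measured optimum:

```java
import java.util.ArrayList;
import java.util.List;

// Splits a large id list into IN-list-sized chunks.
// maxPerQuery = 500 is an assumption derived from the ~700-item
// break-even observed above, not a tuned value.
class InListChunker {
    static List<int[]> chunk(int[] ids, int maxPerQuery) {
        List<int[]> chunks = new ArrayList<>();
        for (int start = 0; start < ids.length; start += maxPerQuery) {
            int len = Math.min(maxPerQuery, ids.length - start);
            int[] part = new int[len];
            System.arraycopy(ids, start, part, 0, len);
            chunks.add(part);
        }
        return chunks;
    }
}
```

Each chunk would then be run through the IN-list query builder, and the result sets concatenated client-side.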

4) God help you if you haven't vacuum/analyzed the newly loaded
table. Without this,
IN list processing is even worse!

For just 10 elements in the IN list:

*Without* VACUUMDB, IN lists suck beyond measure:
Elapsed time for 10 IN list elements: 2638 ms
Elapsed time for 10 parameterized elements: 9 ms

*With* VACUUMDB: IN lists recover a bit:
Elapsed time for 10 IN list elements: 10 ms
Elapsed time for 10 parameterized elements: 24 ms

However it's VERY interesting to note that parameterized statements
worked well. That implies
probable disparity in plan generation. It's worth noting that it
didn't *feel* like I was getting the same
delay when I ran the query from the 'psql' client, but since it
doesn't report times I can't be sure.

5) The rest of my results are from a vacuumed database (and are listed
in detail in the attached program).

The interesting points are:

For an IN list of 700 elements:

MySQL 3.23.56 (with INNODB tables) takes 19ms, 73ms with
parameterized statements.
PostgreSQL takes 269ms, 263ms with parameterized statements.

For larger lists, MySQL happily processed a 90,000 element IN list
in 1449ms,
9283 ms using parameterized statements.

PostgreSQL craps out trying to process 8000 elements with the error:
out of free buffers: time to abort!

PostgreSQL takes 45,566ms for 7000 elements in an IN list (ouch!)
, and 2027ms for a parameterized statement.
MySQL easily beats that with 10 times the data.

6) Using a remote client on the LAN (10/100) to run the client piece on
a separate machine from the database
server yielded little change in the results. Prepared statements
worked pretty well even with actual wire latency,
surprise! (Of course it's very little latency on my lan, not like,
say, the database server running in a different city
which customers have been known to do).

The MySQL and PostgreSQL installations are the default RedHat 9.0
distribution packages,
I haven't tweaked either's installation parameters. (though MySQL was
updated by RedHat from 3.23.54a to 3.23.56
as part of an automatic upgrade).

My goal here isn't to say "why aren't you like MySQL". I've been using
PostgreSQL for a year as the development database of choice
with satisfactory results. But I won't be able to support customers who
want to use PostgreSQL on large deployments of my products
if I can't selectively retrieve data for several thousand elements in
sub-second times.

(PostgreSQL devos, if you want a feel good bullet, note that I don't
support MySQL at all since lack of MVCC transactions
is a showstopper from a multi-user performance standpoint).

So I'm looking for (a) solutions, answers to my questions above, and (b)
a statement of "we're on this" or "you're stuck with it" from
PostgreSQL devos who know.

----------------------------------------------------------------
On the attached program. (a Java JDBC program) Sorry, you can't just run
it "as is". Search for formfeeds (^L) if you want
to skip all the result data I logged. Compilation is straightforward,
simply supply the location of your JDBC jar file for compiling and running
(mine is noted in the program).

First, move the "if (false)" element below the table creation statements
and run the program to create the table.
Then VACUUM/analyze your database (suuuuure would be nice not to have to
vacuum).
Then move the "if (false)" element above the table creation so you won't
have to do it every time.
Then move past the formfeed and adjust the 'tryAmount' for loop to test
the amounts you're interested in.
100 to 1000 by 100's is a good starting point.

Ignore the section (part of the if (false) logic) that attempts to
figure out what the largest query size is.
Unless you want to reproduce a hung postmaster in a CPU loop, which is
what I got when I tried to run that logic,
though when I ran it I was missing the closing ')' in my IN list, which
I've since added to that code.

Thanks for any help!

Dave

Attachment Content-Type Size
INList.java text/plain 32.4 KB
