Re: Specifying many rows in a table

From: Steve Atkins <steve(at)blighty(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Specifying many rows in a table
Date: 2004-01-29 03:24:05
Message-ID: 20040129032405.GA11231@gp.word-to-the-wise.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Wed, Jan 28, 2004 at 04:14:21PM -0800, Steve Atkins wrote:
> I'm trying a number of variants (in(list), in(select), exists, join) and
> benchmarking all of them across a range of numbers, randomly chosen or
> clustered with appropriate splashing of the DB and disk buffers between
> runs.
>
> Once the perl script has finished I'll post a pointer to the data.

On the off-chance anyone else is interested in the performance
variations (or is searching the archives from the year 2020)...

I have a table of 10,000,000 rows, each row having a few hundred bytes
of data. It has an integer primary key, and is clustered on that key.
Real data, not synthetic, not that it makes a difference for this
test.

I have in the application layer a list of the rows I want to retrieve
(a sorted list of unique integers) and want to retrieve all those rows
from the big table.

The test system is a single processor Athlon 1800 with a Reiser
filesystem on two striped decent scsi drives. PostgreSQL is pretty
well tuned. Between each test run PostgreSQL was shut down, a gig of
data read from the disk to flush OS caches and PostgreSQL restarted.

The big table is indexed and analysed. All temporary tables are analysed
(and times include construction, population, any indexing and analysing
of the table, but not destruction of it, nor time to transfer data to the
client).

The sequence of ids to be retrieved is identical in each run. In the
first batch of tests it's completely random, but sorted. In the second batch
it's a sequence (42, 43, 44, 45...) with a random starting point.

inscalar: select * from big where id in (1, 17, 43 ...., 2071)
inselect: select * from big where id in (select id from small)
exists: select * from big where exists (select 1 from small where id=big.id)
join: select * from big, small where big.id = small.id;

inselect-idx: As inselect, but small is indexed
exists-idx: As exists, but small is indexed
join-idx: As join, but small is indexed

In all but the first case, small is created as a temporary table, populated
with a sequence of inserts (inside a transaction), possibly indexed then
analysed.

number of random rows retrieved
1 10 100 1000 10000 100000
inscalar 0.394 0.818 3.80 23.5
inselect 0.752 0.770 0.751 0.977 0.798 0.790
join 0.728 0.746 0.625 0.766 0.703 0.806
inselect-idx 0.655 0.642 0.748 0.822 0.777 0.820
exists-idx 50.7 49.4 49.4 50.0 49.4 51.2
join-idx 0.657 0.703 0.760 0.730 0.731 0.815

number of sequential rows retrieved
1 10 100 1000 10000 100000
inscalar 0.464 0.474 0.488 0.529
inselect 0.820 0.826 0.778 0.776 0.612 0.770
join 0.754 0.855 0.831 0.829 0.899 0.742
inselect-idx 0.686 0.679 0.703 0.747 0.747 0.729
exists-idx 49.2 79.8 49.5 49.3 49.3 53.7
join-idx 0.625 0.754 0.697 0.770 0.750 0.783

Overall, IN(scalar) looks pretty good for single digit numbers. A simple
join looks nice for anything else - and indexing the temporary table
doesn't seem to buy anything.

Pretty impressive overall, though.

Cheers,
Steve

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Don Bowman 2004-01-29 04:20:30 select ... distinct performance
Previous Message Craig Addleman 2004-01-29 02:57:47 Pl/tcl auto-load problem