[PATCH] Using pread instead of lseek (with analysis)

From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: pgsql-patches(at)postgresql(dot)org
Subject: [PATCH] Using pread instead of lseek (with analysis)
Date: 2005-10-08 15:48:56
Message-ID: 20051008154852.GC30988@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-patches

Hi,

Below is a patch that tries to determine (for linux) if the pread is
emulated or not (using an autoconf test), and if so use it instead of
lseek. It doesn't remove all uses of lseek. For example, PostgreSQL
uses lseek to find the end of the file when extending. So in a bulk
load, there is one seek per block written anyway. Also, the xlog seems
to be completely seperate code.

Short summary: Actual benefit, (on Linux anyway) not distinguishable
from the noise.

Longer summary: It reduces CPU usage slightly (few percent user,
overall almost nothing), but the benefit is usually swamped by the
actual cost of the read from disk (system cache miss). I would actually
make the change because it simplifies the code a bit.

Comparisons on other systems may be worthwhile.

What follows is the detailed analysis of the change. My conclusion is
that on Linux at least where the syscall overhead is so low, it's not
worth the change for performance. It is cleaner code-wise IMHO.

----
Initially, the table is 1.5 million rows of random integer data. Worst
case scenario: ordered output run with seqscan disabled. Table is 66MB,
Index is 56MB and includes all three data columns. The queries took 98
+/- 1 seconds. Over the entire query it consistantly loaded about 55MB
of data from disk (about 50% system cache miss rate).

Query: select count(*) from (select * from bigtable order by a,b,c)

Exhibit A below is the vmstat output with 10 second averages. The
interesting columns are cpu us (user) and cpu sy (system). Note, system
is defined as stuff not user and not idle. Waiting for the I/O to
complete is counted as system, even though it's not doing all that
much. It shows a very, very slight reduction in user time, though it is
measurable and repeatable (both run a number of times with same
results).

What this tells me is that lseek time is swamped by actual read time,
so not a major benefit in most cases.

Exhibit B is the same but with only 1 million rows (Table 48MB, Index
29MB), this fitting in cache memory. The original index had some slack
from deleting records (I did a vacuum full though). The queries now
take 60 +/- 1 seconds. As you can see even without accessing the disk,
it still isn't making a big difference.

Exhibit C is the output of strace -c on the second test. As you can see
there is a difference, but marginal.

Exhibit D is the output of a seq scan, in case anyone figured I might
be buggering the queries somehow. As you can see, it shows that the
index scan is extremely expensive. Only 5900 reads for the seq scan
(enough to read the whole table) instead of 890000. The shared buffers
is the default 1000. Upping this would close the gap somewhat but that
is irrelevent for this test (lseek vs pread).

Exhibit E is the strace -c output, for completeness. lseek doesn't
appear in the top 10.

If you read this far, thanks for the attention! If you're running
another system, it would be nice to test. Remember to update the
autoconf code for your system!

Have a nice day,

Misc details:

command line:
time echo "select count(*) from(select * from bigtable order by a,b,c) x" | \
./postgres -f s -D /var/lib/postgres/data test
System:
Linux kleptog 2.6.8-1-686 #1 Thu Nov 25 04:34:30 UTC 2004 i686 GNU/Linux
128MB RAM
Intel(R) Celeron(TM) CPU 1000MHz

Exhibit (A): Index scan, dataset > memory

Using ordinary llseek/read:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 0 4 1372 940 107564 0 0 534 18 1070 145 18 56 0 27
1 0 4 1648 992 107260 0 0 548 3 1071 147 18 63 0 19
1 0 4 1552 984 107348 0 0 542 4 1069 144 20 69 0 11
1 0 4 1432 980 107496 0 0 550 6 1071 146 19 69 0 12
1 0 4 1648 664 107600 0 0 560 3 1071 148 19 67 0 15
1 0 4 1576 632 107728 0 0 597 3 1076 157 19 68 0 13
2 0 4 1768 640 107528 0 0 618 14 1080 163 19 67 0 14

Using pread():

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 0 4 1680 600 107616 0 0 534 18 1070 145 17 56 0 28
1 0 4 1656 644 107644 0 0 543 5 1070 147 16 64 0 19
1 0 4 1344 624 107976 0 0 540 2 1068 142 19 70 0 11
1 0 4 1656 632 107652 0 0 555 6 1071 147 17 70 0 13
1 0 4 1608 540 107796 0 0 564 13 1074 149 18 68 0 15
1 0 4 1392 504 108068 0 0 598 4 1076 158 17 69 0 14
0 1 4 1368 460 108120 0 0 626 3 1079 165 17 69 0 14

Exhibit (B): Index scan, dataset < memory

Using ordinary llseek/read:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 0 4 7500 3084 98000 0 0 0 25 1002 8 22 78 0 0
1 0 4 7484 3100 98000 0 0 0 3 1001 5 22 78 0 0
1 0 4 7468 3116 98000 0 0 0 3 1001 5 22 78 0 0
1 0 4 7444 3140 98000 0 0 0 10 1003 6 22 78 0 0

Using pread():

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 0 4 7116 3452 98000 0 0 0 18 1002 8 21 79 0 0
1 0 4 7100 3468 98000 0 0 0 8 1001 5 21 79 0 0
1 0 4 7080 3488 98000 0 0 0 4 1001 5 21 79 0 0
1 0 4 7064 3504 98000 0 0 0 10 1003 5 22 78 0 0

Exhibit (C): strace -c output of backend

Using ordinary llseek/read:

% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
85.27 51.906570 58 891526 read
14.71 8.956573 10 887770 _llseek
0.00 0.001983 17 114 29 open

Using pread():

% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
99.98 52.977031 59 891479 pread
0.00 0.002164 114 19 write

Exhibit (D): sequential scan: dataset < memory

Using ordinary llseek/read:

$ time echo "select count(*) from(select * from bigtable) x" | ./postgres.read -D /var/lib/postgres/data test

PostgreSQL stand-alone backend 8.1beta2
backend> 1: count (typeid = 20, len = 8, typmod = -1, byval = f)
----
1: count = "1080838" (typeid = 20, len = 8, typmod = -1, byval = f)
----
backend>
real 0m1.264s
user 0m0.921s
sys 0m0.328s

Using pread():

$ time echo "select count(*) from(select * from bigtable) x" | ./postgres -D /var/lib/postgres/data test

PostgreSQL stand-alone backend 8.1beta2
backend> 1: count (typeid = 20, len = 8, typmod = -1, byval = f)
----
1: count = "1080838" (typeid = 20, len = 8, typmod = -1, byval = f)
----
backend>
real 0m1.213s
user 0m0.827s
sys 0m0.371s

Exhibit (E): strace -c of sequential scan case

Using ordinary llseek/read:

% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
96.78 0.318179 53 5950 read
0.68 0.002249 118 19 write
0.57 0.001879 17 109 29 open
<snip>
0.06 0.000197 5 40 _llseek

Using pread():

% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
95.92 0.316512 54 5903 pread
0.70 0.002326 122 19 write
0.62 0.002032 406 5 fsync

--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment Content-Type Size
pread.patch text/plain 4.4 KB

Responses

Browse pgsql-patches by date

  From Date Subject
Next Message Bruce Momjian 2005-10-08 16:48:28 test, please ignore
Previous Message Bruce Momjian 2005-10-08 15:30:02 Re: [HACKERS] Patching dblink.c to avoid warning about open