Re: [HACKERS] Sequential scan speed, mmap, disk i/o

From: Bruce Momjian <maillist(at)candle(dot)pha(dot)pa(dot)us>
To: mimo(at)interdata(dot)com(dot)pl (Michal Mosiewicz)
Cc: hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] Sequential scan speed, mmap, disk i/o
Date: 1998-05-16 01:06:11
Message-ID: 199805160106.VAA26055@candle.pha.pa.us
Lists: pgsql-hackers

> > mmap() is very slow, perhaps because you are changing the process
> > virtual table maps for each chunk you read in, and faulting them in,
> > rather than using the file system for I/O.
>
> Huh, very slow? I wouldn't agree. I rewrote your mmap program to allow
> for using reads or mmaps.
>
> I tested it on a 111MB file. I decided to use an 8192-byte buffer size
> (the standard postgres page size). My system is Linux, P166, 64MB of RAM
> (note that I have a lot of software running currently, so the cache size
> is less than 25MB). I also changed the for(j..) step size to j+=256 just
> to make sure that it won't influence the results too much and you will
> see the difference better. mmap was run with (PROT_READ, MAP_SHARED).
>
> Average results are (for sequential reading):
> Using reads: total time - 21.39 (0.44user, 6.09system, 31%CPU)
> Using mmaps: total time - 21.10 (0.57user, 4.92system, 25%CPU)
>
> Note that in the case of reads the program spends much more time in system
> calls and uses more CPU. You may notice that on Linux, using mmap
> is about 20% cheaper than read. For random reads it's slightly
> more than 20%, as I remember. Total time is similar in both cases because
> of the throughput limit of my HD.
>
> BTW, are you sure that your program was counting mmaps properly? When I
> run it on my system it counts much more than it should. On my
> system the offset crossed the file's boundary, and then it ran for a minute
> or more before it stopped. I attach my version (with a hardcoded 111MB file
> size to prevent that; of course you may change it).

OK, here are my results using your test program:

Basically, Linux is double my speed for 8k mmap'ed chunks. Around 32k
chunks, I get closer, and 8mb chunks are the same. Glad to hear Linux
has optimized mmap() recently, because BSD/OS looks much slower than
Linux on this.
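
To make the chunk-size comparison concrete, here is a minimal sketch of an mmap() scan that takes the chunk size as a parameter, so the same code can be timed at 8k, 32k and 8MB. It is not the program that produced the numbers in this message; the name count_spaces_mmap and its error convention (-1 on failure) are assumptions made for illustration. Unlike a fixed-size mapping, it maps only the remaining bytes for the last chunk.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: scan `path` by mapping `chunk` bytes at a time and
 * return the number of space characters, or -1 on error.  `chunk` must be
 * a multiple of the system page size, because mmap() offsets must be
 * page-aligned. */
long count_spaces_mmap(const char *path, size_t chunk)
{
    struct stat st;
    long        spaces = 0;
    off_t       off;
    int         fd;

    assert(chunk > 0 && chunk % (size_t) sysconf(_SC_PAGESIZE) == 0);

    fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;
    if (fstat(fd, &st) != 0)
    {
        close(fd);
        return -1;
    }

    for (off = 0; off < st.st_size; off += (off_t) chunk)
    {
        /* Map only what remains for the final, possibly short, chunk. */
        size_t  left = (size_t) (st.st_size - off);
        size_t  len = left < chunk ? left : chunk;
        char   *addr = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);
        size_t  j;

        if (addr == MAP_FAILED)
        {
            close(fd);
            return -1;
        }
        for (j = 0; j < len; j++)
            if (addr[j] == ' ')
                spaces++;
        munmap(addr, len);
    }
    close(fd);
    return spaces;
}
```

Calling count_spaces_mmap(path, 8192), count_spaces_mmap(path, 32768) and count_spaces_mmap(path, 8192 * 1024) under time(1) would reproduce the three chunk sizes compared here, assuming a page size that divides 8192.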

Now, why does PostgreSQL sequentially scan a 160MB file in 37 seconds,
using its standard 8k buffers, when even your read test using 8k
buffers takes 54 seconds for me?
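
The gap is easier to see as throughput. The sketch below converts the two timings above into MB/s. It rests on two assumptions that the message does not state outright: that "160MB" means 160 * 1024 * 1024 bytes, and that the 8k read test read the 111329280 bytes hardcoded in the test program. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Throughput in MB/s (1 MB = 1024 * 1024 bytes) for a scan of `bytes`
 * bytes that took `seconds` of wall-clock time. */
double scan_mb_per_sec(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / (1024.0 * 1024.0) / seconds;
}

/* Print the two figures being compared. */
void print_comparison(void)
{
    /* Assumption: "160MB" means 160 * 1024 * 1024 bytes. */
    printf("PostgreSQL scan: %.2f MB/s\n",
           scan_mb_per_sec(160.0 * 1024 * 1024, 37.0));
    /* Assumption: the read test covered the hardcoded 111329280 bytes. */
    printf("read() test, 8k: %.2f MB/s\n",
           scan_mb_per_sec(111329280.0, 54.60));
}
```

Under those assumptions the PostgreSQL scan runs at roughly 4.3 MB/s and the 8k read test at roughly 1.9 MB/s, so the backend is more than twice as fast as the standalone test on the same machine.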

In storage/file/fd.c, I see it using read(), and I assume they are 8k
chunks being read:

returnCode = read(VfdCache[file].fd, buffer, amount);
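
For comparison with that call, a standalone read() loop over 8k buffers might look like the sketch below. This is not the fd.c code and not the test program used for the timings; count_spaces_read and SCAN_BUF_SIZE are hypothetical names, and the 8192-byte buffer simply mirrors the page size mentioned in this thread.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#define SCAN_BUF_SIZE 8192      /* assumed to match the 8k page size */

/* Hypothetical helper: read `path` sequentially in SCAN_BUF_SIZE chunks
 * and return the number of space characters seen, or -1 on error. */
long count_spaces_read(const char *path)
{
    char    buffer[SCAN_BUF_SIZE];
    long    spaces = 0;
    ssize_t nread;
    int     fd;

    assert(path != NULL);

    fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    /* read() may return fewer bytes than requested; 0 means end of file. */
    while ((nread = read(fd, buffer, sizeof(buffer))) > 0)
    {
        ssize_t j;

        for (j = 0; j < nread; j++)
            if (buffer[j] == ' ')
                spaces++;
    }
    close(fd);
    return (nread < 0) ? -1 : spaces;
}
```

Each pass through the loop costs one system call per 8k, which is where a large "sys" time would come from if the kernel's per-call overhead is high.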

Also attached is a modified version of my mmap() program that uses
fstat() to check the file size so it knows when to stop. However, I have
also modified it to use a hardcoded file size that matches yours.

Not sure what to conclude from these numbers.

---------------------------------------------------------------------------

mmap, 8k
47.81 real 0.66 user 33.12 sys

read, 8k
54.60 real 0.51 user 46.80 sys

mmap, 32k
29.80 real 0.23 user 13.81 sys

read, 32k
26.80 real 0.12 user 14.82 sys

mmap, 8mb
21.25 real 0.03 user 5.49 sys

read, 8mb
20.43 real 0.14 user 3.60 sys

my mmap, 8k, your file size
64.67 real 15.99 user 34.00 sys

my mmap, 32k, your file size
43.12 real 15.95 user 14.29 sys

my mmap, 8mb, your file size
34.31 real 15.88 user 5.39 sys

---------------------------------------------------------------------------

#include <stdio.h>
#include <fcntl.h>
#include <assert.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>

#define MMAP_SIZE (8192 * 1024)

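int main(void)
{
    int j, fd, rc, spaces = 0;
    off_t off;
    char *addr;
    struct stat filestat;

    fd = open("/u/pg/data/base/test/test", O_RDONLY, 0);
    assert(fd != -1);
    /* keep the call outside assert() so it still runs with -DNDEBUG */
    rc = fstat(fd, &filestat);
    assert(rc == 0);

    /* override the real size with a hardcoded one matching the other test file */
    filestat.st_size = 111329280;

    for (off = 0; 1; off += MMAP_SIZE)
    {
        addr = mmap(0, MMAP_SIZE, PROT_READ, MAP_SHARED, fd, off);
        /* mmap() reports failure as MAP_FAILED, not NULL */
        assert(addr != MAP_FAILED);
        madvise(addr, MMAP_SIZE, MADV_SEQUENTIAL);

        for (j = 0; j < MMAP_SIZE; j++)
        {
            /* touch every byte; as written this tallies non-space bytes */
            if (*(addr + j) != ' ')
                spaces++;
            /* stop at the last byte of the (hardcoded) file size */
            if (off + j + 1 == filestat.st_size)
                goto done;
        }
        munmap(addr, MMAP_SIZE);
    }
done:
    printf("%d\n", spaces);
    return 0;
}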

--
Bruce Momjian | 830 Blythe Avenue
maillist(at)candle(dot)pha(dot)pa(dot)us | Drexel Hill, Pennsylvania 19026
+ If your life is a hard drive, | (610) 353-9879(w)
+ Christ can be your backup. | (610) 853-3000(h)
