
Timing overhead and Linux clock sources

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Timing overhead and Linux clock sources
Date: 2011-12-07 02:58:06
Lists: pgsql-hackers
Over in the "add timing of buffer I/O requests" thread I mentioned 
having a system where EXPLAIN ANALYZE of a modest COUNT(*) takes 10X as 
long as just executing the statement.  Attached is a bit of SQL and a 
script that runs it multiple times that demonstrate the problem on 
systems that have it.  It assumes you've installed the Dell Store 2 
database.  I'd expect it to work with a simpler example too (just create 
something with 20,000 rows), but I haven't actually confirmed that yet.

From what I've been able to measure so far, here's the state of things:

-If you have a system with a working TSC clock source (timing data is 
pulled right from the CPU), timing overhead is reasonable enough that 
you might turn it on even for things that happen frequently, such as the 
buffer I/O timing patch enables.

-Some systems have tsc, but it doesn't work right in multi-core setups.  
Newer kernel versions know this and disable it; older ones let you pick 
tsc anyway, and bad things occur.

-Using any of the other timing mechanisms--hpet, acpi_pm, or 
jiffies--incurs extremely high timing overhead.

This doesn't conflict with the comment we got from Martijn van 
Oosterhout in the other thread:  "Learned the hard way while tracking 
clock-skew on a multicore system.  The hpet may not be the fastest (that 
would be the cpu timer), but it's the fastest (IME) that gives 
guaranteed monotonic time."  There's more background about TSC, HPET,  
and an interesting caveat about VMs I came across:

I'm starting to envision what a new documentation section discussing 
timer overhead would look like now, for Linux at least.  And I'm 
guessing other people can simulate this problem even on hardware that 
doesn't have it in the default config, simply by manually switching to 
one of the slower timing methods at boot.

The test case is just doing this:

select count(*) from customers;
explain analyze select count(*) from customers;

Lots of times, with some patterns to try to avoid caching effects.

The system with the worst clock timing issues I've found, the one that 
was so obvious it kicked off this investigation, is my Thinkpad T60 
laptop.  The overhead is worst on the old 32-bit Ubuntu 9.04 
installation I have there.  Here's some info about the system, starting 
with the 4 clock sources it supports:

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
hpet acpi_pm jiffies tsc
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
$ uname -a
Linux meddle 2.6.28-19-generic #61-Ubuntu SMP Wed May 26 23:35:15 UTC 
2010 i686 GNU/Linux
$ cat /etc/lsb-release
$ psql -c "select version()"
  PostgreSQL 8.3.11 on i486-pc-linux-gnu, compiled by GCC gcc-4.3.real 
(Ubuntu 4.3.3-5ubuntu4) 4.3.3

You change the time source by adding a stanza like this to the kernel 
boot options:
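The stanza itself got lost from this message; what's meant is the kernel's clocksource= boot parameter.  On a GRUB 2 system that would go into /etc/default/grub (the exact file and variable vary by distro, so treat this as a sketch), followed by update-grub and a reboot:

```
# /etc/default/grub -- force the acpi_pm clock source at boot
GRUB_CMDLINE_LINUX_DEFAULT="quiet clocksource=acpi_pm"
```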


Martijn said you can just cat a new clock source into that setting 
instead; I haven't evaluated that yet.  I wanted to get my baselines 
done in a situation where the kernel boots using the source I wanted, 
which turns out to be useful for a number of reasons.  Boot time is one 
of the things that varied a lot when the timer was switched.
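The cat trick Martijn mentioned works against the same sysfs files shown above.  A quick sketch (the write needs root, takes effect immediately, and reverts at reboot):

```shell
# Show the available and active clock sources (no root needed):
cat /sys/devices/system/clocksource/clocksource0/available_clocksource 2>/dev/null \
    || echo "clocksource sysfs not available"
cat /sys/devices/system/clocksource/clocksource0/current_clocksource 2>/dev/null \
    || echo "clocksource sysfs not available"

# Switch at runtime (as root); lost at the next reboot:
# echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource
```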

The hpet time source is the default on my T60 system, but there are 3 
others available.  Of those, it appears that only acpi_pm is a viable 
alternative.  Using jiffies, I gave up on the system booting after 
waiting 15 minutes.  I could boot with tsc, but everything was erratic; 
I could barely use the mouse, for example.  tsc did have a much faster 
clock as measured by EXPLAIN overhead, though.  Here are the results: 
two runs of the test program, de-interleaved into a block of fast 
(count only) runs followed by a block of slow (with EXPLAIN ANALYZE) ones:

Time Source:hpet

Time: 14.783 ms
Time: 9.021 ms
Time: 9.170 ms
Time: 8.634 ms
Time: 5.632 ms
Time: 5.520 ms
Time: 15.276 ms
Time: 9.477 ms
Time: 9.485 ms

Time: 70.720 ms
Time: 67.577 ms
Time: 57.959 ms
Time: 74.661 ms
Time: 73.557 ms
Time: 74.963 ms
Time: 87.898 ms
Time: 82.125 ms
Time: 81.443 ms

Time Source:acpi_pm

Time: 8.124 ms
Time: 5.020 ms
Time: 4.887 ms
Time: 8.125 ms
Time: 5.029 ms
Time: 4.881 ms

Time: 82.986 ms
Time: 82.366 ms
Time: 82.609 ms
Time: 83.089 ms
Time: 82.438 ms
Time: 82.539 ms

Time Source:tsc
Time: 8.371 ms
Time: 4.673 ms
Time: 4.901 ms
Time: 8.409 ms
Time: 4.943 ms
Time: 4.722 ms

Time: 16.436 ms
Time: 16.349 ms
Time: 16.139 ms
Time: 19.871 ms
Time: 17.175 ms
Time: 16.540 ms

There's the 10:1 possible slowdown I was talking about when using the 
default hpet clock, while the faster (but fundamentally unstable) tsc 
clock does better, at only around a 4:1 slowdown.

First question I considered:  is it the operating system or kernel 
causing this, rather than the hardware?  I have Debian Squeeze running a 
64-bit 2.6.32 kernel installed on here too.  It only has two time 
sources available, with hpet being the default:

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
hpet acpi_pm
$ uname -a
Linux meddle 2.6.32-5-amd64 #1 SMP Thu Nov 3 03:41:26 UTC 2011 x86_64 
$ psql -c "select version()"
  PostgreSQL 8.4.9 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.4.real 
(Debian 4.4.5-8) 4.4.5, 64-bit

It's no coincidence that these are the only two clock sources that 
actually worked in the older kernel.  jiffies is old stuff that's been 
on the deprecation path for a while, and the kernel eliminates TSC with 
this message during boot:

[    0.168139] checking TSC synchronization [CPU#0 -> CPU#1]:
[    0.172001] Measured 617820 cycles TSC warp between CPUs, turning off 
TSC clock.
[    0.172001] Marking TSC unstable due to check_tsc_sync_source failed

Problems with TSC on some hardware are well known enough that the newer 
kernel tests for them, then rules TSC out as unusable when that test 
fails.
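For scale, that warp can be converted to wall-clock skew.  The T60's CPU frequency isn't stated here, so the 1.83 GHz below is an assumed Core Duo speed:

```shell
# 617820 TSC cycles at an assumed 1.83 GHz core clock:
awk 'BEGIN { printf "%.0f microseconds of inter-CPU skew\n",
             617820 / 1.83e9 * 1e6 }'
```

A third of a millisecond of disagreement between the two cores' counters is plenty to make timing-sensitive code misbehave, which matches how erratic this laptop was with tsc forced on.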

Here's the same timing results on this kernel:

Time Source:hpet
Time: 5.701 ms
Time: 3.581 ms
Time: 3.560 ms
Time: 5.696 ms
Time: 3.587 ms
Time: 3.342 ms

Time: 43.492 ms
Time: 43.227 ms
Time: 43.319 ms
Time: 43.503 ms
Time: 43.332 ms
Time: 43.244 ms

Time Source:acpi_pm
Time: 6.350 ms
Time: 3.792 ms
Time: 3.490 ms
Time: 5.806 ms
Time: 3.672 ms
Time: 3.538 ms

Time: 80.093 ms
Time: 79.875 ms
Time: 79.789 ms
Time: 80.033 ms
Time: 80.330 ms
Time: 80.062 ms

That's much better in every way on hpet.  But the time ratio hasn't 
really improved; it's still at least 10:1.  Unless there's some blacker 
magic in an even newer kernel, it seems nothing I can do to this laptop 
will make it run gettimeofday() quickly.

Let's move on to something that's actually a usable server.  My main 
desktop has an Intel i7-860.  It has the same 64-bit Debian Squeeze 
setup as the second set of T60 results I just showed.  The only thing 
different about the two systems is the hardware, software is as close to 
identical as I could make it.  (That's for my own general sanity, not 
something special I did for this test--all my systems have that same 
Debian package set.)

But there are different clock options available:

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm
$ uname -a
Linux grace 2.6.32-5-amd64 #1 SMP Thu Nov 3 03:41:26 UTC 2011 x86_64 
$ psql -c "select version()"
  PostgreSQL 8.4.9 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.4.real 
(Debian 4.4.5-8) 4.4.5, 64-bit

tsc is the default clock source here.  A passing test of it, in the 
related dmesg output, looks like this:

[    0.000000] Fast TSC calibration using PIT
[    0.244640] checking TSC synchronization [CPU#0 -> CPU#1]: passed.
[    0.891992] checking TSC synchronization [CPU#0 -> CPU#7]: passed.
[    0.911993] Brought up 8 CPUs

And here's how it performs:

Time Source:tsc
Time: 2.915 ms
Time: 1.531 ms
Time: 1.500 ms
Time: 2.926 ms
Time: 1.518 ms
Time: 1.478 ms

Time: 2.945 ms
Time: 2.915 ms
Time: 2.882 ms
Time: 2.920 ms
Time: 2.862 ms
Time: 2.856 ms

That's more like it.

Next up I plan to survey a few AMD systems and a newer laptop, but I'm 
out of time for this tonight.

Greg Smith   2ndQuadrant US    greg(at)2ndQuadrant(dot)com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support

Attachment: timing-compare
Description: text/plain (161 bytes)
Attachment: explain-overhead.sql
Description: text/x-sql (294 bytes)

