Skip site navigation (1) Skip section navigation (2)

Re: Can lots of small writes badly hamper reads from other tables?

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Can lots of small writes badly hamper reads from other tables?
Date: 2012-01-24 23:46:01
Message-ID: (view raw, whole thread or download thread mbox)
Lists: pgsql-performance
On 24.1.2012 22:36, Dave Crooke wrote:
> Hi guys
> Thanks for the quick followups folks .... please note I am deliberately
> running a setup without commit guarantees, so a lot of the conventional
> advice about not doing small writes isn't applicable, but I do want to
> understand more about how this affects PG internals even if the I/O is
> smoothed out.
> By "commit consistency off" I am referring to the setting
> "synchronous_commit = off" in postgresql.conf .... IIRC this should mean
> semantically that a DB crash may lose data that was reported back to the
> app as successfully committed, but will result in a consistent state on
> reboot and recovery. In this case IIUC the "120 commits per second per
> drive" limit does not apply, and I hope the advice about testing fsync
> is similarly not applicable to my case. Also, IIUC that settings like
> commit_siblings and commit_delay should be ignored by PG in my case.
Oh, I haven't noticed the synchronous_commit=off bit. You're right about
the consistency guarantees (possibility of lost transactions but no

IIRC the async commit issues fsync for each commit, but does not wait
for it to finish. The question is whether this improves the way the I/O
is used or not. That's difficult to answer without more detailed info

In some cases this may actually hammer the system even worse, killing
the performance, because you're removing the "wait time" so the INSERT
processes are submitting more fsync operations than it can handle.

There are cases when this may actually improve the I/O utilization (e.g.
when there's a lot of drives in RAID).

You need to watch the drive and CPU stats to identify the causes. Is it
CPU bound (100% cpu utilization)? Is it I/O bound (drives 100% utilized)?

Moreover, it's not just about the fsync operations. If there are
constraints that need to be checked (e.g. foreign keys, unique
constrains etc.), that may cause additional I/O operations.

Maybe you could get better results with commit_delay/commit_siblings.
That effectively groups commits into a single fsync operation. (Which
synchronous_commit=off does not do IIRC).

I've seen really good results with large amounts of concurrent clients.
How many of those "insert" processes are there?

> I would be interested in learning what the **in-memory** constraints and
> costs are on the PG server side of doing a lot of small commits when
> sync wrties are _off_, e.g. the implications for the locking system, and
> whether this can affect the long queries on the other tables apart from
> general resource contention.
I really doubt this is the case. If you're interested in watching these
issues, set up a pgbench database with small scaling factor (so that the
DB fits into memory) and maybe set fsync=off. Then you'll be able to
observe the locking issues etc.

But this all is just a hypothesis, and my suggestion is that you really
verify if before trying to fix it - if the bottleneck really is inside
PostgreSQL (locking or whatever).

Eliminate all the other usual bottlenecks first - I/O and CPU. Show us
some stats, e.g. vmstat, iostat etc.

> The pattern of lots of tiny transactions is semantically correct for the
> app, and I am using a JDBC prepared statement on the Java side, which I
> believe the PG driver will turn in to a pre-compiled statement with
> enough uses (it does NOT do so on the first few hits). This should in
> theory be even cheaper than a multiple INSERT VALUES which is all text
> and has to be parsed.
> However, if necessary for performance I can bundle the inserts into
> slightly larger transactions - cases where writes fail are due only to
> occasional duplicates (same primary key) coming from upstream and are
> pretty rare, and in practice losing a batch of say 100 of these records
> occasionally is not a big deal in my world (ignoring sound of cringing
> DBAs  so I could afford to bundle into transactions and then just drop
> a whole bundle if any single write has a primary key collision.
If it's semantically correct, let's try to keep it that way.

> Storage setup varies by customer, but a typical setup is to take RAID
> groups of about 5-10TB each net from something like an EMC Clariion and
> slice each group into 1TB LUNs which become VMWare datastores, which are
> written simultaneously from multiple hosts. A mid-size Clariion would
> host perhaps 50-100 of these small LUNs, and a customer running a high
> performance environment might have Fibrechannel disks and RAID-10, but
> SATA and RAID-5/6 would also be normal, albeit with a substantial
> write-back cache (maybe 1GB, IIRC a current Clariion SP has 4GB total).
> Each file on the datastore corresponds to a virtual disk on a VM, and
> the datastore is formatted with VMFS (concurrent writer filesystem, uses
> SCSI locking to control access to block allocation and directory entries).
> The other type of VMWare datastore works at the filesystem layer -
> instead of a shared SAN with iSCSI / FC-AL, the VMware hosts are all
> pointed at a shared NFS server directory. NetApp is the popular back end
> for this configuration.
Hmmmm, tuning such workloads is usually tightly bound to the I/O layout.
What works great on one setup is going to fail miserably on another one.

Especially RAID-5/6 are well known to suck at write-intensive workloads.
The usual tuning advice in this case is "OMG, get rid of RAID-5/6!"

You really need to gather some data from each setup, see where's the
bottleneck and fix it. It might be in a different place for each setup.

You need to see "inside" the storage, not just the first level. One
fairly frequent mistake (and I've done that repeatedly) is the belief
that when iostat tell's you a device is 100% utilized it can't handle
more I/Os.

With a RAID array that's not true - what matters is how the individual
devices are used, not the virtual device on top of them. Consider for
example a RAID-1 with two drives. The array may report 100% utilization
but the devices are in fact 50% utilized because half of the requests is
handed to the first device, the other half to the second one.

> On top of this virtualization, I have PG laid out on two virtual disks -
> WAL and log files are on the main system partition, index and table data
> on a second partition. Both formatted with ext3fs.
One suggestion - try to increase the effective_io_concurrency. There're
some recommendations here

Use the number of drives as a starting point or experiment a bit.

And increase the checkpoint parameters as I recommended before. You may
even increase the checkpoint timeout - that may significantly lower the
amount of data that's written during checkpoints.

> One of my larger customers had his SAN guy complain to him that our app
> was writing more data to the NetApp it was on than every other app
> combined, so I am mindful of the volume being more than some of these
> systems were planned for
I'm not familiar with NetApp - AFAIK they use RAID-DP which is a somehow
improved version of RAID 4, that should perform better than RAID 6. But
in my experience these claims usually miss the "for most workloads" part.


In response to

pgsql-performance by date

Next:From: Heikki LinnakangasDate: 2012-01-25 08:58:16
Subject: Re: Cursor fetch performance issue
Previous:From: Dave CrookeDate: 2012-01-24 21:36:36
Subject: Re: Can lots of small writes badly hamper reads from other tables?

Privacy Policy | About PostgreSQL
Copyright © 1996-2017 The PostgreSQL Global Development Group