Re: Are there plans to add data compression feature to postgresql?

From: 小波 顾 <guxiaobo1982(at)hotmail(dot)com>
To: Grzegorz Jaśkiewicz <gryzman(at)gmail(dot)com>
Cc: <chris(dot)ellis(at)shropshire(dot)gov(dot)uk>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Are there plans to add data compression feature to postgresql?
Date: 2008-10-29 16:09:10
Message-ID: BLU139-W448D1B9C083795A053BDF4CD260@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Data Compression
The new data compression feature in SQL Server 2008 reduces the size of tables, indexes or a subset of their partitions by storing fixed-length data types in variable length storage format and by reducing the redundant data. The space savings achieved depends on the schema and the data distribution. Based on our testing with various data warehouse databases, we have seen a reduction in the size of real user databases up to 87% (a 7 to 1 compression ratio) but more commonly you should expect a reduction in the range of 50-70% (a compression ratio between roughly 2 to 1 and 3 to 1).
SQL Server provides two types of compression as follows:
· ROW compression enables storing fixed length types in variable length storage format. So for example, if you have a column of data type BIGINT which takes 8 bytes of storage in fixed format, when compressed it takes a variable number of bytes—anywhere from 0 bytes to up to 8 bytes. Since column values are stored as variable length, an additional 4‑bit length code is stored for each field within the row. Additionally, zero and NULL values don’t take any storage except for the 4‑bit code.
· PAGE compression is built on top of ROW compression. It minimizes storage of redundant data on the page by storing commonly occurring byte patterns on the page once and then referencing these values for respective columns. The byte pattern recognition is type-independent. Under PAGE compression, SQL Server optimizes space on a page using two techniques.
The first technique is column prefix. In this case, the system looks for a common byte pattern as a prefix for all values of a specific column across rows on the page. This process is repeated for all the columns in the table or index. The column prefix values that are computed are stored as an anchor record on the page and the data or index rows refer to the anchor record for the common prefix, if available, for each column.
The second technique is page level dictionary. This dictionary stores common values across columns and rows and stores them in a dictionary. The columns are then modified to refer to the dictionary entry.
Compression comes with additional CPU cost. This overhead is paid when you query or execute DML operations on compressed data. The relative CPU overhead with ROW is less than for PAGE, but PAGE compression can provide better compression. Since there are many kinds of workloads and data patterns, SQL Server exposes compression granularity at a partition level. You can choose to compress the whole table or index or a subset of partitions. For example, in a DW workload, if CPU is the dominant cost in your workload but you want to save some disk space, you may want to enable PAGE compression on partitions that are not accessed frequently while not compressing the current partition(s) that are accessed and manipulated more frequently. This reduces the total CPU cost, at a small increase in disk space requirements. If I/O cost is dominant for your workload, or you need to reduce disk space costs, compressing all data using PAGE compression may be the best choice. Compression can give many-fold speedups if it causes your working set of frequently touched pages to be cached in the main memory buffer pool, when it does not otherwise fit in memory. Preliminary performance results on one large-scale internal DW query performance benchmark used to test SQL Server 2008 show a 58% disk savings, an average 15% reduction in query runtime, and an average 20% increase in CPU cost. Some queries speeded up by a factor of up to seven. Your results depend on your workload, database, and hardware.
The commands to compress data are exposed as options in CREATE/ALTER DDL statements and support both ONLINE and OFFLINE mode. Additionally, a stored procedure is provided to help you estimate the space savings prior to actual compression.
Backup Compression
Backup compression helps you to save in multiple ways.
By reducing the size of your SQL backups, you save significantly on disk media for your SQL backups. While all compression results depend on the nature of the data being compressed, results of 50% are not uncommon, and greater compression is possible. This enables you to use less storage for keeping your backups online, or to keep more cycles of backups online using the same storage.
Backup compression also saves you time. Traditional SQL backups are almost entirely limited by I/O performance. By reducing the I/O load of the backup process, we actually speed up both backups and restores.
Of course, nothing is entirely free, and this reduction in space and time come at the expense of using CPU cycles. The good news here is that the savings in I/O time offsets the increased use of CPU time, and you can control how much CPU is used by your backups at the expense of the rest of your workload by taking advantage of the Resource Governor.

URL:http://msdn.microsoft.com/en-us/library/cc278097.aspx

Date: Wed, 29 Oct 2008 15:35:44 +0000From: gryzman(at)gmail(dot)comTo: guxiaobo1982(at)hotmail(dot)comSubject: Re: [GENERAL] Are there plans to add data compression feature to postgresql?CC: chris(dot)ellis(at)shropshire(dot)gov(dot)uk; pgsql-general(at)postgresql(dot)org
2008/10/29 小波 顾 <guxiaobo1982(at)hotmail(dot)com>

1. Little integers of types take 8 bytes in the past now only take 4 or 2 bytes if there are not so large.
So what actually happen if I have a table with few mills of values that fit in 2 bytes, but all of the sudent I am going to add another column with something that requires 8 bytes ? update on all columns ? I am actually even against varchars in my databases, so something like that sounds at least creepy.

-- GJ
_________________________________________________________________
Invite your mail contacts to join your friends list with Windows Live Spaces. It's easy!
http://spaces.live.com/spacesapi.aspx?wx_action=create&wx_url=/friends.aspx&mkt=en-us

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Alvaro Herrera 2008-10-29 16:09:37 Re: psql screen size
Previous Message Tom Lane 2008-10-29 16:05:32 Re: psql screen size