[Proposal] Page Compression for OLTP

From: chenhj <chjischj(at)163(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: [Proposal] Page Compression for OLTP
Date: 2020-05-21 06:36:40
Message-ID: 272dd2d9.e52a.17235f2c050.Coremail.chjischj@163.com
Lists: pgsql-hackers

Hello hackers!

This email proposes implementing OLTP-oriented page compression in PostgreSQL.

There are currently some alternative ways to achieve compression, such as using a file system that supports compression.
However, that approach depends on a specific file system, is not suitable for all deployment environments, and increases the complexity of deployment and maintenance.

I hope this compression work can meet the following goals:
1. In most scenarios, the compressed size of a table is below 50% of the original table.
2. It is mainly oriented to OLTP scenarios; the performance impact on workloads with frequent reads and writes is relatively small.
3. It does not rely on special external software or hardware that is difficult to obtain.
4. It is friendly to application developers and database administrators.
5. It requires only small changes to PostgreSQL.

I have noticed previous discussion and work related to compression, but none of it meets the above goals.
For example:
1. Use sparse files [1]
This is also how MySQL 5.7 implements transparent page compression. However, sparse files may generate a lot of fragmentation inside the file system, and the "compressed" data files will be restored to their original size after physical backup and restore, unless the backup and recovery tools also support sparse files.
2. Use the table access method (TAM) interface (pg_cryogen, zedstore) [2][3]
Neither storage engine is geared toward OLTP scenarios. It is best to make relatively small modifications to the existing code of the heap engine (mainly by modifying md.c and fd.c).

The methods proposed by Postgres Pro Enterprise CFS [4] and Nikolay P [5] are close to what I need.
However, I would like to propose a slightly different implementation that does not require space reclamation. I hope to get your suggestions.

# Premise assumptions
1. Most pages in a compressed table can be compressed to less than 50% of their original size.
As long as an algorithm with a relatively high compression ratio (such as zlib or zstd) is used, this point should be easy to meet, unless the table stores already-compressed data such as images.
2. The compression ratios of most pages in the same table are relatively close.

# Page storage

Each segment of the main fork is given three files for storing compressed data. The main fork segment file (for example: 123456.2) still exists, but its size is 0.

- Compressed data file (for example: 123456.2.cd)
Stores the compressed pages. The block size of this file is configurable at the table level, but it can only be 1/2, 1/4, or 1/8 of BLOCKSZ.
- Compress overflow address file (for example: 123456.2.coa)
When a page cannot be compressed to the compressed block size or smaller, this file stores the addresses of the page's overflow blocks.
- Compress overflow data file (for example: 123456.2.cod)
When a page cannot be compressed to the compressed block size or smaller, this file stores the overflow blocks themselves.
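
As a minimal sketch of the addressing this layout implies (the helper and the cblksz parameter are my naming, not existing PostgreSQL code): every page owns exactly one fixed slot in the compressed data file, so locating it is a plain multiplication, with no lookup table needed.

#include <stdint.h>

/*
 * Hypothetical sketch: byte offset of block blkno inside the
 * compressed data file (e.g. 123456.2.cd).  Every page owns exactly
 * one cblksz-sized slot (cblksz = BLOCKSZ/2, /4 or /8), so addressing
 * is direct; only overflow data needs the .coa lookup described below.
 */
static inline uint64_t
cd_file_offset(uint32_t blkno, uint32_t cblksz)
{
    return (uint64_t) blkno * cblksz;
}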

The following is an example when the compressed block size is 4K, which is 1/2 of BLOCKSZ.

## Scenario 1: The compressed size of the original page (including the compressed page header) is less than or equal to the compressed block size (4KB)

Compressed data file (123456.2.cd):

    0       1       2
+=======+=======+=======+
| data0 | data1 | data2 |
+=======+=======+=======+
->|  4K   |<-

## Scenario 2: The compressed size of the original page (including the compressed page header) is larger than the compressed block size (4KB)

If the compressed size of the original page (page 3 below) is greater than 4KB, the page is stored uncompressed. The first 4KB of the original page is stored in the compressed data file, and the remaining 4KB is stored in the compress overflow data file.

Compressed data file (123456.2.cd):

    0       1       2        3
+=======+=======+=======+=========+
| data0 | data1 | data2 | data3_1 |
+=======+=======+=======+=========+
                      ->| 1st 4K  |<-

Compress overflow address file (123456.2.coa):
For each block, the compress overflow address file stores the block number of the compress overflow block assigned to it, plus 1 (so 0 means no overflow block is assigned).
The compressed block size and the number of allocated blocks in the compress overflow data file are stored in the header of the compress overflow address file.

            0       1       2       3
+=======+=======+=======+=======+=======+
| head  |       |       |       |   1   |
+=======+=======+=======+=======+===|===+
                                    |
                                    |
Compress overflow data file:        |
     _______________________________|
    |
  0 |          1         2          3
+===|=====+=========+==========+=========+
| data3_2 |         |          |         |
+=========+=========+==========+=========+
->| 2nd 4K  |<-
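
To make the read path concrete, here is a hedged sketch of reassembling page 3 above in this 4K layout (all function names, the .coa header size, and the reduced error handling are my assumptions; is_compressed_page is sketched in a later section; with 1/4 or 1/8 block sizes there can be several overflow entries per page, which this sketch does not handle):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCKSZ         8192
#define COA_HEADER_SIZE 16   /* assumed size of the .coa file header */

/* assumed helpers, defined elsewhere in this sketch's world */
extern bool is_compressed_page(const char *buf);
extern void decompress_page(const char *src, char *dst);

static bool
read_page(int cd_fd, int coa_fd, int cod_fd,
          uint32_t blkno, uint32_t cblksz,
          char *page)                   /* BLOCKSZ-byte output buffer */
{
    char     buf[BLOCKSZ];
    uint32_t coa_entry;

    /* 1st IO: the page's fixed slot in the .cd file */
    if (pread(cd_fd, buf, cblksz, (off_t) blkno * cblksz) != (ssize_t) cblksz)
        return false;

    if (is_compressed_page(buf))
    {
        decompress_page(buf, page);     /* Scenario 1: whole page in one slot */
        return true;
    }

    /* Scenario 2: stored raw; the slot holds the first cblksz bytes */
    memcpy(page, buf, cblksz);

    /* 2nd IO: .coa lookup; entry is the overflow block number + 1,
     * so 0 means "no overflow block assigned" */
    if (pread(coa_fd, &coa_entry, sizeof(coa_entry),
              COA_HEADER_SIZE + (off_t) blkno * sizeof(coa_entry))
            != (ssize_t) sizeof(coa_entry) || coa_entry == 0)
        return false;

    /* 3rd IO: the remaining bytes from the .cod file */
    return pread(cod_fd, page + cblksz, BLOCKSZ - cblksz,
                 (off_t) (coa_entry - 1) * cblksz)
           == (ssize_t) (BLOCKSZ - cblksz);
}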

If the compressed block size is 1/4 or 1/8 of BLOCKSZ, a block that fails to compress may require multiple compress overflow blocks.
The following is an example when the compressed block size is 2K, which is 1/4 of BLOCKSZ.

## Scenario 3: The compressed size of the original page (including the compressed page header) is larger than 2KB (the compressed block size) but less than 6KB (BLOCKSZ - compressed block size)

In this case, the data files store compressed data, and at least 2KB of storage space is saved.

Compressed data file (123456.2.cd):

    0       1       2        3
+=======+=======+=======+=========+
| data0 | data1 | data2 | data3_1 |
+=======+=======+=======+=========+
                      ->| 1st 2K  |<-

Compress overflow address file (123456.2.coa):

            0       1       2       3
+=======+=======+=======+=======+=======+
| head  |       |       |       |  1,2  |
+=======+=======+=======+=======+===|===+
                                    |
                                    |
Compress overflow data file:        |
     _______________________________|
    |
  0 |          1         2          3
+===|=====+=========+==========+=========+
| data3_2 | data3_3 |          |         |
+=========+=========+==========+=========+
| 2nd 2K  | 3rd 2K  |          |         |

## Scenario 4: The compressed size of the original page (including the compressed page header) is larger than 6KB (BLOCKSZ - compressed block size)

In this case, the data files store the original data (8KB), the same as in Scenario 2.

Compressed data file (123456.2.cd):

    0       1       2        3
+=======+=======+=======+=========+
| data0 | data1 | data2 | data3_1 |
+=======+=======+=======+=========+
                      ->| 1st 2K  |<-

Compress overflow address file (123456.2.coa):

            0       1       2       3
+=======+=======+=======+=======+=======+
| head  |       |       |       | 1,2,3 |
+=======+=======+=======+=======+===|===+
                                    |
                                    |
Compress overflow data file:        |
     _______________________________|
    |
  0 |          1         2          3
+===|=====+=========+==========+=========+
| data3_2 | data3_3 | data3_4  |         |
+=========+=========+==========+=========+
| 2nd 2K  | 3rd 2K  | 4th 2K   |         |
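
Putting the four scenarios together, the total number of cblksz-sized slots a page occupies (one slot in the .cd file, the rest in the .cod file) can be computed as in the sketch below. The function name is mine; compressed_size includes the compressed page header. Note that a page whose compressed size exceeds BLOCKSZ - cblksz would need just as many slots compressed as raw, so storing it raw costs nothing extra and avoids decompression work.

#include <stdint.h>

#define BLOCKSZ 8192

static uint32_t
slots_needed(uint32_t compressed_size, uint32_t cblksz)
{
    if (compressed_size <= cblksz)
        return 1;                   /* Scenario 1: fits in one slot */
    if (compressed_size > BLOCKSZ - cblksz)
        return BLOCKSZ / cblksz;    /* Scenarios 2 and 4: stored raw */
    /* Scenario 3: stored compressed, rounded up to whole slots */
    return (compressed_size + cblksz - 1) / cblksz;
}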

# How to distinguish between compressed and uncompressed blocks in the compressed data file?

PostgreSQL heap pages have a uniform header. At first, I considered adding compression-related flags to that header.
However, there is a problem: whenever a page's compressed size changes so that it switches from the compressed format to the uncompressed format, or vice versa,
the header of the original page must be modified, which requires not only recalculating the checksum but also updating the buffer.

However, I noticed that the first 4 bytes of each page hold the high part of the page LSN (pd_lsn).
Therefore, an oversized LSN value that cannot occur in a real environment can serve as the magic number marking compressed pages.
The envisaged compressed page header is defined as follows:

typedef struct
{
    uint32 magic;     /* compressed page magic number, must be 0xfffffffe */
    uint8  algorithm; /* 1=pglz, 2=zstd ... */
    uint8  flag;      /* reserved */
    uint16 size;      /* size after compression */
} PageCompressHead;
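
A minimal sketch of the corresponding check (the helper name is mine, not actual PostgreSQL code); because no real page can carry this value in the high bits of its LSN, a plain 4-byte comparison is enough:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PAGE_COMPRESS_MAGIC 0xfffffffe

/*
 * Sketch: a slot read from the .cd file holds a compressed page iff
 * its first 4 bytes equal the magic; an uncompressed page has the
 * high part of its LSN there, which never reaches 0xfffffffe.
 */
static inline bool
is_compressed_page(const char *buf)
{
    uint32_t magic;

    memcpy(&magic, buf, sizeof(magic)); /* avoid unaligned reads */
    return magic == PAGE_COMPRESS_MAGIC;
}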

# How to manage block space in compress overflow data files?

Once overflow block x in the compress overflow data file has been allocated to block a, it always belongs to block a, even if block a later compresses to a smaller size and overflow block x is no longer used.

This approach simplifies the space management of compress overflow blocks, but fragmentation may occur, especially when the compressed block size is 1/4 or 1/8 of BLOCKSZ.
However, fragmentation only appears in workloads where the compressed size of the same block changes greatly and frequently.

Consider the following situation. If only one record has been inserted into a page when it is written to disk, the compression ratio will be very high, and only one compressed block is required.
As new rows are written later, the number of required compressed blocks grows to 2, 3, 4, ... These overflow blocks are not allocated all at once, so they are likely not sequential in the compress overflow data file, resulting in more fragmentation.

We can avoid this problem by setting a table-level number of pre-allocated compressed blocks.
When the number of compressed blocks required by a compressed page is less than this value, space is allocated according to the pre-allocated count, as sketched below.
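
A sketch of that rule, with prealloc_blocks standing in for the hypothetical table-level setting:

#include <stdint.h>

#define BLOCKSZ 8192

/*
 * Hypothetical sketch: round a page's slot allocation up to the
 * table-level pre-allocation count so that overflow slots for a page
 * that is still growing are obtained contiguously on first write.
 */
static uint32_t
slots_to_allocate(uint32_t needed, uint32_t prealloc_blocks, uint32_t cblksz)
{
    uint32_t max_slots = BLOCKSZ / cblksz;  /* a raw page never needs more */
    uint32_t n = needed > prealloc_blocks ? needed : prealloc_blocks;

    return n < max_slots ? n : max_slots;
}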

And no matter how severe the fragmentation becomes, the total space occupied by a compressed table cannot exceed that of the original uncompressed table.

# Impact on other parts of PostgreSQL
1. pg_basebackup / pg_checksums needs to handle checksum verification according to the new compression format.
2. pg_rewind needs to handle the copying of data blocks according to the new compression format.

# Problems
This solution simplifies overflow block space management, but it also has the following defects:
1. The space saved by compression is limited by the compressed block size.
For example, when the compressed block size is set to 4KB, at most 50% of the space can be saved.
For insert-only tables, the compressed block size can be set to BLOCKSZ / 8 to alleviate this problem; but for scenarios with frequent writes, such a setting easily generates fragmentation and increases the number of IOs.
2. Accessing a page that can be compressed into one compressed block requires only one IO, but accessing a page that cannot requires multiple IOs.
Generally it is 3 IOs; since the address file is very small and should almost always stay in memory, it is 2 IOs if the address file read is not counted.

I think the above issues are a necessary trade-off. Any suggestions are welcome.

# References
[1] https://www.postgresql.org/message-id/flat/op.ux8if71gcigqcu%40soyouz
[2] https://www.postgresql.org/message-id/CAONYFtNDNghfatnrhOJMOT=BXNbEiobHFABt2sx_cn2=5t=1_Q@mail.gmail.com
[3] https://www.postgresql.org/message-id/CALfoeiuF-m5jg51mJUPm5GN8u396o5sA2AF5N97vTRAEDYac7w%40mail.gmail.com
[4] https://postgrespro.com/docs/enterprise/9.6/cfs
[5] https://www.postgresql.org/message-id/flat/11996861554042351%40iva4-dd95b404a60b.qloud-c.yandex.net

Best Regards
Chen Huajun
