Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)

From: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>
To: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)
Date: 2014-03-04 03:35:23
Message-ID: 9A28C8860F777E439AA12E8AEA7694F8F837C1@BPXM15GP.gisp.nec.co.jp
Lists: pgsql-hackers

Thanks for your review.

According to the discussion in the Custom-Scan API thread, I moved
all the supplemental facilities (like bms_to/from_string) into the
main patch, so you no longer need to apply the ctidscan and
postgres_fdw patches for testing.
(I'll submit the revised one later.)

> 1. memcpy(dest, tuple, HEAPTUPLESIZE);
> + memcpy((char *)dest + HEAPTUPLESIZE,
>
> + tuple->t_data, tuple->t_len);
>
> For a normal tuple these two addresses are different, but in the case of
> ccache it is contiguous memory.
> Better to write a comment that even though it is contiguous memory, it
> is treated as separate.
>
OK, I added a source code comment as follows:

/*
 * Even though we put the body of HeapTupleHeaderData just after
 * HeapTupleData here, there is usually no guarantee that both data
 * structures are located at contiguous memory addresses.
 * So, we explicitly adjust tuple->t_data to point to the area just
 * behind itself, to reference a HeapTuple on the columnar cache
 * just like a regular one.
 */
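
For clarity, the surrounding copy logic looks roughly like this (a sketch
only; ccache_copy_tuple() and the destination handling are illustrative,
not the exact patch code):

static HeapTuple
ccache_copy_tuple(HeapTuple tuple, char *dest_addr)
{
    HeapTuple   dest = (HeapTuple) dest_addr;

    /* copy the fixed-size HeapTupleData, then the tuple body behind it */
    memcpy(dest, tuple, HEAPTUPLESIZE);
    memcpy(dest_addr + HEAPTUPLESIZE, tuple->t_data, tuple->t_len);

    /*
     * Re-point t_data at the copied body. Even though header and body
     * happen to be contiguous here, readers must treat them as separate
     * structures, as the comment above says.
     */
    dest->t_data = (HeapTupleHeader) (dest_addr + HEAPTUPLESIZE);

    return dest;
}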

> 2. + uint32 required = HEAPTUPLESIZE + MAXALIGN(tuple->t_len);
>
> t_len is already maxaligned, so there is no problem in applying MAXALIGN
> again, but the required-length calculation differs from function to
> function.
> For example, in the part of the same function below, the same t_len is
> used directly. It doesn't cause any problem, but it may be confusing.
>
I initially tried to trust that t_len is aligned, but an Assert() macro
told me that is not a valid assumption. See heap_compute_data_size(): it
computes the length of the tuple body and adjusts alignment according to
the "attalign" value in pg_attribute, which is not necessarily the same
as sizeof(Datum).
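
To make the point concrete, here is the allocation with an illustrative
worked example (the numbers are mine, not from the patch):

/*
 * Suppose the last column is a smallint: heap_compute_data_size() aligns
 * it per "attalign" (2 bytes), so tuple->t_len can be, say, 34, while
 * MAXALIGN(34) is 40 on a 64-bit build. Trusting t_len as-is would
 * under-reserve the chunk by 6 bytes.
 */
uint32      required = HEAPTUPLESIZE + MAXALIGN(tuple->t_len);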

> 3. + cchunk = ccache_vacuum_tuple(ccache, ccache->root_chunk, &ctid);
> + if (pchunk != NULL && pchunk != cchunk)
>
> + ccache_merge_chunk(ccache, pchunk);
>
> + pchunk = cchunk;
>
>
> ccache_merge_chunk() is called only when the heap tuples are spread
> across two cache chunks. Actually, one cache chunk can accommodate one
> or more heap pages, so it needs some other way of handling.
>
I adjusted the logic to merge chunks as follows:

Once a tuple is vacuumed out of a chunk, we also check whether the chunk
can be merged with its child leafs. A chunk has up to two child leafs;
the left one covers ctids smaller than the parent's, the right one covers
larger ctids. That means a chunk without a right child in the left
sub-tree, or a chunk without a left child in the right sub-tree, is a
neighbor of the chunk being vacuumed. In addition, if the vacuumed chunk
lacks one (or both) of its children, it can be merged with its parent
node. I modified ccache_vacuum_tuple() to merge chunks during the T-tree
walk-down, if the vacuumed chunk has enough free space.
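
In pseudo-code, the walk-down looks roughly like this (a sketch only;
ctid_compare(), chunk_is_leaf(), chunk_free_space(), chunk_usage(),
ccache_delete_tuple() and the field names are illustrative, not the
actual patch code):

static ccache_chunk *
ccache_vacuum_tuple(ccache_head *ccache, ccache_chunk *cchunk,
                    ItemPointer ctid)
{
    /* walk down the T-tree to the chunk whose ctid range covers the target */
    if (cchunk->left && ctid_compare(ctid, &cchunk->ip_min) < 0)
        return ccache_vacuum_tuple(ccache, cchunk->left, ctid);
    if (cchunk->right && ctid_compare(ctid, &cchunk->ip_max) > 0)
        return ccache_vacuum_tuple(ccache, cchunk->right, ctid);

    /* this chunk covers the ctid, so drop the tuple from it */
    ccache_delete_tuple(cchunk, ctid);

    /* if enough space was freed, absorb a neighboring child leaf */
    if (cchunk->left && chunk_is_leaf(cchunk->left) &&
        chunk_free_space(cchunk) >= chunk_usage(cchunk->left))
        ccache_merge_chunk(ccache, cchunk->left);
    if (cchunk->right && chunk_is_leaf(cchunk->right) &&
        chunk_free_space(cchunk) >= chunk_usage(cchunk->right))
        ccache_merge_chunk(ccache, cchunk->right);

    return cchunk;
}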

> 4. for (i=0; i < 20; i++)
>
> Better to replace this magic number with a meaningful macro.
>
On reflection, there is no good reason to construct multiple
ccache_entries at once, so I adjusted the code to create a new
ccache_entry on demand, to track each columnar cache being acquired.

> 5. "columner" is present in sgml file. correct it.
>
Sorry, fixed it.

> 6. "max_cached_attnum" value in the document saying as 128 by default but
> in the code it set as 256.
>
Sorry, fixed it.

Also, I ran a benchmark of cache_scan in the case where this module
performs most effectively.

The table t1 is declared as follows:
create table t1 (a int, b float, c float, d text, e date, f char(200));
Each record is roughly 256 bytes wide, and the table contains 4 million
records, so the total table size is almost 1GB.
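
For reference, a table of this shape can be filled with something like
the following (illustrative; the actual loading statement is not shown
in this mail):

insert into t1
    select i, random(), random(), md5(i::text), current_date, 'x'
      from generate_series(1, 4000000) i;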

* 1st trial - it takes longer than a sequential scan because of the
columnar-cache construction
postgres=# explain analyze select count(*) from t1 where a % 10 = 5;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=200791.62..200791.64 rows=1 width=0) (actual time=63105.036..63105.037 rows=1 loops=1)
-> Custom Scan (cache scan) on t1 (cost=0.00..200741.62 rows=20000 width=0) (actual time=7.397..62832.728 rows=400000 loops=1)
Filter: ((a % 10) = 5)
Rows Removed by Filter: 3600000
Planning time: 214.506 ms
Total runtime: 64629.296 ms
(6 rows)

* 2nd trial - it is much faster than a sequential scan because there is no
disk access
postgres=# explain analyze select count(*) from t1 where a % 10 = 5;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=67457.53..67457.54 rows=1 width=0) (actual time=7833.313..7833.313 rows=1 loops=1)
-> Custom Scan (cache scan) on t1 (cost=0.00..67407.53 rows=20000 width=0) (actual time=0.154..7615.914 rows=400000 loops=1)
Filter: ((a % 10) = 5)
Rows Removed by Filter: 3600000
Planning time: 1.019 ms
Total runtime: 7833.761 ms
(6 rows)

* 3rd trial - cache_scan is turned off, so the planner chooses the built-in
SeqScan.
postgres=# set cache_scan.enabled = off;
SET
postgres=# explain analyze select count(*) from t1 where a % 10 = 5;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Aggregate (cost=208199.08..208199.09 rows=1 width=0) (actual time=59700.810..59700.810 rows=1 loops=1)
-> Seq Scan on t1 (cost=0.00..208149.08 rows=20000 width=0) (actual time=715.489..59518.095 rows=400000 loops=1)
Filter: ((a % 10) = 5)
Rows Removed by Filter: 3600000
Planning time: 0.630 ms
Total runtime: 59701.104 ms
(6 rows)

The reason for such an extreme result: I adjusted the system page cache
usage to constrain the disk cache hit ratio at the operating-system
level, so the sequential scan is dominated by disk access performance in
this case. On the other hand, the columnar cache was able to hold all of
the records, because it omits caching of unreferenced columns; this query
touches only column "a", so each cached record needs little more than the
tuple header plus a 4-byte integer, instead of the full ~256 bytes.

* GUCs
shared_buffers = 512MB
shared_preload_libraries = 'cache_scan'
cache_scan.num_blocks = 400

[kaigai(at)iwashi backend]$ free -m
             total       used       free     shared    buffers     cached
Mem:          7986       7839        146          0          2        572
-/+ buffers/cache:       7265        721
Swap:         8079        265       7814

Please don't throw stones at me. :-)
The primary purpose of this extension is to demonstrate usage of the
custom-scan interface and heap_page_prune_hook().

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>

> -----Original Message-----
> From: Haribabu Kommi [mailto:kommi(dot)haribabu(at)gmail(dot)com]
> Sent: Monday, February 24, 2014 12:42 PM
> To: Kohei KaiGai
> Cc: Kaigai, Kouhei(海外, 浩平); Tom Lane; PgHacker; Robert Haas
> Subject: Re: contrib/cache_scan (Re: [HACKERS] What's needed for cache-only
> table scan?)
>
> On Fri, Feb 21, 2014 at 2:19 AM, Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp> wrote:
>
>
> Hello,
>
> The attached patch is a revised one for cache-only scan module
> on top of custom-scan interface. Please check it.
>
>
>
> Thanks for the revised patch. Please find some minor comments.
>
> 1. memcpy(dest, tuple, HEAPTUPLESIZE);
> + memcpy((char *)dest + HEAPTUPLESIZE,
>
> + tuple->t_data, tuple->t_len);
>
>
> For a normal tuple these two addresses are different, but in the case of
> ccache it is contiguous memory.
> Better to write a comment that even though it is contiguous memory, it
> is treated as separate.
>
> 2. + uint32 required = HEAPTUPLESIZE + MAXALIGN(tuple->t_len);
>
> t_len is already maxaligned, so there is no problem in applying MAXALIGN
> again, but the required-length calculation differs from function to
> function.
> For example, in the part of the same function below, the same t_len is
> used directly. It doesn't cause any problem, but it may be confusing.
>
> 3. + cchunk = ccache_vacuum_tuple(ccache, ccache->root_chunk, &ctid);
> + if (pchunk != NULL && pchunk != cchunk)
>
> + ccache_merge_chunk(ccache, pchunk);
>
> + pchunk = cchunk;
>
>
> ccache_merge_chunk() is called only when the heap tuples are spread
> across two cache chunks. Actually, one cache chunk can accommodate one
> or more heap pages, so it needs some other way of handling.
>
> 4. for (i=0; i < 20; i++)
>
> Better to replace this magic number with a meaningful macro.
>
> 5. "columner" is present in sgml file. correct it.
>
> 6. "max_cached_attnum" value in the document saying as 128 by default but
> in the code it set as 256.
>
> I will start regression and performance tests, and will let you know once
> I finish.
>
>
> Regards,
> Hari Babu
>
> Fujitsu Australia

Attachment Content-Type Size
pgsql-v9.4-custom-scan.part-4.v9.patch application/octet-stream 89.6 KB
