Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)

From: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>
To: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)
Date: 2014-03-04 04:07:11
Message-ID: 9A28C8860F777E439AA12E8AEA7694F8F8382D@BPXM15GP.gisp.nec.co.jp
Lists: pgsql-hackers

Sorry, the previous one still has "columner" in the sgml files.
Please see the attached one instead.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>

> -----Original Message-----
> From: pgsql-hackers-owner(at)postgresql(dot)org
> [mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of Kouhei Kaigai
> Sent: Tuesday, March 04, 2014 12:35 PM
> To: Haribabu Kommi; Kohei KaiGai
> Cc: Tom Lane; PgHacker; Robert Haas
> Subject: Re: contrib/cache_scan (Re: [HACKERS] What's needed for cache-only
> table scan?)
>
> Thanks for your review.
>
> According to the discussion in the Custom-Scan API thread, I moved all the
> supplemental facilities (like bms_to/from_string) into the main patch, so
> you don't need to apply the ctidscan and postgres_fdw patches for testing
> anymore. (I'll submit the revised one later.)
>
> > 1. memcpy(dest, tuple, HEAPTUPLESIZE);
> > + memcpy((char *)dest + HEAPTUPLESIZE,
> > +        tuple->t_data, tuple->t_len);
> >
> > For a normal tuple these two addresses are different, but in the case of
> > ccache it is one contiguous memory block. Better to add a comment noting
> > that even though the memory is contiguous, the two parts are still
> > treated as separate.
> >
> >
> OK, I added a source code comment as follows:
>
> /*
> * Even though we usually place the body of HeapTupleHeaderData just
> * after HeapTupleData, there is no guarantee that both structures are
> * located at contiguous memory addresses. So, we explicitly adjust
> * tuple->t_data to point to the area just behind the HeapTupleData
> * itself, so a HeapTuple on the columnar cache can be referenced just
> * like a regular one.
> */
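>
> As a minimal sketch of that copy-and-adjust step (variable names are
> illustrative, not the exact patch code; "dest" is assumed to point at
> HEAPTUPLESIZE + MAXALIGN(tuple->t_len) bytes of ccache memory):
>
>     HeapTuple   dtup = (HeapTuple) dest;
>
>     /* copy the fixed-size HeapTupleData header */
>     memcpy(dtup, tuple, HEAPTUPLESIZE);
>     /* copy the variable-length tuple body just behind it */
>     memcpy((char *) dtup + HEAPTUPLESIZE, tuple->t_data, tuple->t_len);
>     /* re-point t_data at the copied body rather than the source tuple */
>     dtup->t_data = (HeapTupleHeader) ((char *) dtup + HEAPTUPLESIZE);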
>
> > 2. + uint32 required = HEAPTUPLESIZE + MAXALIGN(tuple->t_len);
> >
> > t_len is already maxaligned, so there is no problem applying MAXALIGN()
> > again, but the required-length calculation differs from function to
> > function. For example, later in the same function the same t_len is used
> > directly. It doesn't cause a problem, but it may cause some confusion.
> >
> >
> Once I tried to trust that t_len is already aligned, but an Assert() macro
> told me that is not a safe assumption. See heap_compute_data_size(): it
> computes the length of the tuple body and adjusts alignment according to
> the "attalign" value in pg_attribute, which is not necessarily the same as
> sizeof(Datum).
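>
> For reference, MAXALIGN() rounds a length up to the platform's maximum
> alignment, so re-applying it to an already-aligned t_len is harmless,
> while applying it to an unaligned one is what keeps the next tuple on a
> safe boundary. The definitions in c.h look essentially like:
>
>     #define TYPEALIGN(ALIGNVAL,LEN)  \
>         (((uintptr_t) (LEN) + ((ALIGNVAL) - 1)) & \
>          ~((uintptr_t) ((ALIGNVAL) - 1)))
>     #define MAXALIGN(LEN)   TYPEALIGN(MAXIMUM_ALIGNOF, (LEN))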
>
> > 3. + cchunk = ccache_vacuum_tuple(ccache, ccache->root_chunk, &ctid);
> > + if (pchunk != NULL && pchunk != cchunk)
> > +     ccache_merge_chunk(ccache, pchunk);
> > + pchunk = cchunk;
> >
> > ccache_merge_chunk() is called only when the heap tuples are spread
> > across two cache chunks. Actually, one cache chunk can accommodate one
> > or more heap pages, so this needs some other way of handling.
> >
> >
> I adjusted the logic to merge the chunks as follows:
>
> Once a tuple is vacuumed from a chunk, we also check whether the chunk
> can be merged with its child leafs. A chunk has up to two child leafs;
> the left one holds smaller ctids than its parent, the right one greater
> ctids. It means a chunk without a right child in the left subtree, or a
> chunk without a left child in the right subtree, is a neighbor of the
> chunk being vacuumed. In addition, if the vacuumed chunk lacks either
> (or both) of its children, it can be merged with its parent node.
> I modified ccache_vacuum_tuple() to merge chunks during the T-tree
> walk-down, if the vacuumed chunk has enough free space.
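>
> As a self-contained toy of that "merge when enough free space opens up"
> rule (types and names invented for illustration; the real ccache chunks
> hold heap tuples keyed by ctid and track T-tree bounds, which this toy
> omits, and its items are unsorted):
>
>     #include <stdlib.h>
>
>     #define CHUNK_CAPACITY 64
>
>     typedef struct Chunk
>     {
>         struct Chunk   *left;       /* subtree of smaller keys */
>         struct Chunk   *right;      /* subtree of larger keys */
>         int             nitems;     /* number of cached items */
>         long            items[CHUNK_CAPACITY];
>     } Chunk;
>
>     /* Fold a leaf child into its parent if the combined items fit. */
>     static void
>     merge_child(Chunk *parent, Chunk **childp)
>     {
>         Chunk  *child = *childp;
>
>         if (child == NULL || child->left != NULL || child->right != NULL)
>             return;                 /* only merge leaf children */
>         if (parent->nitems + child->nitems > CHUNK_CAPACITY)
>             return;                 /* not enough free space yet */
>         for (int i = 0; i < child->nitems; i++)
>             parent->items[parent->nitems++] = child->items[i];
>         *childp = NULL;
>         free(child);
>     }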
>
> > 4. for (i=0; i < 20; i++)
> >
> > Better to replace this magic number with a meaningful macro.
> >
> On reflection, there is no good reason to construct multiple
> ccache_entries at once, so I adjusted the code to create a new
> ccache_entry on demand, to track a columnar cache being acquired.
>
> > 5. "columner" is present in the sgml file; correct it.
> >
> Sorry, fixed it.
>
> > 6. The document says the default value of "max_cached_attnum" is 128,
> > but the code sets it to 256.
> >
> Sorry, fixed it.
>
>
> Also, I ran a benchmark of cache_scan in the case where this module
> performs most effectively.
>
> The table t1 is declared as follows:
> create table t1 (a int, b float, c float, d text, e date, f char(200));
> Each record is roughly 256 bytes wide and the table contains 4 million
> records, so the total table size is almost 1GB.
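>
> For reproducibility, one plausible way to load such a data set (my own
> sketch; the original loading statement is not shown in this mail):
>
>     insert into t1
>         select i, random(), random(), md5(i::text),
>                current_date, 'filler'
>           from generate_series(1, 4000000) i;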
>
> * 1st trial - it takes longer than a sequential scan because of the
> columnar-cache construction.
> postgres=# explain analyze select count(*) from t1 where a % 10 = 5;
>                              QUERY PLAN
> ----------------------------------------------------------------------
>  Aggregate  (cost=200791.62..200791.64 rows=1 width=0) (actual time=63105.036..63105.037 rows=1 loops=1)
>    ->  Custom Scan (cache scan) on t1  (cost=0.00..200741.62 rows=20000 width=0) (actual time=7.397..62832.728 rows=400000 loops=1)
>          Filter: ((a % 10) = 5)
>          Rows Removed by Filter: 3600000
>  Planning time: 214.506 ms
>  Total runtime: 64629.296 ms
> (6 rows)
>
> * 2nd trial - it runs much faster than a sequential scan because there
> is no disk access.
> postgres=# explain analyze select count(*) from t1 where a % 10 = 5;
>                              QUERY PLAN
> ----------------------------------------------------------------------
>  Aggregate  (cost=67457.53..67457.54 rows=1 width=0) (actual time=7833.313..7833.313 rows=1 loops=1)
>    ->  Custom Scan (cache scan) on t1  (cost=0.00..67407.53 rows=20000 width=0) (actual time=0.154..7615.914 rows=400000 loops=1)
>          Filter: ((a % 10) = 5)
>          Rows Removed by Filter: 3600000
>  Planning time: 1.019 ms
>  Total runtime: 7833.761 ms
> (6 rows)
>
> * 3rd trial - turn off cache_scan, so the planner chooses the built-in
> SeqScan.
> postgres=# set cache_scan.enabled = off;
> SET
> postgres=# explain analyze select count(*) from t1 where a % 10 = 5;
>                              QUERY PLAN
> ----------------------------------------------------------------------
>  Aggregate  (cost=208199.08..208199.09 rows=1 width=0) (actual time=59700.810..59700.810 rows=1 loops=1)
>    ->  Seq Scan on t1  (cost=0.00..208149.08 rows=20000 width=0) (actual time=715.489..59518.095 rows=400000 loops=1)
>          Filter: ((a % 10) = 5)
>          Rows Removed by Filter: 3600000
>  Planning time: 0.630 ms
>  Total runtime: 59701.104 ms
> (6 rows)
>
> The reason for such an extreme difference: I constrained the system page
> cache usage at the operating system level, so the sequential scan is
> dominated by disk access performance in this case. On the other hand,
> the columnar cache was able to hold all of the records, because it omits
> caching of unreferenced columns.
>
> * GUCs
> shared_buffers = 512MB
> shared_preload_libraries = 'cache_scan'
> cache_scan.num_blocks = 400
>
> [kaigai(at)iwashi backend]$ free -m
>              total       used       free     shared    buffers     cached
> Mem:          7986       7839        146          0          2        572
> -/+ buffers/cache:       7265        721
> Swap:         8079        265       7814
>
>
> Please don't throw me stones. :-)
> The primary purpose of this extension is to demonstrate usage of the
> custom-scan interface and heap_page_prune_hook().
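>
> For anyone following along, an extension typically installs such a hook
> from _PG_init() like the sketch below (hypothetical names; the hook type
> and its signature come from this patch, not from core PostgreSQL):
>
>     static heap_page_prune_hook_type prev_prune_hook = NULL;
>
>     void
>     _PG_init(void)
>     {
>         /* save any previously installed hook, then chain ours in */
>         prev_prune_hook = heap_page_prune_hook;
>         heap_page_prune_hook = ccache_on_page_prune;
>     }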
>
> Thanks,
> --
> NEC OSS Promotion Center / PG-Strom Project KaiGai Kohei
> <kaigai(at)ak(dot)jp(dot)nec(dot)com>
>
>
> > -----Original Message-----
> > From: Haribabu Kommi [mailto:kommi(dot)haribabu(at)gmail(dot)com]
> > Sent: Monday, February 24, 2014 12:42 PM
> > To: Kohei KaiGai
> > Cc: Kaigai, Kouhei(海外, 浩平); Tom Lane; PgHacker; Robert Haas
> > Subject: Re: contrib/cache_scan (Re: [HACKERS] What's needed for
> > cache-only table scan?)
> >
> > On Fri, Feb 21, 2014 at 2:19 AM, Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp> wrote:
> >
> >
> > Hello,
> >
> > The attached patch is a revised one for cache-only scan module
> > on top of custom-scan interface. Please check it.
> >
> >
> >
> > Thanks for the revised patch. Please find some minor comments.
> >
> > 1. memcpy(dest, tuple, HEAPTUPLESIZE);
> > + memcpy((char *)dest + HEAPTUPLESIZE,
> > +        tuple->t_data, tuple->t_len);
> >
> > For a normal tuple these two addresses are different, but in the case of
> > ccache it is one contiguous memory block. Better to add a comment noting
> > that even though the memory is contiguous, the two parts are still
> > treated as separate.
> >
> >
> > 2. + uint32 required = HEAPTUPLESIZE + MAXALIGN(tuple->t_len);
> >
> > t_len is already maxaligned, so there is no problem applying MAXALIGN()
> > again, but the required-length calculation differs from function to
> > function. For example, later in the same function the same t_len is used
> > directly. It doesn't cause a problem, but it may cause some confusion.
> >
> >
> > 3. + cchunk = ccache_vacuum_tuple(ccache, ccache->root_chunk, &ctid);
> > + if (pchunk != NULL && pchunk != cchunk)
> > +     ccache_merge_chunk(ccache, pchunk);
> > + pchunk = cchunk;
> >
> > ccache_merge_chunk() is called only when the heap tuples are spread
> > across two cache chunks. Actually, one cache chunk can accommodate one
> > or more heap pages, so this needs some other way of handling.
> >
> >
> > 4. for (i=0; i < 20; i++)
> >
> > Better to replace this magic number with a meaningful macro.
> >
> > 5. "columner" is present in the sgml file; correct it.
> >
> > 6. The document says the default value of "max_cached_attnum" is 128,
> > but the code sets it to 256.
> >
> > I will start regression and performance tests, and I will let you know
> > once I finish.
> >
> >
> > Regards,
> > Hari Babu
> >
> > Fujitsu Australia

Attachment Content-Type Size
pgsql-v9.4-custom-scan.part-4.v9.patch application/octet-stream 89.6 KB
