Re: Improve compression speeds in pg_lzcompress.c

From: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, Takeshi Yamamuro <yamamuro(dot)takeshi(at)lab(dot)ntt(dot)co(dot)jp>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improve compression speeds in pg_lzcompress.c
Date: 2013-01-09 07:56:12
Message-ID: CADbMkNPrKe2P7Oku=2sNGyLrd8+wQad_YBpvJtmJBtV17Tmf4A@mail.gmail.com
Lists: pgsql-hackers
> Personally, my biggest gripe about the way we do compression is that
> it's easy to detoast the same object lots of times.  More generally,
> our in-memory representation of user data values is pretty much a
> mirror of our on-disk representation, even when that leads to excess
> conversions.  Beyond what we do for TOAST, there's stuff like numeric
> where we not only detoast but then post-process the results into yet
> another internal form before performing any calculations - and then of
> course we have to convert back before returning from the calculation
> functions.  And for things like XML, JSON, and hstore we have to
> repeatedly parse the string, every time someone wants to do anything
> to it.  Of course, solving this is a very hard problem, and not
> solving it isn't a reason not to have more compression options - but
> more compression options will not solve the problems that I personally
> have in this area, by and large.
>

At the risk of saying something totally obvious and stupid, as I haven't
looked at the actual representation: this sounds like a memoisation
problem.  In OCaml terms:

type 'a rep =
  | On_disk_rep of bytes
  | In_memory_rep of 'a

type 'a t = 'a rep ref

let get_mem_rep t converter =
  match !t with
  | On_disk_rep seq ->
    let res = converter seq in
    t := In_memory_rep res;
    res
  | In_memory_rep x -> x
;;

(If you need the other direction, that is straightforward too.)

Translating this into C is relatively straightforward if you have the
luxury of a fresh start and don't have to be super efficient:

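#include <stddef.h>

typedef enum { ON_DISK_REP, IN_MEMORY_REP } rep_kind_t;

/* Tagged union: rep_kind says which member of rep is valid. */
typedef struct {
  rep_kind_t rep_kind;
  union {
    char *on_disk;
    void *in_memory;
  } rep;
} value_t;

/* Convert on first access, cache the result, and return it thereafter. */
void *get_mem_rep(value_t *v, void *(*converter)(char *)) {
  void *res;
  switch (v->rep_kind) {
     case ON_DISK_REP:
        res = converter(v->rep.on_disk);
        v->rep.in_memory = res;
        v->rep_kind = IN_MEMORY_REP;
        return res;
     case IN_MEMORY_REP:
        return v->rep.in_memory;
  }
  return NULL;  /* not reached for a valid rep_kind */
}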

Now of course fitting this into the existing types, and ensuring that
memory is neither freed too early nor leaked (and that no other bugs creep
in), is probably a nightmare, and is why you said that this is a hard problem.
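
To make the ownership question a bit more concrete, here is a rough sketch
only.  All names in it (value_t, value_from_disk, value_destroy, parse_long)
are made up for illustration, and it uses plain malloc/free rather than
PostgreSQL's memory contexts, so it says nothing about how this would fit
the real code.  The idea is that the wrapper owns whichever representation
it currently holds, and drops the on-disk copy once it has been converted:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef enum { ON_DISK_REP, IN_MEMORY_REP } rep_kind_t;

/* Hypothetical per-type callbacks, not part of any PostgreSQL API. */
typedef void *(*converter_fn)(const char *on_disk);
typedef void (*destroy_fn)(void *in_memory);

typedef struct {
  rep_kind_t rep_kind;
  union {
    char *on_disk;   /* owned copy of the serialized string */
    void *in_memory; /* owned, produced by the converter */
  } rep;
} value_t;

/* Wrap a NUL-terminated on-disk string; the wrapper keeps a private copy. */
value_t value_from_disk(const char *bytes) {
  value_t v;
  size_t n = strlen(bytes) + 1;
  v.rep_kind = ON_DISK_REP;
  v.rep.on_disk = malloc(n);
  assert(v.rep.on_disk != NULL); /* sketch only: real code would report the error */
  memcpy(v.rep.on_disk, bytes, n);
  return v;
}

/* Convert on first access, free the on-disk copy, and cache the result. */
void *get_mem_rep(value_t *v, converter_fn converter) {
  if (v->rep_kind == ON_DISK_REP) {
    char *on_disk = v->rep.on_disk;
    void *res = converter(on_disk);
    free(on_disk);
    v->rep.in_memory = res;
    v->rep_kind = IN_MEMORY_REP;
  }
  return v->rep.in_memory;
}

/* Release whichever representation is currently held. */
void value_destroy(value_t *v, destroy_fn destroy) {
  if (v->rep_kind == IN_MEMORY_REP)
    destroy(v->rep.in_memory);
  else
    free(v->rep.on_disk);
  /* Reset to an empty on-disk state so a second destroy is harmless. */
  v->rep_kind = ON_DISK_REP;
  v->rep.on_disk = NULL;
}

/* Example converter: parses a decimal integer and counts its invocations. */
int parse_calls = 0;

void *parse_long(const char *s) {
  long *p = malloc(sizeof *p);
  assert(p != NULL); /* sketch only, as above */
  *p = strtol(s, NULL, 10);
  parse_calls++;
  return p;
}
```

Calling get_mem_rep(&v, parse_long) twice on a value built with
value_from_disk("42") runs the converter once and hands back the same
pointer both times; value_destroy(&v, free) then releases it.  Pointers
returned by get_mem_rep are only valid until value_destroy, which is
exactly the kind of lifetime rule that is hard to retrofit.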

Cheers,

Bene
