Re: Improve compression speeds in pg_lzcompress.c

From: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, Takeshi Yamamuro <yamamuro(dot)takeshi(at)lab(dot)ntt(dot)co(dot)jp>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improve compression speeds in pg_lzcompress.c
Date: 2013-01-09 07:56:12
Message-ID: CADbMkNPrKe2P7Oku=2sNGyLrd8+wQad_YBpvJtmJBtV17Tmf4A@mail.gmail.com
Lists: pgsql-hackers

> Personally, my biggest gripe about the way we do compression is that
> it's easy to detoast the same object lots of times. More generally,
> our in-memory representation of user data values is pretty much a
> mirror of our on-disk representation, even when that leads to excess
> conversions. Beyond what we do for TOAST, there's stuff like numeric
> where we not only detoast but then post-process the results into yet
> another internal form before performing any calculations - and then of
> course we have to convert back before returning from the calculation
> functions. And for things like XML, JSON, and hstore we have to
> repeatedly parse the string, every time someone wants to do anything
> with it. Of course, solving this is a very hard problem, and not
> solving it isn't a reason not to have more compression options - but
> more compression options will not solve the problems that I personally
> have in this area, by and large.

At the risk of saying something totally obvious and stupid, as I haven't
looked at the actual representation, this sounds like a memoisation
problem. In OCaml terms:

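(* bytes stands in for whatever the on-disk byte sequence type is *)
type 'a rep =
  | On_disk_rep of bytes
  | In_memory_rep of 'a

type 'a t = 'a rep ref

(* Convert on first access, then cache the in-memory form *)
let get_mem_rep t converter =
  match !t with
  | On_disk_rep seq ->
    let res = converter seq in
    t := In_memory_rep res;
    res
  | In_memory_rep x -> x
;;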

(If you need the other direction, that's straightforward too.)

Translating this into C is relatively straightforward if you have the
luxury of a fresh start and don't have to be super efficient:

#include <stddef.h>

typedef enum { ON_DISK_REP, IN_MEMORY_REP } rep_kind_t;

typedef struct {
    rep_kind_t rep_kind;
    union {
        char *on_disk;
        void *in_memory;
    } rep;
} t;

/* Convert on first access, then cache the in-memory form. */
void *get_mem_rep(t *v, void *(*converter)(char *)) {
    void *res;
    switch (v->rep_kind) {
    case ON_DISK_REP:
        res = converter(v->rep.on_disk);
        v->rep.in_memory = res;
        v->rep_kind = IN_MEMORY_REP;
        return res;
    case IN_MEMORY_REP:
        return v->rep.in_memory;
    }
    return NULL; /* not reached for valid rep_kind values */
}
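
To make the caching visible from the caller's side, here is a small
self-contained sketch. The names value_t, parse_int, parse_calls and
read_twice are made up for this example, and the converter just parses a
decimal string; none of this reflects the actual PostgreSQL datum layout.

```c
#include <stdlib.h>

typedef enum { ON_DISK_REP, IN_MEMORY_REP } rep_kind_t;

typedef struct {
    rep_kind_t rep_kind;
    union {
        char *on_disk;
        void *in_memory;
    } rep;
} value_t;

int parse_calls = 0;   /* counts how often the converter actually runs */

/* Hypothetical converter: parse a decimal string into a heap-allocated long. */
void *parse_int(char *on_disk) {
    long *res = malloc(sizeof *res);
    parse_calls++;
    if (res != NULL)
        *res = strtol(on_disk, NULL, 10);
    return res;
}

/* Convert on first access, then return the cached in-memory form. */
void *get_mem_rep(value_t *v, void *(*converter)(char *)) {
    if (v->rep_kind == ON_DISK_REP) {
        v->rep.in_memory = converter(v->rep.on_disk);
        v->rep_kind = IN_MEMORY_REP;
    }
    return v->rep.in_memory;
}

/* Reads the same value twice and returns the sum of both reads. */
long read_twice(char *text) {
    value_t v = { ON_DISK_REP, { text } };
    long *first = get_mem_rep(&v, parse_int);
    long *second = get_mem_rep(&v, parse_int);   /* served from the cache */
    long sum = (first != NULL && second != NULL) ? *first + *second : 0;
    free(v.rep.in_memory);
    return sum;
}
```

Calling read_twice("21") goes through get_mem_rep twice, but parse_int
only runs on the first access; the second call returns the cached pointer.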

Of course, fitting this into the existing types, while ensuring that
memory is neither freed too early nor leaked and that no other bugs creep
in, is probably a nightmare, which is presumably why you said that this is
a hard problem.
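
To picture the ownership question, one option is to let the cell carry a
destructor for its in-memory form, so that a single release function can
free whichever representation is currently held. This is only a sketch
under strong assumptions: the cell owns both buffers, nobody else keeps
pointers into them, and the names cell_t, cell_get, cell_release,
length_of and demo are invented here. Guaranteeing assumptions like these
in existing code is exactly the hard part.

```c
#include <stdlib.h>
#include <string.h>

typedef enum { ON_DISK_REP, IN_MEMORY_REP } rep_kind_t;

typedef struct {
    rep_kind_t rep_kind;
    union {
        char *on_disk;    /* assumed owned, allocated with malloc */
        void *in_memory;  /* assumed owned, released with free_mem */
    } rep;
    void (*free_mem)(void *);  /* destructor for the in-memory form */
} cell_t;

/* Convert on first access; the on-disk buffer is freed once it is replaced. */
void *cell_get(cell_t *c, void *(*converter)(const char *)) {
    if (c->rep_kind == ON_DISK_REP) {
        char *disk = c->rep.on_disk;
        void *mem = converter(disk);
        if (mem == NULL)
            return NULL;          /* conversion failed: keep the on-disk form */
        free(disk);
        c->rep.in_memory = mem;
        c->rep_kind = IN_MEMORY_REP;
    }
    return c->rep.in_memory;
}

/* Release whichever representation the cell currently holds. */
void cell_release(cell_t *c) {
    if (c->rep_kind == ON_DISK_REP)
        free(c->rep.on_disk);
    else
        c->free_mem(c->rep.in_memory);
    c->rep.on_disk = NULL;
    c->rep_kind = ON_DISK_REP;
}

/* Hypothetical converter: the in-memory form is just the string length. */
void *length_of(const char *disk) {
    size_t *n = malloc(sizeof *n);
    if (n != NULL)
        *n = strlen(disk);
    return n;
}

/* Build a cell, read it twice, release it; returns the cached length. */
size_t demo(const char *text) {
    cell_t c = { ON_DISK_REP, { NULL }, free };
    c.rep.on_disk = malloc(strlen(text) + 1);
    if (c.rep.on_disk == NULL)
        return 0;
    strcpy(c.rep.on_disk, text);
    size_t *a = cell_get(&c, length_of);
    size_t *b = cell_get(&c, length_of);   /* same pointer as a */
    size_t n = (a != NULL && a == b) ? *a : 0;
    cell_release(&c);
    return n;
}
```

As soon as a second pointer to either buffer exists somewhere else, this
simple scheme breaks down.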

Cheers,

Bene
