Re: Help with bulk read performance

From: Krzysztof Nienartowicz <krzysztof(dot)nienartowicz(at)gmail(dot)com>
To: <Daniel(dot)S(dot)Schaffer(at)noaa(dot)gov>
Cc: pgsql-performance(at)postgresql(dot)org, Jim Nasby <jim(at)nasby(dot)net>, Nick Matheson <Nick(dot)D(dot)Matheson(at)noaa(dot)gov>, Andy Colson <andy(at)squeakycode(dot)net>
Subject: Re: Help with bulk read performance
Date: 2010-12-16 13:39:11
Message-ID: 8B1B00CC-D1A3-46B6-90B3-C862900F8DAB@gmail.com
Lists: pgsql-performance

Hello Daniel,
We have the same scenario with native Java arrays, so we store bytea and do the conversion on the client side; for server-side SQL, plJava comes in very handy.

Not sure how you want to create the stored procedures that convert internally, but this is how we do it:

One has to define the conversion routines in Java and then deploy them to plJava. Scanning through this field is still CPU bound, around 2x slower than with native arrays and 6x slower than with blobs, but at least one has the ability. It's even possible to pass the arrays to plR for some statistical processing directly, so depending on the operations you do it may still be cheaper than streaming everything over the wire to a regular JDBC client.

1. Deploy a class like this within plJava (null handling left out for brevity):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class myArrayConversion
{
    public myArrayConversion() {}

    /** Serialize a double array to a bytea. */
    public static byte[] convertDoubleArrayToBytea(double[] obj) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(baos);
        oos.writeObject(obj);
        oos.flush();  // make sure the buffered stream is fully written out
        return baos.toByteArray();
    }

    /** Serialize an int array to a bytea. */
    public static byte[] convertIntToBytea(int[] obj) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(baos);
        oos.writeObject(obj);
        oos.flush();
        return baos.toByteArray();
    }

    /** Deserialize a bytea back to a double array. */
    public static double[] convertToDoubleArray(byte[] obj) throws IOException,
            ClassNotFoundException {
        // Deserialize from a byte array
        ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(obj));
        return (double[]) ois.readObject();
    }

    /** Deserialize a bytea back to an int array. */
    public static int[] convertIntToArray(byte[] obj) throws IOException,
            ClassNotFoundException {
        // Deserialize from a byte array
        ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(obj));
        return (int[]) ois.readObject();
    }

    // conversions for other array types...
}
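
To make the class visible to the backend, install the jar and put it on the schema's classpath with plJava's standard management functions. A minimal sketch; the jar path and name are placeholders for whatever your build produces:

<sql>
-- install the compiled class into the database and add it to the
-- public schema's classpath
SELECT sqlj.install_jar('file:///tmp/mappingPkg.jar', 'mappingPkg', true);
SELECT sqlj.set_classpath('public', 'mappingPkg');
</sql>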

2. Then create the mapping functions as the db owner:

<sql>
CREATE OR REPLACE FUNCTION public.convertDoubleArrayToBytea(double precision[])
RETURNS bytea AS
'mappingPkg.convertDoubleArrayToBytea(double[])'
LANGUAGE 'javau' IMMUTABLE
COST 50;

GRANT EXECUTE ON FUNCTION public.convertDoubleArrayToBytea(double precision[]) TO public;

CREATE OR REPLACE FUNCTION public.convertToDoubleArray(bytea)
RETURNS double precision[] AS
'mappingPkg.convertToDoubleArray(byte[])'
LANGUAGE 'javau' IMMUTABLE
COST 50;

GRANT EXECUTE ON FUNCTION public.convertToDoubleArray(bytea) TO public;
</sql>

then you can have conversion either way:

select convertToDoubleArray(convertDoubleArrayToBytea(array[i::float8,1.1,100.1,i*0.1]::float8[])) from generate_series(1,100) i;

so you'd also be able to create bytea objects from native SQL arrays within SQL.
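
On the client side, the same bytea can be read back with plain JDBC and deserialized the usual way. A minimal sketch (connection details, table and column names are made up):

import java.io.ByteArrayInputStream;
import java.io.ObjectInputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ReadSerializedArrays {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "pass");
        Statement st = conn.createStatement();
        ResultSet rs = st.executeQuery("SELECT data FROM measurements");
        while (rs.next()) {
            // each bytea holds a Java-serialized double[]
            ObjectInputStream ois =
                    new ObjectInputStream(new ByteArrayInputStream(rs.getBytes(1)));
            double[] values = (double[]) ois.readObject();
            System.out.println(values.length + " values");
        }
        rs.close();
        st.close();
        conn.close();
    }
}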

PLJava seems to be enjoying a revival these days thanks to Johann 'Myrkraverk' Oskarsson, who fixed several long-standing bugs. Check out the plJava list for details.

Krzysztof

On Dec 16, 2010, at 10:22 AM, pgsql-performance-owner(at)postgresql(dot)org wrote:

> From: Dan Schaffer <Daniel(dot)S(dot)Schaffer(at)noaa(dot)gov>
> Date: December 15, 2010 9:15:14 PM GMT+01:00
> To: Andy Colson <andy(at)squeakycode(dot)net>
> Cc: Jim Nasby <jim(at)nasby(dot)net>, pgsql-performance(at)postgresql(dot)org, Nick Matheson <Nick(dot)D(dot)Matheson(at)noaa(dot)gov>
> Subject: Re: Help with bulk read performance
> Reply-To: Daniel(dot)S(dot)Schaffer(at)noaa(dot)gov
>
>
> Hi,
> My name is Dan and I'm a co-worker of Nick Matheson, who initially submitted this question (because the mail group had me blacklisted for a while for some reason).
>
>
> Thank you for all of the suggestions. We were able to improve our bulk read performance from 3 MB/s to 60 MB/s (assuming the data are NOT in cache in both cases) by doing the following:
>
> 1. Storing the data in a "bytea" column instead of an "array" column.
> 2. Retrieving the data via the Postgres 9 CopyManager#copyOut(String sql, OutputStream stream) method
>
> The key to the dramatic improvement appears to be the reduction in packing and unpacking time on the server and client, respectively. The server-side packing occurs when the retrieved data are packed into a bytestream for sending across the network. Storing the data as a simple byte array reduces this time substantially. The client-side unpacking time is spent generating a ResultSet object. Unpacking the bytestream into the desired arrays of floats by hand instead made this time close to negligible.
>
> The only downside of storing the data in byte arrays is the loss of transparency. That is, a simple "select *" of a few rows shows bytes instead of floats. We hope to mitigate this by writing a simple stored procedure that unpacks the bytes into floats.
>
> A couple of other results:
>
> If the data are stored as a byte array but retrieved into a ResultSet, the unpacking time goes up by an order of magnitude and the observed total throughput is 25 MB/s. If the data are stored in a Postgres float array and unpacked into a byte stream, the observed throughput is 20 MB/s.
>
> Dan (and Nick)
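
For anyone who wants to try the copyOut approach Dan describes, below is a rough client-side sketch. The connection details, table and column names are invented, and the bytea column is assumed to hold raw big-endian float8 values packed back to back; PostgreSQL's binary COPY format is parsed by hand:

import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.ArrayList;
import java.util.List;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class BulkReadSketch {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "pass");
        CopyManager cm = ((PGConnection) conn).getCopyAPI();

        // Stream the rows out in PostgreSQL's binary COPY format instead of
        // materializing a ResultSet (buffered in memory here for simplicity).
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        cm.copyOut("COPY (SELECT data FROM measurements) TO STDOUT WITH BINARY", out);

        ByteBuffer buf = ByteBuffer.wrap(out.toByteArray());
        buf.position(11);                            // skip the 11-byte PGCOPY signature
        buf.getInt();                                // flags field
        buf.position(buf.position() + buf.getInt()); // skip header extension area

        List<double[]> arrays = new ArrayList<double[]>();
        while (true) {
            short fieldCount = buf.getShort();
            if (fieldCount == -1) break;             // trailer: end of data
            for (int f = 0; f < fieldCount; f++) {
                int len = buf.getInt();
                if (len == -1) continue;             // SQL NULL
                double[] values = new double[len / 8];
                for (int i = 0; i < values.length; i++)
                    values[i] = buf.getDouble();     // bytea assumed to be packed float8s
                arrays.add(values);
            }
        }
        System.out.println("read " + arrays.size() + " arrays");
        conn.close();
    }
}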
