Re: Turbo ODBC

From: Matthew Rocklin <mrocklin(at)continuum(dot)io>
To: Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Wes McKinney <wesmckinn(at)gmail(dot)com>
Cc: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>, "psycopg(at)postgresql(dot)org" <psycopg(at)postgresql(dot)org>
Subject: Re: Turbo ODBC
Date: 2017-01-16 23:24:42
Message-ID: CAM4sjbmeRHLD9RqmqYP6zZoojy8z7Y+zsRnSQCGUaUXJvd-EkQ@mail.gmail.com
Lists: psycopg

CC'ing Wes McKinney in case he's interested in this problem (how do we
efficiently move data from Postgres into Python).

On Mon, Jan 16, 2017 at 6:14 PM, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com> wrote:

> On 1/15/17 3:25 PM, Adrian Klaver wrote:
>
>> Have you looked at asyncpg:
>>
>
> I'm pulling Matt Rocklin in, who I've been working with on some plpython
> improvements. Here are his off-list replies to me:
>
> On 1/15/17 5:15 PM, Matthew Rocklin wrote:
>>
>>> You might also contact Wes McKinney and check out
>>> Arrow: http://pyarrow.readthedocs.io/en/latest/
>>>
>>> This project was designed for inter-system data interchange. It would
>>> help with conversion to Spark and other popular tools as well.
>>>
>>> On Sun, Jan 15, 2017 at 6:01 PM, Matthew Rocklin <mrocklin(at)continuum(dot)io
>>> <mailto:mrocklin(at)continuum(dot)io>> wrote:
>>>
>>> I haven't taken a look at asyncpg.
>>>
>>> Quick note, you might want to be careful about mixing HDFS and
>>> HDF5. Postgres's competition for data storage is HDF5, not HDFS
>>> (which stands for the Hadoop Distributed File System).
>>>
>>> I still think that the best thing to do here would be to dump out
>>> python arrays for each of the columns in a result set. I suspect
>>> that you could handily beat /any/ system that doesn't do this. This
>>> would avoid any performance pain of building up Python objects, and
>>> can continue to use just the standard library.
>>>
>>> You can stick to Pure Python and still send arrays encoded with
>>> bytes. We can convert these to NumPy or Pandas trivially with zero
>>> overhead.
>>>
>>> In [1]: from array import array # this is in the standard library
>>> In [2]: b = b'\x00' * 16 # 16 zeroes as a bytestring
>>> In [3]: a = array('L', b) # eight byte unsigned integer
>>> In [4]: a
>>> Out[4]: array('L', [0, 0])
>>>
>>> We would only be bound by
>>>
>>> 1. The cost in Postgres to transpose the data from row-major to
>>> column-major
>>> 2. The cost to move bytes across a socket
>>>
>>> This should run at gigabytes per second. Everything else is likely
>>> to be competing in the 50-100 MB/s range at top speed, is my guess.
>>>
>>
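
To make the zero-copy conversion described above concrete, here is a minimal
sketch (assuming NumPy is available; 'Q' is used instead of 'L' because 'Q' is
guaranteed to be an 8-byte typecode, while 'L' is platform-dependent):

    # Sketch of the zero-copy path: raw bytes -> stdlib array -> NumPy.
    # column_bytes stands in for column data read off the socket.
    from array import array
    import numpy as np

    column_bytes = b'\x00' * 16                   # two 8-byte unsigned integers
    a = array('Q', column_bytes)                  # stdlib-only view of the column
    np_col = np.frombuffer(column_bytes, dtype=np.uint64)  # zero-copy NumPy view
    assert list(a) == np_col.tolist() == [0, 0]
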
> My guess is that the big gains for external communication would come from:
>
> 1) reducing latency effects
> 2) reducing data marshaling
>
> 1 could be accomplished in various ways, and some changes to the FE-BE
> protocol might be helpful. If we can demonstrate a clear win we might be
> able to get protocol changes. I do suspect this needs to be tunable,
> though. My 4x improvement to plpy.execute comes from changing from the
> equivalent of a large batch to single-row operations. Moving data to Python
> over a local filesystem socket would be hurt by a large batch size, while
> something on a high-latency connection would presumably benefit from
> moderately large batches, since that reduces overall TCP overhead.
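
For instance, batch size can already be tuned from the client side with
psycopg2's server-side (named) cursors, whose itersize controls how many rows
each network round trip fetches. A sketch, with a placeholder DSN, table, and
query:

    # Sketch: tuning rows-per-round-trip with a psycopg2 named cursor.
    # The DSN, table, and query are placeholders.
    import psycopg2

    conn = psycopg2.connect("dbname=test")          # placeholder DSN
    cur = conn.cursor(name="stream")                # named => server-side cursor
    cur.itersize = 10000                            # rows fetched per round trip
    cur.execute("SELECT id, value FROM big_table")  # placeholder query
    for row in cur:                                 # iterates in itersize batches
        pass                                        # process each row here
    conn.close()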
>
> I think 2 would necessitate FE-BE changes, although some of that could be
> tested without the changes. Right now, everything is going to get marshaled
> into a simple text format (i.e. ints become a string of digits), but for
> test purposes there are ways you could run that through the binary output
> functions (in the case of int4, you'd get 4 bytes in network order, which
> should be faster to handle and could be passed directly to something like
> an ndarray).
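
A minimal sketch of that, assuming NumPy: 4-byte network-order integers can be
viewed as an ndarray without creating per-value Python objects (the byte string
below just stands in for int4 binary-format output):

    # Sketch: binary int4 output (network byte order) straight into an ndarray.
    import numpy as np

    buf = (1).to_bytes(4, 'big') + (2).to_bytes(4, 'big') + (300).to_bytes(4, 'big')
    col = np.frombuffer(buf, dtype='>i4')   # big-endian (network order) int4
    native = col.astype(np.int32)           # convert to native byte order if needed
    print(native.tolist())                  # [1, 2, 300]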
>
> The two problems you run into with the array type are handling NULLs and
> building columns from rows. You could allow the option of specifying that
> NULLs won't be allowed in specific fields, or you could use something like
> a Pandas Series that provides other ways of handling the equivalent of NULL.
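
A rough sketch of those two options, assuming NumPy and pandas: carry a
validity mask next to the values, or let a float Series represent NULL as NaN:

    # Sketch: two ways to represent NULLs for an int4 column.
    import numpy as np
    import pandas as pd

    values = np.array([1, 2, 0, 4], dtype=np.int32)      # 0 is a dummy where NULL occurred
    null_mask = np.array([False, False, True, False])    # True marks a NULL

    masked = np.ma.masked_array(values, mask=null_mask)  # keeps the integer dtype
    series = pd.Series(values, dtype='float64').where(~null_mask)  # NULL -> NaN
    print(masked)            # [1 2 -- 4]
    print(series.tolist())   # [1.0, 2.0, nan, 4.0]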
>
> For the row to column conversion, there's simply no way to get around the
> pattern of appending items one at a time to an array-like container (list,
> ndarray, Series, etc), and dealing with the memory allocation problem. I've
> looked at Python's list code and it's certainly no worse at that than what
> Postgres would do. The big reason I would do that work in Python, though, is
> that you then have the ability to use any number of types for that; they just
> have to support the equivalent of append().
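
A sketch of that append-based row-to-column pattern; the containers only need
to support append(), so plain lists, array.array, or anything similar works:

    # Sketch: transposing rows into per-column containers via append().
    from array import array

    rows = [(1, 'a'), (2, 'b'), (3, 'c')]   # stand-in for a result set

    columns = [array('q'), []]              # one append()-able container per column
    for row in rows:
        for col, value in zip(columns, row):
            col.append(value)

    print(list(columns[0]), columns[1])     # [1, 2, 3] ['a', 'b', 'c']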
>
> --
> Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
> Experts in Analytics, Data Architecture and PostgreSQL
> Data in Trouble? Get it in Treble! http://BlueTreble.com
> 855-TREBLE2 (855-873-2532)
>
