[ANN] pg2arrow

From: Kohei KaiGai <kaigai(at)heterodb(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, pgsql-general(at)lists(dot)postgresql(dot)org
Subject: [ANN] pg2arrow
Date: 2019-01-28 01:43:10
Message-ID: CAOP8fzaG+yy7fo7V7RtFCV0g3MEa2jtEf6ouNoG3_mWw7hupUg@mail.gmail.com
Lists: pgsql-general pgsql-hackers

Hello,

I made a utility program to dump a PostgreSQL database in Apache Arrow format.

Apache Arrow is a columnar data format for structured data, actively
developed by the Spark community and a broad range of other communities.
It is a suitable data representation for static, read-only data with a
large number of rows.
Many data analytics tools support Apache Arrow as a common data
exchange format.
See https://arrow.apache.org/
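
To illustrate why a columnar layout suits large, read-only data sets, here
is a minimal sketch in plain Python. It is not Arrow's actual memory layout,
only an analogy: each column lives in its own contiguous typed buffer, so an
aggregate over one column does not touch the others. The sample rows and the
column names ("id", "temp") are hypothetical.

```python
from array import array

# Hypothetical sensor readings in row-oriented form: one tuple per row.
rows = [
    (1, 20.0),
    (2, 21.0),
    (3, 22.0),
]

# Column-oriented form: one contiguous typed buffer per column.
# This only mimics the idea behind a columnar format; it is not Arrow itself.
columns = {
    "id": array("i", (r[0] for r in rows)),
    "temp": array("d", (r[1] for r in rows)),
}

# An aggregate over one column reads only that column's buffer.
avg_temp = sum(columns["temp"]) / len(columns["temp"])
print(avg_temp)
```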

* pg2arrow
https://github.com/heterodb/pg2arrow

usage:
$ ./pg2arrow -h localhost postgres -c 'SELECT * FROM hogehoge LIMIT 10000' -o /tmp/hogehoge
--> fetches the results of the query, then writes them out to "/tmp/hogehoge"
$ ./pg2arrow --dump /tmp/hogehoge
--> shows the schema definition of "/tmp/hogehoge"

$ python
>>> import pyarrow as pa
>>> X = pa.RecordBatchFileReader("/tmp/hogehoge").read_all()
>>> X.schema
id: int32
a: int64
b: double
c: struct<x: int32, y: double, z: decimal(30, 11), memo: string>
  child 0, x: int32
  child 1, y: double
  child 2, z: decimal(30, 11)
  child 3, memo: string
d: string
e: double
ymd: date32[day]

--> reads the Apache Arrow file using PyArrow, then shows its schema definition.

It is also groundwork for my current development, arrow_fdw, which
allows scanning the configured Apache Arrow file(s) like a regular
PostgreSQL table.
I expect that integrating arrow_fdw with SSD2GPU Direct SQL of PG-Strom
can pull out the maximum capability of the latest hardware (NVMe and GPU).
It is likely an ideal configuration for processing log data generated
by many sensors.

Please check it out.
Comments, ideas, bug reports, and other feedback are welcome.

As an aside, NVIDIA announced their RAPIDS framework for exchanging data
frames on GPU among multiple ML/analytics solutions. It also uses Apache
Arrow as a common format for data exchange, so this work is groundwork
for that as well.
https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/

Thanks,
--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kaigai(at)heterodb(dot)com>
