Re: Does people favor to have matrix data type?

From: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
To: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>, Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>, Joe Conway <mail(at)joeconway(dot)com>, Ants Aasma <ants(dot)aasma(at)eesti(dot)ee>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Does people favor to have matrix data type?
Date: 2016-06-01 14:31:47
Message-ID: a0758711-876a-861c-9379-cc9d26e0a0db@BlueTreble.com
Lists: pgsql-hackers

On 5/30/16 9:05 PM, Kouhei Kaigai wrote:
> Due to performance reason, location of each element must be deterministic
> without walking on the data structure. This approach guarantees we can
> reach individual element with 2 steps.

Agreed.

On various other points...

Yes, please keep the discussion here, even when it relates only to PL/R.
Whatever is being done for R needs to be done for plpython as well. I've
looked at ways to improve analytics in plpython related to this, and it
looks like I need to take a look at the fast-path function stuff. One of
the things I've pondered for storing ndarrays in Postgres is how to
reduce or eliminate the need to copy data from one memory region to
another. It would be nice if there were a way to take memory that was
allocated by one manager (i.e., Python's) and transfer ownership of that
memory directly to Postgres without having to copy everything. Obviously
you'd want to go the other way as well. IIRC cython's memory manager
behaves like palloc in that very large allocations are basically
ignored completely, so this should be possible in that case.
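
To illustrate the copy-versus-share distinction, here is a minimal sketch
using only Python's standard buffer protocol. It does not show handing
ownership over to palloc, which would need C-level support that is not
assumed to exist here; the function name is made up for the example.

```python
from array import array


def view_vs_copy():
    """Contrast a zero-copy view with a real copy of the same buffer."""
    data = array("d", [1.0, 2.0, 3.0, 4.0])  # memory owned by Python

    view = memoryview(data)  # shares data's memory, nothing is copied
    copy = array("d", data)  # allocates new memory and copies every element

    data[0] = 99.0           # mutate the original buffer

    # The view observes the change; the copy still holds the old value.
    return view[0], copy[0], view.nbytes


shared, copied, nbytes = view_vs_copy()
print(shared, copied, nbytes)
```

The view costs nothing beyond a small wrapper object, while the copy
doubles the memory in use, which is the cost the paragraph above would
like to avoid for large ndarrays.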

One thing I don't understand is why this type needs to be limited to 1
or 2 dimensions. Isn't the important thing how many individual elements
you can fit into the GPU? So if you can fit a 1024x1024, you could also
fit a 100x100x100, a 32x32x32x32, etc. At low enough values maybe that
stops making sense, but I don't see why there needs to be an artificial
limit. I think what's important for something like kNN is that the
storage is optimized for this, which I think means treating the highest
dimension as if it were a list. I don't know if it then matters whether
the lower dimensions are C style vs FORTRAN style. Other algorithms
might want different storage.
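
A rough sketch of the arithmetic behind this, assuming a dense array with
no padding. The helper names `strides` and `offset` are hypothetical and
not part of any proposed patch; the point is only that an element's
position can be computed directly for any number of dimensions, in
either C or FORTRAN order.

```python
from math import prod


def strides(shape, order="C"):
    """Element strides of a dense array: 'C' is row-major, 'F' is column-major."""
    n = len(shape)
    result = [1] * n
    if order == "C":
        # Last axis varies fastest.
        for i in range(n - 2, -1, -1):
            result[i] = result[i + 1] * shape[i + 1]
    elif order == "F":
        # First axis varies fastest.
        for i in range(1, n):
            result[i] = result[i - 1] * shape[i - 1]
    else:
        raise ValueError("order must be 'C' or 'F'")
    return result


def offset(index, shape, order="C"):
    """Flat position of one element, computed without walking the data."""
    return sum(i * s for i, s in zip(index, strides(shape, order)))


# If a 1024x1024 matrix fits, so does any shape with no more elements.
budget = prod((1024, 1024))
for shape in [(1024, 1024), (100, 100, 100), (32, 32, 32, 32)]:
    print(shape, prod(shape), prod(shape) <= budget)
```

In C order the first axis has the largest stride, so each slice along it
is one contiguous run of elements. That is one possible reading of
treating a dimension "as a list" of lower-dimensional blocks.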

Something else to consider is the 1GB TOAST limit. I'm pretty sure that's
why MADlib stores matrices as a table of vectors. I know for certain
it's a problem they run into, because they've discussed it on their
mailing list.
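
A back-of-the-envelope sketch of that limit, assuming a cap of 2**30
bytes on a single stored value, 8-byte float elements, and no header
overhead. The function names are hypothetical, and the row-splitting
helper only illustrates the "table of vectors" idea; it is not MADlib's
actual layout.

```python
TOAST_LIMIT = 2**30  # assumed 1 GB cap on one stored value, overhead ignored


def fits_in_one_value(rows, cols, elem_size=8):
    """True if a dense rows x cols matrix fits under the assumed cap."""
    return rows * cols * elem_size <= TOAST_LIMIT


def as_row_vectors(matrix):
    """Split a matrix (list of lists) into (row_number, vector) pairs,
    one pair per table row, so no single value holds the whole matrix."""
    return [(i, list(row)) for i, row in enumerate(matrix, start=1)]


print(fits_in_one_value(11585, 11585))  # just under the cap
print(fits_in_one_value(11586, 11586))  # just over the cap
print(as_row_vectors([[1.0, 2.0], [3.0, 4.0]]))
```

Under these assumptions a square float8 matrix stops fitting in one value
at roughly 11,586 rows per side, while each individual row of such a
matrix is well under 100 kB.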

BTW, take a look at MADlib svec[1]... ISTM that's just a special case of
what you're describing with entire grids being zero (or vice-versa).
There might be some commonality there.
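
For readers unfamiliar with svec: as far as I can tell from the linked
documentation, it compresses runs of repeated values (typically zeros)
with run-length encoding. The sketch below shows that general technique
only; it is not MADlib's code or storage format, and the function names
are invented for the example.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into parallel (counts, values) lists."""
    counts, uniques = [], []
    for value, run in groupby(values):
        counts.append(sum(1 for _ in run))  # length of this run
        uniques.append(value)
    return counts, uniques


def rle_decode(counts, uniques):
    """Expand (counts, values) back into the original dense sequence."""
    dense = []
    for count, value in zip(counts, uniques):
        dense.extend([value] * count)
    return dense


vector = [0, 0, 0, 5, 5, 1]
encoded = rle_encode(vector)
print(encoded)
print(rle_decode(*encoded) == vector)
```

A matrix whose grids are mostly zero (or mostly any one value) would
collapse to a handful of runs under this kind of encoding.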

[1] https://madlib.incubator.apache.org/docs/v1.8/group__grp__svec.html
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532) mobile: 512-569-9461
