BLOB / CLOB support in PostgreSQL

From: Vladimir Sitnikov <sitnikov(dot)vladimir(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: PostgreSQL JDBC list <pgsql-jdbc(at)postgresql(dot)org>
Subject: BLOB / CLOB support in PostgreSQL
Date: 2018-07-25 20:29:08
Message-ID: CAB=Je-EN_Wc-SURAfUAgzLh-kcmxD_migPaO1Jx+SksTAMEuog@mail.gmail.com
Lists: pgsql-hackers pgsql-jdbc

Hi,

According to Pgjdbc's GitHub statistics, the most popular page is
https://github.com/pgjdbc/pgjdbc/issues/1102, which is the
"org.postgresql.jdbc.PgConnection.createClob() is not yet implemented"
issue (1600 visits from 1400 unique visitors per fortnight).
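
For reference, a minimal reproducer (the connection URL and credentials
below are placeholders):

    import java.sql.Clob;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;

    public class CreateClobRepro {
        public static void main(String[] args) throws SQLException {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/test", "test", "test")) {
                // Fails with "...createClob() is not yet implemented"
                Clob clob = conn.createClob();
            }
        }
    }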

There are workarounds to silence the error; however, I'm sure CLOB (as in
"streaming text datatype") is not yet supported in the PostgreSQL backend.

I have browsed the pgsql-hackers mailing list regarding CLOB, and it looks
like there's no relevant discussion, so I'm quite sure I've done my homework
on "check prior mails regarding the subject".

**Issue**: there's no sensible way to map java.sql.Clob to the existing
backend datatypes. `text` can't stream (it always loads the data fully),
and it is limited to 1GB.

Java distinguishes java.sql.Blob and java.sql.Clob.
Blob is a "binary stream with streaming features". It can be mapped to the
existing "Large Objects", and the existing Large Object API is a reasonable
fit for the implementation.
There are glitches (like "pgjdbc has to perform 4 API calls,
tell/seek/tell/seek, just to get the LO length once"), however it is
fine.
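
For illustration, that length dance through pgjdbc's public Large Object
API looks roughly like this (`oid` is assumed to reference an existing LO):

    import java.sql.Connection;
    import java.sql.SQLException;
    import org.postgresql.PGConnection;
    import org.postgresql.largeobject.LargeObject;
    import org.postgresql.largeobject.LargeObjectManager;

    public class LoLength {
        /** Returns the LO length via the tell/seek/tell/seek dance. */
        public static long length(Connection conn, long oid) throws SQLException {
            LargeObjectManager lom =
                conn.unwrap(PGConnection.class).getLargeObjectAPI();
            LargeObject lo = lom.open(oid, LargeObjectManager.READ);
            try {
                int saved = lo.tell();                 // call 1: remember position
                lo.seek(0, LargeObject.SEEK_END);      // call 2: jump to the end
                int len = lo.tell();                   // call 3: position == length
                lo.seek(saved, LargeObject.SEEK_SET);  // call 4: restore position
                return len;
            } finally {
                lo.close();
            }
        }
    }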

The Java Clob API is just a dozen methods (13, to be exact), however there
are two major issues there:
1) A "Large Object" is just a binary object. There's no way to tell whether
the contents are a UTF-8 string, a Windows-1251 string, a protobuf-encoded
message, or whatever.
That is, if pgjdbc encodes java.sql.Clob (a large string) into some form of
binary (e.g. UTF-8) and stores it as a PostgreSQL Large Object, then this LO
automatically becomes a "pgjdbc-specific blob".
There's no way to use the data in SQL or pl/pgsql or other applications.
For instance, one can't perform " where clob_column like '%abcd%' "
2) "characters". For instance, `long length()` should return the number of
characters in the string.
If pgjdbc implements java.sql.Clob as a UTF-8 encoded binary, then it would
have to **process the whole blob** in order to measure string length.
The same thing goes for `String getSubString(long pos, int length)`. It
would have to process all the bytes up to character `long pos` (how
otherwise it would know byte position for character `pos`?).
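
To make issue 2 concrete: with a raw UTF-8 Large Object, `length()`
degenerates into a full decode of the object. A minimal sketch, assuming
the LO is exposed as an InputStream:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.nio.charset.StandardCharsets;

    public class Utf8ClobLength {
        /** O(N): every byte must be decoded just to count characters. */
        public static long length(InputStream utf8Bytes) throws IOException {
            Reader r = new InputStreamReader(utf8Bytes, StandardCharsets.UTF_8);
            long chars = 0;
            char[] buf = new char[8192];
            int n;
            while ((n = r.read(buf)) != -1) {
                chars += n; // counts chars, as java.sql.Clob#length expects
            }
            return chars;
        }
    }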

Currently pgjdbc encodes strings using client_encoding and stores them as
LOs, and it has been like that for ages. That might easily produce garbage
in the DB if clients use various encodings; however, pgjdbc's default
setting is to use UTF-8, so the problem should not be that visible.

I fully understand LO has issues with "removing obsolete entries"; however,
mapping java.sql.Clob to `text` seems to make even less sense.
For instance: suppose pgjdbc chooses "Clob == text". Then a client hits the
1GB limit.

"Streaming TOAST data" looks more like a research project rather than a
clear thing to implement.

What if there were a standard for storing strings in Large Objects?
For instance: "a CLOB is a UTF-8 encoded string stored as a single LO". When
such an agreement exists, various applications could read and write the
data.
Of course, UTF-8 might not suit everybody, so the format might be "a prefix
that specifies the encoding, then the encoded string".
Of course, both variations above fail to support streaming (as in "one
needs to process all the contents in order to get the last character"), so
it might be better to use
"a prefix that specifies the encoding + an 'index block' (that records byte
offsets for each 1M characters) + the encoded string" (see the sketch
below).
I'm sure there are known algorithms to store strings in a binary format
that support subsequence / overwrite / length in reasonable time (O(1) or
O(N) with a reasonable constant).
There might be an option to use UTF-16 (so each "character" becomes a fixed
2 bytes, surrogate pairs aside), however that would come at a cost in space
usage.
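
To make the layout concrete, here is a sketch of the "encoding prefix +
index block + data" variant. The class name, field widths, and chunk size
below are invented for illustration; a real format would have to agree on
these:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    public class ClobLayoutSketch {
        // Index granularity: one byte-offset entry per 1M characters.
        static final int CHARS_PER_CHUNK = 1 << 20;

        public static byte[] encode(String value) throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            DataOutputStream data = new DataOutputStream(out);

            // 1. Prefix: length-prefixed charset name.
            byte[] charset = "UTF-8".getBytes(StandardCharsets.US_ASCII);
            data.writeShort(charset.length);
            data.write(charset);

            // 2. Index block: byte offset (relative to the data section)
            // of every CHARS_PER_CHUNK-th character.
            // NB: a real implementation must avoid splitting surrogate
            // pairs at chunk boundaries; ignored here for brevity.
            int chunks = (value.length() + CHARS_PER_CHUNK - 1) / CHARS_PER_CHUNK;
            data.writeInt(chunks);
            ByteArrayOutputStream body = new ByteArrayOutputStream();
            long offset = 0;
            for (int i = 0; i < chunks; i++) {
                data.writeLong(offset);
                int from = i * CHARS_PER_CHUNK;
                int to = Math.min(value.length(), from + CHARS_PER_CHUNK);
                byte[] encoded = value.substring(from, to)
                                      .getBytes(StandardCharsets.UTF_8);
                offset += encoded.length;
                body.write(encoded);
            }

            // 3. The encoded string itself.
            body.writeTo(data);
            data.flush();
            return out.toByteArray();
        }
    }

With such an index, `getSubString(pos, length)` only needs to read the
offset entry for `pos / 1M`, seek there, and decode at most one chunk
before reaching `pos`, so random access costs O(chunk size) rather than
O(whole contents).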

**Here goes the question**: do you think such an implementation ("large
string stored in Large Objects") could eventually be merged into core?

Q2: any ideas / existing libraries for random-access read-write large
strings stored as binary?

PS. The relevant pgjdbc PR is https://github.com/pgjdbc/pgjdbc/pull/1272

--
Regards,
Vladimir Sitnikov
