Pl/Java - next step?

From: "Thomas Hallgren" <thhal(at)mailblocks(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Pl/Java - next step?
Date: 2004-02-21 10:04:10
Message-ID: c17ae3$dst$1@news.hub.org
Lists: pgsql-hackers

Two Pl/Java implementations exist today. Due to the architecture of
PostgreSQL, compromises have been made in both of them to deal with the fact
that each connection lives in its own process. One, I'll call it
"Pl/Java_JNI" will spawn a JVM on demand for each connection and the other,
"Pl/Java_remote", will spawn at least one JVM that lives in a process of its
own and use an inter-process calling mechanism.

I can see PostgreSQL moving forward in one of four different directions:

1. Select Pl/Java_JNI.
2. Select Pl/Java_remote.
3. Choose both and agree on the SQL + Java semantics
4. Make the postmaster spawn threads rather than processes (controversial?
Nah :-) )

As the one behind Pl/Java_JNI I'm perhaps not the most objective person when
it comes to choice, but I'll make an effort here and try to list the pros
and cons of each choice. My objective is to start a healthy discussion. I
think Pl/Java might boost the usability of PostgreSQL quite a bit, and with
the almost explosive growth of the Java community it's essential that we
conclude this sooner rather than later.

** 1. Select Pl/Java_JNI **
#Pros:#
- Each call becomes extremely lightweight.
JNI is in essence a straightforward in-process function invocation.
Minimizing call overhead becomes very important for functions that are
called very often and for functions that need to call back into the backend
several times.

- Minimum resource utilization when passing values.
Values can be passed by reference. TriggerData, TupleDesc, HeapTuple, byte
arrays etc. need not be copied. Return values can be allocated directly in
the correct MemoryContext.

- Transaction visibility
Using a JDBC driver that's implemented directly on top of SPI ensures that
the transaction visibility is correct without the need to either propagate a
transaction context or make remote calls back into the backend.

- Connection isolation
Easy to use since the developer "owns" the whole JVM. There's no need to
terminate all connections in order to replace code or to establish a debug
session. Migration can take place gradually.

- Simplicity
No hassle setting up inter-process communication or maintaining a separate
JVM.

- Modern JVMs are less demanding
Sun and other JVM vendors are making serious efforts to make the JVM more
adaptable. Java is no longer used only for heavyweight server processing;
small utility programs are becoming more and more common. Thus, decreasing
start-up time and the ability to adapt resource consumption have very high
priority. See what Java 1.5 does here:
http://java.sun.com/j2se/1.5.0/docs/relnotes/features.html#vm

- Well-known programming environment
JNI is standard. A potential developer of the code has access to on-line
training.

#Cons:#
- Resource consumption.
A JVM is expensive from a resource perspective.

- Connection start-up time is high.
Booting a JVM takes time. Setups where connections that make invocations to
Pl/Java are created and closed frequently will suffer from this.

- Java execution model differs from the one used by PostgreSQL
Java uses multithreading whether you like it or not, and the JVM will throw
exceptions. Pl/Java_JNI handles this by introducing some macros that a
potential developer who makes additions to the port must be aware of. This
also introduces limitations for the user of Pl/Java_JNI (such as very
limited functionality once an error has been generated by the backend).

** 2. Select Pl/Java_remote **
#Pros:#
- Each connection becomes fairly lightweight.
A connection is represented as a thread in the remote JVM. Threads are much
less expensive than a full-blown JVM.

- Connection start-up time is low
Startup time will be very quick since thread creation is cheap. Even quicker
if a thread-pool is utilized.

- Reuse of an existing JVM
Small systems might use the same JVM to run an app-server as the one used by
triggers and functions. Albeit not great from a "separation of concern"
perspective, it might be very efficient for special needs.

- Ability to run the JVM on another server
The JVM can run on a server different from the one running the backend
process. If the number of calls is small in relation to the actual work
performed in each call, this might be interesting.
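
To illustrate the "fairly lightweight" and "start-up time is low" points
above, here is a minimal sketch of how a remote JVM might serve calls from
pooled threads. The class name RemoteJvmWorkers and its methods are
hypothetical and are not taken from either Pl/Java implementation. It also
simplifies the model: the pool hands each call to any idle thread rather
than binding one thread to one connection.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: calls arriving from backend connections are served
// by a small pool of reusable threads inside one shared JVM.
final class RemoteJvmWorkers {
    private final ExecutorService pool;

    RemoteJvmWorkers(int threads) {
        // Worker threads are reused across calls, so a new connection pays
        // for a task hand-off rather than for booting a JVM.
        this.pool = Executors.newFixedThreadPool(threads);
    }

    // Runs one function call on a pooled thread and waits for its result.
    <T> T invokeAndWait(Callable<T> call) {
        try {
            return pool.submit(call).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("function call failed", e.getCause());
        }
    }

    void shutdown() {
        pool.shutdown();
    }

    public static void main(String[] args) {
        RemoteJvmWorkers workers = new RemoteJvmWorkers(4);
        for (int conn = 1; conn <= 3; conn++) {
            // Each simulated connection reports which pooled thread served it.
            String servedBy = workers.invokeAndWait(() -> Thread.currentThread().getName());
            System.out.println("connection " + conn + " served by " + servedBy);
        }
        workers.shutdown();
    }
}
```

The sketch leaves out everything that makes the remote approach hard, such
as the transport between backend and JVM and the transaction context.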

#Cons:#
- RPC calls are slow
Calls between processes are inherently very slow compared to in-process
calls.

- RPC resources needed
Each connection will need an additional socket or shared memory segment.

- Transaction visibility
A connection established in the remote JVM must have the same transaction
visibility as the invoker. In essence, a transaction context must be
propagated to the remote JVM, or the remote JVM must have a JDBC driver that
calls back into the backend.

- RPC management
CORBA or some other mechanism must be installed and maintained.

- Starting/Stopping JVM affects all connections
Attaching a debugger or generating profiling information implies a restart
of the JVM, killing all existing connections that make use of
Pl/Java_remote. Code migration implies full stop + restart (The JSR121
Isolation API didn't make it into the 1.5 release).

- Complex programming environment
A potential developer of the code base has a lot to learn. The API between
backend and Java code is non-standard.
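
The "RPC calls are slow" point above can be probed with a crude timing
sketch like the one below. It is only an assumption-laden stand-in: the
class CallOverhead is hypothetical, a loopback TCP socket with a second
thread plays the role of the separate JVM process, and neither Pl/Java
implementation is involved. The JIT may also optimize the in-process loop
heavily, so the printed numbers are indicative at best.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical sketch comparing plain method calls with request/response
// round trips over a loopback socket.
final class CallOverhead {
    // The "function" invoked in both variants.
    static int addOne(int x) {
        return x + 1;
    }

    // In-process variant: ordinary method invocations.
    static int inProcess(int calls) {
        int v = 0;
        for (int i = 0; i < calls; i++) {
            v = addOne(v);
        }
        return v;
    }

    // Remote stand-in: every call is one round trip over a loopback socket.
    static int overSocket(int calls) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread remote = new Thread(() -> {
                try (Socket s = server.accept();
                     DataInputStream in = new DataInputStream(s.getInputStream());
                     DataOutputStream out = new DataOutputStream(s.getOutputStream())) {
                    s.setTcpNoDelay(true);
                    for (int i = 0; i < calls; i++) {
                        out.writeInt(addOne(in.readInt()));
                        out.flush();
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            remote.setDaemon(true);
            remote.start();
            try (Socket s = new Socket(server.getInetAddress(), server.getLocalPort());
                 DataOutputStream out = new DataOutputStream(s.getOutputStream());
                 DataInputStream in = new DataInputStream(s.getInputStream())) {
                s.setTcpNoDelay(true);
                int v = 0;
                for (int i = 0; i < calls; i++) {
                    out.writeInt(v);   // send the argument
                    out.flush();
                    v = in.readInt();  // wait for the result
                }
                return v;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int calls = 10_000;
        long t0 = System.nanoTime();
        inProcess(calls);
        long t1 = System.nanoTime();
        overSocket(calls);
        long t2 = System.nanoTime();
        System.out.println("in-process:  " + (t1 - t0) / calls + " ns/call");
        System.out.println("over socket: " + (t2 - t1) / calls + " ns/call");
    }
}
```

A fair comparison of the two Pl/Java implementations would of course have
to measure the real code paths, including argument marshalling and
callbacks into the backend.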

** 3. Choose both and agree on the SQL + Java semantics **
#Pros:#
- Best of two worlds
The user can decide, depending on his/her setup, thus gaining optimal
performance.

- Everyone wins
Nobody needs to feel sad when their implementation was rejected.

#Cons:#
- Might be perceived as a kludge
The competitors don't need multiple implementations. Introducing two ways of
doing it might be perceived as a way to get around a less than perfect
design, with uncertainty and the choice of another database as the result.

- The choice is not evident
The user has to make a choice. Sometimes the choice is not evident.

- Project synchronization
Someone needs to synchronize the projects.

- Double effort
Almost everything needs to be developed twice since the approaches have
fundamental differences.

** 4. Make the postmaster spawn threads rather than processes **
I know this is very controversial and perhaps I should not bring it up at
all. But then again, why not? Most readers are open-minded, right?

#Pros:#
- Really best of two worlds
There would be one JVM per postmaster and in-process calls would be used
throughout.

- Other pl<lang> could benefit?
Other languages where multithreading is an option could benefit the same way
Java does.

- Other pros
Beyond the scope of the topic.

#Cons:#
- Code rewrite
Right. All PostgreSQL code would need an overhaul. That would be a serious
effort to say the least.

- Code base selection
We'd still need to choose which existing Pl/Java implementation should
be used as the base for the in-process + multithreaded implementation.

- Other cons
Beyond the scope of the topic.

What are the next steps? Setting up benchmarks and testing performance,
perhaps? That should not be done by me, nor by the people behind
Pl/Java_remote, but rather by someone who is truly objective.

Kind regards

Thomas Hallgren
