GSoC application: MADlib k-medoids clustering

From: Maxence Ahlouche <maxence(dot)ahlouche(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: GSoC application: MADlib k-medoids clustering
Date: 2014-03-18 23:45:10
Message-ID: CAJeaomV+mxkiYfWwYBOefbpJPe2JUKqysuA=5A7WB0PvPw_uGw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi all,

Some of you may remember me from last year: I had applied to implement the
k-medoids clustering method in MADlib, a Postgres and GreenPlum library.
As this project could not be selected last year, I'm trying again now!

You can find my application at this address:
https://github.com/viodlen/gsoc_2014
I've also pasted it at the end of this email.

As for me: I'm Maxence Ahlouche, a French student in computer science. I've
been studying IT for almost 4 years now, the first two of them being in a
technical degree. I'm currently in an engineering school, and am supposed
to graduate next year.

However, there's a problem with my application: the GSoC is supposed to
begin on may 19, but I have lectures and exams until at least June 4, and
the official end of my school year is on June 13. The period between June 4
and June 14 is intended for students to finish their school projects, if
they haven't already.
In the worst case, I won't have time to do anything significant for the
first 4 weeks.

Still, I think I'm able to handle this situation, and this worst case is
quite unlikely. Between May 19 and June 4, I'll have lectures and exams,
but it should still be a quiet time, and I'll have time to work on my GSoC
project the evenings and week-ends. After June 4, I'll only have my school
projects to finish and present, and I think I'll be able to spend most of
my time on GSoC.
I've taken it into account in my proposal's schedule, that's why the first
part takes almost as long as the second, even though it looks easier to me.

Of course, if the community doesn't want to take this risk, I'll understand.

Regards,
Maxence Ahlouche

My proposal:

> GSoC 2014 proposal: implementing clustering algorithms in MADlib
> ================================================================
>
> Synopsis
> --------
>
> This project aims to implement some clustering algorithms in MADlib,
> which is a data analytics and machine learning library for PostgreSQL,
> Greenplum and HAWQ.
>
> Benefits to the PostgreSQL community
> ------------------------------------
>
> Currently, only the k-means clustering algorithm is implemented in
> MADlib (see the doc:
> http://doc.madlib.net/latest/group__grp__clustering.html ). The
> k-medoids algorithm, while being computationnally more intensive, is
> much less sensitive to outliers (points that don't belong obviously to
> one cluster or another). This is interesting on noisy datasets, that's
> why I'm planning to implement it during the first part of the GSoC.
>
> Still, these algorithms are based on distance computation, therefore
> they can only find convex clusters. That's why I'm proposing to
> implement the OPTICS (*ordering points to identify the clustering
> structure*, see http://en.wikipedia.org/wiki/OPTICS_algorithm ), which
> addresses this issue, as the second part of this GSoC project.
>
> The PostgreSQL community would benefit from these features, as it
> would make available clustering algorithms more powerful than simple
> k-means.
>
> Project details
> ---------------
>
> k-medoids
> """""""""
>
> The first goal of this project is to implement the k-medoids
> clustering algorithm. For this, I'll first spend some time studying
> the k-means algorithm, as both will probably be pretty similar. This
> will also allow me to get familiar with the codebase, the conventions,
> the data structures I'll need, etc.
>
> Then I'll implement, test and debug the algorithm. If relevant, I'll
> also provide a "k-medoids++" version, which, similarly to the
> k-means++ function in MADlib, will chose the initial centroids
> depending on the dataset, instead of chosing them randomly. This
> allows to detect small clusters located far from the others (which are
> usually detected as part of an other bigger cluster using the standard
> algorithm).
>
> The final step would be to refactor the code from k-means and
> k-medoids to remove any code duplication introduced in this first
> part.
>
> OPTICS
> """"""
>
> The second part of this project would be to implement the
> density-based clustering algorithm OPTICS, which would overcome the
> main problem of both the k-means and k-medoids algorithm: non-convex
> clusters. This algorithm has been preferred over DBSCAN
> (http://en.wikipedia.org/wiki/DBSCAN ) as it is able to detect
> clusters of different densities, and, consequently, overlapping
> clusters.
>
> I'll first take some time to understand full well the algorithm, and
> make a prototype in Python, to be sure I know how it works. Then I'll
> actually implement it, test it, and debug it in MADlib.
>
> If, after that, any time's left, I'll consider implementing some
> of the improvements of k-means and k-medoids that we can find in the
> litterature.
>
> Deliverables
> ------------
>
> * the k-medoids algorithm in MADlib;
> * the OPTICS algorithm, also in MADlib;
> * optionnally, some improvements on k-means and/or k-medoids.
>
> Project Schedule
> ----------------
>
> #. Implementation of the k-medoids algorithm: from 19/05 to 30/05
> #. Tests, debug and doc of k-medoids: from 31/05 to 13/06
> #. Prototype of OPTICS in Python: from 14/06 to 18/06
> #. Actual implementation of OPTICS in MADlib: from 19/06 to 25/06
> #. Tests, debug and doc of OPTICS: from 25/06 to 11/06
>

--
Maxence Ahlouche
06 06 66 97 00

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2014-03-19 00:30:01 Re: pg_archivecleanup bug
Previous Message Tom Lane 2014-03-18 23:41:32 Re: Leaking regexp_replace in 9.3.1 ? (was: [HACKERSUninterruptable regexp_replace in 9.3.1 ?)