Sphinx indexing problem

From: Mladen Gogala <mladen(dot)gogala(at)vmsinfo(dot)com>
To: "pgsql-novice(at)postgresql(dot)org" <pgsql-novice(at)postgresql(dot)org>
Subject: Sphinx indexing problem
Date: 2010-05-23 22:36:34
Message-ID: 4BF9ADF2.7030200@vmsinfo.com
Lists: pgsql-novice

I am trying to create a Sphinx index on a fairly large Postgres table. My problem is that the Postgres client API tries to pull the entire result set into memory:
    [root(at)medo etc]# ../bin/indexer --all
    Sphinx 0.9.9-release (r2117)
    Copyright (c) 2001-2009, Andrew Aksyonoff

    using config file '/usr/local/etc/sphinx.conf'...
    indexing index 'test1'...
    ERROR: index 'test1': sql_query: out of memory for query result
    (DSN=pgsql://news:***(at)medo:5432/news).
    total 0 docs, 0 bytes
    total 712.593 sec, 0 bytes/sec, 0.00 docs/sec
    total 0 reads, 0.000 sec, 0.0 kb/call avg, 0.0 msec/call avg
    total 0 writes, 0.000 sec, 0.0 kb/call avg, 0.0 msec/call avg
Corresponding log entries on the Postgres side are:
    STATEMENT:  SELECT segment_id,air_date,start_time,end_time,source_type,
        market_name,station_name,program_name, content_text FROM news_segments
    LOG:  unexpected EOF on client connection
    LOG:  unexpected EOF on client connection
    LOG:  unexpected EOF on client connection
The Postgres message isn't exactly helpful, but under the circumstances it can't be: the problem is on the client side. As far as I can tell, libpq builds the complete result set in client memory before handing any rows to the application. The table I am using is pretty large, with 14.3 million rows:
    news=> select count(*) from news_segments;
      count
    ----------
     14366286
    (1 row)

    Time: 233759.639 ms
Is there anything I can do to prevent the API from attempting to put the entire query result in memory? I could partition the table, create separate indexes, and merge them, but that would be a lot of unnecessary maintenance. I also suspect that other queries with large result sets will start to fail.
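What I would really like is for the driver to stream the rows instead of buffering them, for instance by wrapping the query in a server-side cursor and fetching in batches. A rough sketch of what I mean (the cursor name and batch size here are arbitrary):

    BEGIN;
    -- seg_cur is an arbitrary name for this sketch
    DECLARE seg_cur NO SCROLL CURSOR FOR
        SELECT segment_id, air_date, start_time, end_time, source_type,
               market_name, station_name, program_name, content_text
        FROM news_segments;
    -- repeat the FETCH until it returns zero rows
    FETCH FORWARD 10000 FROM seg_cur;
    CLOSE seg_cur;
    COMMIT;

As far as I can tell, though, the Sphinx pgsql driver just runs sql_query as a single plain SELECT, so this would mean patching the indexer.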
I temporarily solved my problem by using the "range query" option offered by Sphinx:
    sql_query_range = \
        SELECT min(segment_id),max(segment_id) FROM news_segments
    sql_range_step  = 10000
    sql_query       = \
        SELECT segment_id,air_date,start_time,end_time,source_type, \
               market_name,station_name,program_name, segment_text \
        FROM news_segments \
        WHERE segment_id >= $start AND segment_id < $end
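If I understand the ranged-query mode correctly, the indexer first runs sql_query_range once to get the minimum and maximum segment_id, then walks that interval in sql_range_step increments, substituting the bounds of the current window for $start and $end in sql_query. With about 14.3 million rows and a step of 10,000, that works out to something on the order of 1,400 windowed SELECTs, assuming the IDs are reasonably dense.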
Segment_id is a numeric field, and executing the query that many times is less than optimal. It does make the thing work, though. Would it be possible to set a maximum amount of memory for query result caching from the API itself? How can I increase the maximum memory size used by the client API?
--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com

