Che Hodgins // Musings on Web Development

Search improvements using Sphinx, MySQL and PECL

Saturday, April 18, 2009pecl, php, sphinx

This is the second edition of my weekly PECL package series. See last week’s post to learn about the Scream extension.

This week’s topic will be on Full-Text searching using Sphinx, specifically with the PHP client extension written by Antony Dovgal and released as a 1.0 PECL package in late January 2009.

Background

Sphinx is an open source full-text search engine. It provides an alternative to MySQL full-text searching. Its main features include high search speed (avg query is under 0.1 sec on 2-4 GB text collections), high scalability (up to 100 GB of text, up to 100 M documents on a single CPU) and most importantly, native support for MySQL (MyISAM and InnoDB) and PostgreSQL . It has also proven its worth considering that it is used on web sites such as Craigslist, Netlog, and The Pirate Bay.

Sphinx Install

There are two methods of using Sphinx in PHP: Using the PHP API or using the native libaries with the PECL package. We will of course be covering the PECL version :)

Installation a basic version of sphinx is easy:

1
2
3
mbpro:sphinx-0.9.8.1 chehodgins$ sudo ./configure --prefix /usr/local/share/sphinx --with-mysql /usr/local/share/mysql/
mbpro:sphinx-0.9.8.1 chehodgins$ sudo make
mbpro:sphinx-0.9.8.1 chehodgins$ sudo make install

Next, using the sphinx.conf configuration file a data source and index must be defined. I have added a table named `track` in my MySQL database with 7.8 million track names.

1
2
mbpro:etc chehodgins$ sudo cp sphinx.conf.dist sphinx.conf
mbpro:etc chehodgins$ sudo vi sphinx.conf

In sphinx.conf:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
source track
{
type                                    = mysql

sql_host                                = localhost
sql_user                                = root
sql_pass                                = root
sql_db                                  = test
sql_port                                = 3306

sql_sock                                = /Applications/MAMP/tmp/mysql/mysql.sock
sql_query_pre                 = SET NAMES utf8

# the data to be indexed
sql_query  = SELECT id, name, length, year FROM track;

}

index track_index
{
# document source(s) to index
source                  = track

# index files path and file name, without extension
# mandatory, path must be writable, extensions will be auto-appended
path                    = /usr/local/share/sphinx/var/data/track_index

min_word_len            = 1
}

We can now index our data and start the sphinx server:

1
2
3
4
5
6
7
8
mbpro:sphinx chehodgins$ sudo bin/indexer track_index
mbpro:sphinx chehodgins$ sudo /usr/local/share/sphinx/bin/searchd

Sphinx 0.9.8.1-release (r1533)
Copyright (c) 2001-2008, Andrew Aksyonoff

using config file '/usr/local/share/sphinx/etc/sphinx.conf'...
creating server socket on 0.0.0.0:3312

The data indexing took 1 minute on 7.8 million rows (204 MB of data) at a speed of 116655.16 docs/sec! Note that indexing should be done on regular intervals, depending on how fresh the data is required to be.

PHP/PECL Install

With our data indexed we must now get access to the Sphinx API. This is done using the Sphinx PECL extension. Before installating the PECL package we must install libsphinxclient, which is included in the Sphinx distribution:

1
2
3
mbpro:libsphinxclient chehodgins$ cd sphinx-0.9.8.1/api/libsphinxclient/
mbpro:libsphinxclient chehodgins$ LIBTOOLIZE=glibtoolize sudo ./buildconf.sh
mbpro:libsphinxclient chehodgins$ sudo ./configure && make install

Now we are ready to install the PECL package:

1
mbpro:~ chehodgins$ sudo pecl install sphinx

Now it must be added to php.ini:

1
extension=sphinx.so

Restart apache and check that it is installed:

Sphinx in phpinfo

Sphinx in phpinfo

Now it’s simply a matter of using the Sphinx function reference on php.net to query your dataset.

1
2
3
4
5
6
7
8
9
10
11
12
13
<?php

$sphinx = new SphinxClient();
$sphinx->setServer("localhost", 3312);
$sphinx->setMatchMode(SPH_MATCH_ALL);
$sphinx->setMaxQueryTime(500); // Limit query to 500 milliseconds
$sphinx->setLimits(0, 10, 1000); // return first 10 results

$result = $sphinx->query('Ride the Lightning');
var_dump($result['matches']);

echo $result['total_found'] . ' total results found.';
?>

Thanks to the Sphinx log, you can see that the query executed in .042 seconds:

1
[Sat Apr 18 01:33:58.878 2009] 0.042 sec [all/0/rel 160 (0,10)] [*] Ride the Lightning

Conclusion

The example was kept simple, but queries can be refined even more using SQL-like methods of the Sphinx API. Notably, setGroupBy() will do the equivalent of GROUP BY and ORDER BY. Also, setFilter() will add extra filtering on other columns in the dataset.

This is the tip of the iceberg of the different ways that Sphinx can be used. The easy integration with MySQL combined with the ease of setup make it a logical next step when MySQL’s Full-Text indexing performance degrades. It also appears capable of scaling to the needs of the top-tiered websites out there. As such, I would seriously consider Sphinx when looking for solutions to your searching needs.

Finally, it would be worthwhile to explore alternatives such as Lucene (Java), Solr (Java), and Marjory (PHP).

  • Digg
  • del.icio.us
  • Facebook
  • Reddit
  • StumbleUpon
  • TwitThis

Tags: , ,

blog comments powered by Disqus