Search improvements using Sphinx, MySQL and PECL
This is the second edition of my weekly PECL package series. See last week’s post to learn about the Scream extension.
This week’s topic will be on Full-Text searching using Sphinx, specifically with the PHP client extension written by Antony Dovgal and released as a 1.0 PECL package in late January 2009.
Background
Sphinx is an open source full-text search engine. It provides an alternative to MySQL full-text searching. Its main features include high search speed (avg query is under 0.1 sec on 2-4 GB text collections), high scalability (up to 100 GB of text, up to 100 M documents on a single CPU) and most importantly, native support for MySQL (MyISAM and InnoDB) and PostgreSQL . It has also proven its worth considering that it is used on web sites such as Craigslist, Netlog, and The Pirate Bay.
Sphinx Install
There are two methods of using Sphinx in PHP: Using the PHP API or using the native libaries with the PECL package. We will of course be covering the PECL version
Installation a basic version of sphinx is easy:
1 2 3 | mbpro:sphinx-0.9.8.1 chehodgins$ sudo ./configure --prefix /usr/local/share/sphinx --with-mysql /usr/local/share/mysql/ mbpro:sphinx-0.9.8.1 chehodgins$ sudo make mbpro:sphinx-0.9.8.1 chehodgins$ sudo make install |
Next, using the sphinx.conf configuration file a data source and index must be defined. I have added a table named `track` in my MySQL database with 7.8 million track names.
1 2 | mbpro:etc chehodgins$ sudo cp sphinx.conf.dist sphinx.conf mbpro:etc chehodgins$ sudo vi sphinx.conf |
In sphinx.conf:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 | source track { type = mysql sql_host = localhost sql_user = root sql_pass = root sql_db = test sql_port = 3306 sql_sock = /Applications/MAMP/tmp/mysql/mysql.sock sql_query_pre = SET NAMES utf8 # the data to be indexed sql_query = SELECT id, name, length, year FROM track; } index track_index { # document source(s) to index source = track # index files path and file name, without extension # mandatory, path must be writable, extensions will be auto-appended path = /usr/local/share/sphinx/var/data/track_index min_word_len = 1 } |
We can now index our data and start the sphinx server:
1 2 3 4 5 6 7 8 | mbpro:sphinx chehodgins$ sudo bin/indexer track_index mbpro:sphinx chehodgins$ sudo /usr/local/share/sphinx/bin/searchd Sphinx 0.9.8.1-release (r1533) Copyright (c) 2001-2008, Andrew Aksyonoff using config file '/usr/local/share/sphinx/etc/sphinx.conf'... creating server socket on 0.0.0.0:3312 |
The data indexing took 1 minute on 7.8 million rows (204 MB of data) at a speed of 116655.16 docs/sec! Note that indexing should be done on regular intervals, depending on how fresh the data is required to be.
PHP/PECL Install
With our data indexed we must now get access to the Sphinx API. This is done using the Sphinx PECL extension. Before installating the PECL package we must install libsphinxclient, which is included in the Sphinx distribution:
1 2 3 | mbpro:libsphinxclient chehodgins$ cd sphinx-0.9.8.1/api/libsphinxclient/ mbpro:libsphinxclient chehodgins$ LIBTOOLIZE=glibtoolize sudo ./buildconf.sh mbpro:libsphinxclient chehodgins$ sudo ./configure && make install |
Now we are ready to install the PECL package:
1 | mbpro:~ chehodgins$ sudo pecl install sphinx |
Now it must be added to php.ini:
1 | extension=sphinx.so |
Restart apache and check that it is installed:
Now it’s simply a matter of using the Sphinx function reference on php.net to query your dataset.
1 2 3 4 5 6 7 8 9 10 11 12 13 | <?php $sphinx = new SphinxClient(); $sphinx->setServer("localhost", 3312); $sphinx->setMatchMode(SPH_MATCH_ALL); $sphinx->setMaxQueryTime(500); // Limit query to 500 milliseconds $sphinx->setLimits(0, 10, 1000); // return first 10 results $result = $sphinx->query('Ride the Lightning'); var_dump($result['matches']); echo $result['total_found'] . ' total results found.'; ?> |
Thanks to the Sphinx log, you can see that the query executed in .042 seconds:
1 | [Sat Apr 18 01:33:58.878 2009] 0.042 sec [all/0/rel 160 (0,10)] [*] Ride the Lightning |
Conclusion
The example was kept simple, but queries can be refined even more using SQL-like methods of the Sphinx API. Notably, setGroupBy() will do the equivalent of GROUP BY and ORDER BY. Also, setFilter() will add extra filtering on other columns in the dataset.
This is the tip of the iceberg of the different ways that Sphinx can be used. The easy integration with MySQL combined with the ease of setup make it a logical next step when MySQL’s Full-Text indexing performance degrades. It also appears capable of scaling to the needs of the top-tiered websites out there. As such, I would seriously consider Sphinx when looking for solutions to your searching needs.
Finally, it would be worthwhile to explore alternatives such as Lucene (Java), Solr (Java), and Marjory (PHP).

