Che Hodgins // Musings on Web Development

Archive for the ‘pecl’ Category

Free and Fast Geolocation in PHP

Geo* (as I call them) are the web technologies that provide a link between online content and Earth’s geography. Examples include Geocoding (finding latitude/longitude based on street addresses), Geotagging (tagging media with latitude/longitude coordinates), and Geolocation (finding the latitude/longitude of a computer).

Geolocation is a particularly cool technique because it lets you estimate a person’s geographic location and, among other things, provide a custom-tailored experience on your website. This can be as useful as it can be annoying. There are several methods of Geolocation, some as simple as asking users where they are located. This article focuses on adding IP-based Geolocation to your PHP website for free while keeping it fast.

Problems

If IP addresses are to be used to determine a person’s physical location, then a few possible problems come to mind:

  • How accurate is the mapping between an IP address and a geographical location?
    • From maxmind.com’s Geolocation service: “99.8% accurate on a country level, 90% accurate on a state level, and 83% accurate for the US within a 25 mile radius.” From some research, the matching is done either by using the address of the ISP that owns the IP [link], or by buying data from websites that ask for their users’ locations [link].
  • What about users behind proxies?
    • Some Geolocation databases flag the IPs of potential anonymous proxy servers.
    • Many proxy servers send X-Forwarded-For and Client-IP headers that you can use, although these are set by the client or proxy and can be forged.

This is not perfect, but in many cases the approximate geographical location of a user can be inferred.
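
As a rough sketch of the proxy point above, the function below picks a candidate visitor IP from the request headers before falling back to REMOTE_ADDR. The name guess_client_ip() and the rule of taking the first public address in X-Forwarded-For are my own assumptions, not something the GeoIP library prescribes. Since these headers can be forged, treat the result as a hint only.

```php
<?php
/**
 * Hypothetical helper: guess the visitor's IP from a $_SERVER-style array.
 * Prefers the first public address listed in X-Forwarded-For, then Client-IP,
 * and finally falls back to REMOTE_ADDR.
 */
function guess_client_ip(array $server): ?string
{
    // Only accept syntactically valid, non-private, non-reserved addresses.
    $flags = FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE;

    foreach (['HTTP_X_FORWARDED_FOR', 'HTTP_CLIENT_IP'] as $header) {
        if (empty($server[$header])) {
            continue;
        }
        // X-Forwarded-For may hold a comma-separated chain: client, proxy1, proxy2
        foreach (explode(',', $server[$header]) as $candidate) {
            $candidate = trim($candidate);
            if (filter_var($candidate, FILTER_VALIDATE_IP, $flags) !== false) {
                return $candidate;
            }
        }
    }

    // No usable proxy header: fall back to the address of the direct peer.
    $remote = $server['REMOTE_ADDR'] ?? '';
    return filter_var($remote, FILTER_VALIDATE_IP) !== false ? $remote : null;
}

// Usage: $ip = guess_client_ip($_SERVER);
```

The returned address can then be handed to whichever lookup function you use.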

Demo time

This demo will use the free Geolocation database provided by Maxmind.com. I believe this is the ideal choice for normal (i.e. not Facebook) websites for several reasons:

  • It is free (there is a paid version with higher accuracy)
  • It is fast. They report up to 1 million queries per second on 1 machine.
  • It is extensible. The database can be upgraded to the paid version by just replacing the binary.
  • They like developers. They provide implementations in over 10 different programming languages, with benchmarks.
  • Their website is full of valuable information. They provide benchmarks, an explanation of how they collect their data, and more. I haven’t seen this with any other IP Geolocation services.

There are two options for us PHP developers: the pure PHP library, or a PECL package that wraps the C library. For reasons discussed below, the PECL package will be used. If you do not want to use a PECL package or are on a hosted server, then you can download the pure PHP classes here.

First, the GeoIP C library must be downloaded (link) and installed. Note that it can be installed on Windows as well. No special options are needed to install it:

mbpro:GeoIP-1.4.6 chehodgins$ sudo ./configure
mbpro:GeoIP-1.4.6 chehodgins$ sudo make
mbpro:GeoIP-1.4.6 chehodgins$ sudo make install



Then the PECL package can be installed:

mbpro:~ chehodgins$ sudo pecl install geoip
downloading geoip-1.0.7.tar ...
...
You should add "extension=geoip.so" to php.ini



Next, add this extension to php.ini (i.e. extension=geoip.so), restart Apache, and check phpinfo():

GeoIP in phpinfo()



The final step before writing code is to download the actual database. It is updated monthly, so remember to stay up to date. The directory that should contain the file is OS dependent, so create a quick PHP script to see where the directory is:

<?php
ini_set('display_errors', true);
error_reporting(E_ALL | E_STRICT);
$result = geoip_record_by_name('72.30.81.165');
var_dump($result);



Gives us:

Determine the binary directory


Now save the binary to the directory mentioned in the PHP warning and reload your script; the warning should disappear. Let’s try again with some more code:

<?php
ini_set('display_errors', true);
error_reporting(E_ALL | E_STRICT);

$functions = get_extension_funcs('geoip');
var_dump($functions);

$result = geoip_record_by_name('72.30.81.165');
var_dump($result);



Gives:

array
0 => string 'geoip_database_info' (length=19)
1 => string 'geoip_country_code_by_name' (length=26)
2 => string 'geoip_country_code3_by_name' (length=27)
3 => string 'geoip_country_name_by_name' (length=26)
4 => string 'geoip_continent_code_by_name' (length=28)
5 => string 'geoip_org_by_name' (length=17)
6 => string 'geoip_record_by_name' (length=20)
7 => string 'geoip_id_by_name' (length=16)
8 => string 'geoip_region_by_name' (length=20)
9 => string 'geoip_isp_by_name' (length=17)
10 => string 'geoip_db_avail' (length=14)
11 => string 'geoip_db_get_all_info' (length=21)
12 => string 'geoip_db_filename' (length=17)
13 => string 'geoip_region_name_by_code' (length=25)
14 => string 'geoip_time_zone_by_country_and_region' (length=37)

array
'continent_code' => string 'NA' (length=2)
'country_code' => string 'US' (length=2)
'country_code3' => string 'USA' (length=3)
'country_name' => string 'United States' (length=13)
'region' => string 'CA' (length=2)
'city' => string 'Sunnyvale' (length=9)
'postal_code' => string '94089' (length=5)
'latitude' => float 37.4249000549
'longitude' => float -122.007400513
'dma_code' => int 807
'area_code' => int 408



With only an IP address we can easily get the country, postal code, longitude and latitude, and even the area code of the user.
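
To make the record easier to use in application code, a small wrapper can reduce it to the handful of fields you care about. This is only a sketch: locate_ip() and the keys it returns are names I made up for illustration, and the optional $lookup argument exists so that a stub (or the pure PHP library) can stand in for geoip_record_by_name().

```php
<?php
/**
 * Hypothetical wrapper around geoip_record_by_name() that returns only the
 * fields used for tailoring content, or null when nothing is known.
 */
function locate_ip(string $ip, ?callable $lookup = null): ?array
{
    if ($lookup === null) {
        if (!function_exists('geoip_record_by_name')) {
            return null; // PECL geoip extension is not loaded
        }
        $lookup = 'geoip_record_by_name';
    }

    // The lookup returns an array on success and false when the address
    // is not found in the database.
    $record = $lookup($ip);
    if (!is_array($record)) {
        return null;
    }

    return [
        'country' => $record['country_code'] ?? null,
        'region'  => $record['region'] ?? null,
        'city'    => $record['city'] ?? null,
        'lat'     => isset($record['latitude']) ? (float) $record['latitude'] : null,
        'lon'     => isset($record['longitude']) ? (float) $record['longitude'] : null,
    ];
}

// Usage: $location = locate_ip('72.30.81.165');
```

Returning null for unknown addresses keeps the calling code honest about the fact that a lookup can fail.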

Performance

I initially thought that the PECL version would outperform the pure PHP version by a small percentage. I was wrong. The PECL version was much faster. Here are some informal benchmarks.

Library      Iterations   Total     Avg per request   Notes
PECL GeoIP   10,000       0.7s      .007ms
Pure PHP     10,000       49.2s     4.92ms
PECL GeoIP   1            0.08ms    0.08ms            Typical real world usage
Pure PHP     1            2.4ms     2.4ms             Typical real world usage

As a validation of my results I benchmarked the pure PHP library being used in a web application and had comparable results to my benchmarks (5.9ms per IP lookup versus the 2.4ms above).
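
For anyone who wants to repeat this kind of informal measurement, a timing loop along the following lines should be enough. It is a sketch rather than the script behind the numbers above; benchmark_lookup() is a hypothetical name, and it accepts any callable so that either library can be plugged in.

```php
<?php
/**
 * Hypothetical timing helper: calls $lookup once per iteration and returns
 * the total and average wall-clock time in milliseconds.
 */
function benchmark_lookup(callable $lookup, string $ip, int $iterations): array
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $lookup($ip);
    }
    // microtime(true) returns seconds as a float; convert to milliseconds.
    $totalMs = (microtime(true) - $start) * 1000;

    return [
        'iterations' => $iterations,
        'total_ms'   => $totalMs,
        'avg_ms'     => $iterations > 0 ? $totalMs / $iterations : 0.0,
    ];
}

// Usage with the PECL extension loaded:
// print_r(benchmark_lookup('geoip_record_by_name', '72.30.81.165', 10000));
```

Keep in mind that a loop like this measures repeated lookups in one process, which is why the single-lookup rows are closer to real-world page requests.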

Conclusion

Because of the ease of implementation, the low cost, and the minimal performance cost, there is much to be gained by adding IP Geolocation to your web application. The PECL package is the ideal configuration because it provides a faster experience with less code to maintain. The pure PHP library is nonetheless still relatively fast and thus still worth using.

This is still far from a perfect solution. False positives can occur, anonymous proxies mess everything up, and IP addresses are constantly changing. Also, what about users who simply do not want to share their location? There are privacy issues. This is currently a hot topic: the W3C geolocation API is being actively worked on, and Mozilla, Opera and others are working to improve location awareness on the web, something I am looking forward to.

More reading:

Interesting Geolocation presentation
GeoIP functions in the PHP manual
Cool Geo* stuff at Y!


Sorting out your PHP includes using Inclued

April 27, 2009 | pecl, php

This is the third edition of my weekly PECL package series. Check out my Scream article as well as my Sphinx article to learn about these extensions.

If you have ever inherited spaghetti code or worse, written spaghetti code, the following article is for you. This article is an introduction to the Inclued PECL extension. It helps answer the common question “Where is this include coming from?”, something that I’ve asked myself before when working on some projects.

This extension works by overriding an opcode in Zend, allowing it to log information regarding which files are being included, and from where. This information can be collected using a single function named inclued_get_data() or by setting inclued.dumpdir in php.ini to dump the data of each request.

The final step involves graphing this data to get a view of the include hierarchy. This can be done by converting the JSON encoded output into a dot language file, and then converting it to an image or viewing it with an application such as Graphviz.
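
The package ships a script for this conversion (used below), but the idea is simple enough to sketch by hand. The function here is hypothetical, and it assumes each include record carries a 'fromfile' key (the including file) and an 'opened_path' key (the included file); check a real dump before relying on those names.

```php
<?php
/**
 * Hypothetical converter: turn a list of include records into Graphviz dot.
 * Assumes each record has 'fromfile' and 'opened_path' keys; adjust the
 * keys if your dump differs.
 */
function includes_to_dot(array $includes): string
{
    $edges = [];
    foreach ($includes as $inc) {
        if (!isset($inc['fromfile'], $inc['opened_path'])) {
            continue; // skip records we do not understand
        }
        // Base names keep the graph readable, at the cost of merging files
        // that share a name in different directories. Keying by the edge
        // text removes duplicates.
        $edge = sprintf(
            '"%s" -> "%s";',
            basename($inc['fromfile']),
            basename($inc['opened_path'])
        );
        $edges[$edge] = true;
    }

    return "digraph includes {\n    " . implode("\n    ", array_keys($edges)) . "\n}\n";
}
```

The resulting string can be written to a .dot file and rendered with the dot command.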

To start, we need to install the inclued PECL extension:

mbpro:~ chehodgins$ sudo pecl install inclued-alpha
downloading inclued-0.1.0.tar ...
[...]
install ok: channel://pecl.php.net/inclued-0.1.0

Then add it to php.ini and restart Apache:

extension=inclued.so
inclued.enabled=1
inclued.dumpdir=/tmp

Next, in our web browser we load the page that we wish to analyze. A file named inclued.*.* will be added to /tmp. We will convert this to a dot file using the gengraph.php script that is included in the PECL package:

mbpro:tmp chehodgins$ php /usr/local/lib/php/gengraph.php -i inclued.00196.2
Written inclued.out.dot...
To generate images: dot -Tpng -o inclued.png inclued.out.dot

Now we have the choice of either creating a PNG using the dot command or simply opening the file with Graphviz.

mbpro:tmp chehodgins$ dot -Tpng -o ~/Documents/inclued.png inclued.out.dot

And a super nice graph of the includes is generated as an image. Here is the graph of the includes in WordPress:

Inclued run on WordPress

Notice that there are a lot of includes, but in general there appears to be order. Now let’s check out osCommerce:

Inclued on osCommerce

This also looks decent. What about Magento?

Inclued in Magento

Holy crap, that’s a lot of includes!

In conclusion, the inclued PECL extension can be useful in many situations, from trying to understand how and
why a file is being included, to reorganizing your includes by seeing the dependencies. If anything, it can be an
easy way to show off your application/framework’s include structure.


Search improvements using Sphinx, MySQL and PECL

April 18, 2009 | pecl, php, sphinx

This is the second edition of my weekly PECL package series. See last week’s post to learn about the Scream extension.

This week’s topic will be on Full-Text searching using Sphinx, specifically with the PHP client extension written by Antony Dovgal and released as a 1.0 PECL package in late January 2009.

Background

Sphinx is an open source full-text search engine. It provides an alternative to MySQL full-text searching. Its main features include high search speed (the average query takes under 0.1 sec on 2-4 GB text collections), high scalability (up to 100 GB of text, up to 100 M documents on a single CPU) and, most importantly, native support for MySQL (MyISAM and InnoDB) and PostgreSQL. It has also proven its worth, considering that it is used on websites such as Craigslist, Netlog, and The Pirate Bay.

Sphinx Install

There are two methods of using Sphinx in PHP: using the PHP API, or using the native libraries with the PECL package. We will of course be covering the PECL version :)

Installing a basic version of Sphinx is easy:

mbpro:sphinx-0.9.8.1 chehodgins$ sudo ./configure --prefix /usr/local/share/sphinx --with-mysql /usr/local/share/mysql/
mbpro:sphinx-0.9.8.1 chehodgins$ sudo make
mbpro:sphinx-0.9.8.1 chehodgins$ sudo make install

Next, using the sphinx.conf configuration file a data source and index must be defined. I have added a table named `track` in my MySQL database with 7.8 million track names.

mbpro:etc chehodgins$ sudo cp sphinx.conf.dist sphinx.conf
mbpro:etc chehodgins$ sudo vi sphinx.conf

In sphinx.conf:

source track
{
type                                    = mysql

sql_host                                = localhost
sql_user                                = root
sql_pass                                = root
sql_db                                  = test
sql_port                                = 3306

sql_sock                                = /Applications/MAMP/tmp/mysql/mysql.sock
sql_query_pre                 = SET NAMES utf8

# the data to be indexed
sql_query  = SELECT id, name, length, year FROM track;

}

index track_index
{
# document source(s) to index
source                  = track

# index files path and file name, without extension
# mandatory, path must be writable, extensions will be auto-appended
path                    = /usr/local/share/sphinx/var/data/track_index

min_word_len            = 1
}

We can now index our data and start the sphinx server:

mbpro:sphinx chehodgins$ sudo bin/indexer track_index
mbpro:sphinx chehodgins$ sudo /usr/local/share/sphinx/bin/searchd

Sphinx 0.9.8.1-release (r1533)
Copyright (c) 2001-2008, Andrew Aksyonoff

using config file '/usr/local/share/sphinx/etc/sphinx.conf'...
creating server socket on 0.0.0.0:3312

Indexing took 1 minute for 7.8 million rows (204 MB of data), at a speed of 116655.16 docs/sec! Note that indexing should be done at regular intervals, depending on how fresh the data is required to be.

PHP/PECL Install

With our data indexed, we must now get access to the Sphinx API. This is done using the Sphinx PECL extension. Before installing the PECL package we must install libsphinxclient, which is included in the Sphinx distribution:

mbpro:libsphinxclient chehodgins$ cd sphinx-0.9.8.1/api/libsphinxclient/
mbpro:libsphinxclient chehodgins$ LIBTOOLIZE=glibtoolize sudo ./buildconf.sh
mbpro:libsphinxclient chehodgins$ sudo ./configure && sudo make install

Now we are ready to install the PECL package:

mbpro:~ chehodgins$ sudo pecl install sphinx

Now it must be added to php.ini:

extension=sphinx.so

Restart Apache and check that it is installed:

Sphinx in phpinfo

Now it’s simply a matter of using the Sphinx function reference on php.net to query your dataset.

<?php

$sphinx = new SphinxClient();
$sphinx->setServer("localhost", 3312);
$sphinx->setMatchMode(SPH_MATCH_ALL);
$sphinx->setMaxQueryTime(500); // Limit query to 500 milliseconds
$sphinx->setLimits(0, 10, 1000); // return first 10 results

$result = $sphinx->query('Ride the Lightning');
var_dump($result['matches']);

echo $result['total_found'] . ' total results found.';
?>

Thanks to the Sphinx log, you can see that the query executed in 0.042 seconds:

[Sat Apr 18 01:33:58.878 2009] 0.042 sec [all/0/rel 160 (0,10)] [*] Ride the Lightning

Conclusion

The example was kept simple, but queries can be refined even more using SQL-like methods of the Sphinx API. Notably, setGroupBy() will do the equivalent of GROUP BY and ORDER BY. Also, setFilter() will add extra filtering on other columns in the dataset.
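
As an illustration of setFilter(), the sketch below narrows a search to specific years. It rests on an assumption that the configuration shown earlier does not satisfy as written: `year` would need to be declared as an attribute in sphinx.conf (for example with sql_attr_uint = year), since plain full-text fields cannot be filtered. The function name is hypothetical, and the client is passed in so that the function can be tried without a running searchd.

```php
<?php
/**
 * Hypothetical helper: restrict a search to the given release years.
 * $client is expected to be a SphinxClient from the PECL sphinx extension.
 * Assumes `year` is declared as an attribute in sphinx.conf.
 */
function search_tracks_by_year($client, string $terms, array $years)
{
    $client->setFilter('year', array_map('intval', $years)); // keep only these years
    $client->setLimits(0, 10);                               // first 10 matches
    $result = $client->query($terms, 'track_index');

    // query() returns false on failure; getLastError() explains why.
    if ($result === false) {
        throw new RuntimeException('Sphinx query failed: ' . $client->getLastError());
    }

    return $result;
}

// Usage:
// $sphinx = new SphinxClient();
// $sphinx->setServer("localhost", 3312);
// $result = search_tracks_by_year($sphinx, 'Ride the Lightning', [1984]);
```

Checking for a false result matters here because indexing into a failed result would otherwise hide the real error message.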

This is the tip of the iceberg of the different ways that Sphinx can be used. The easy integration with MySQL combined with the ease of setup make it a logical next step when MySQL’s Full-Text indexing performance degrades. It also appears capable of scaling to the needs of the top-tiered websites out there. As such, I would seriously consider Sphinx when looking for solutions to your searching needs.

Finally, it would be worthwhile to explore alternatives such as Lucene (Java), Solr (Java), and Marjory (PHP).


Weekly PECL Package – Scream

April 8, 2009 | pecl, php

This is the first of what is planned to be a weekly post on a more or less random PECL package. The idea is for me to get to know some PECL packages in more detail and for you to get to know some PECL packages in more detail – without losing your precious time.

For the first edition of this series I will cover the relatively new PECL package aptly named Scream. The purpose of this extension is to, well, scream. It disables the silence operator (@) so that any hidden errors will still be shown. After this, you may scream at whoever used the silence operator in the first place, thus the name Scream (just kidding?).

Let’s get started…

che-hodginss-macbook-pro:~ chehodgins$ sudo pecl install scream-alpha

After a few minutes…

Build process completed successfully
Installing '/usr/local/lib/php/extensions/no-debug-non-zts-20060613/scream.so'
install ok: channel://pecl.php.net/scream-0.1.0

Great, now add it to php.ini and restart Apache:

extension=scream.so
scream.enabled=1

Check phpinfo and we are ready to go:

I'm new to macs and just discovered taking screenshots of portions of the screen (Apple key ⌘ + Shift + 4). Very cool.

Now we will borrow some code from some open source projects that use the silence operator and see what happens.

<?php

ini_set('display_errors', 1);
error_reporting(E_ALL | E_STRICT);

echo 'starting... ';

// Initialize
$host = $user = $password = $sock = $port = $errno = $errstr = $response = '';

// From Joomla!
if (!($resource = @mysql_connect( $host, $user, $password, true ))) {
// ...
}

// From Wordpress
$response .= @ fread ( $sock, 8192 );

// From Joomla!
@ dl('bz2.so');

// From Wordpress
$sock = @fsockopen($host, $port, $errno, $errstr);

echo "done.\n";

?>



With scream.enabled = 0 we get this lovely output:

che-hodginss-macbook-pro:www chehodgins$ php -f scream.php
starting... done.
che-hodginss-macbook-pro:www chehodgins$

And with scream.enabled = 1:

che-hodginss-macbook-pro:www chehodgins$ php -f scream.php
starting...
Warning: fread(): supplied argument is not a valid stream resource in /Users/chehodgins/www/scream.php on line 17

Warning: dl(): Unable to load dynamic library '/Applications/MAMP/bin/php5/lib/php/extensions/no-debug-non-zts-20050922/bz2.so' - (null) in /Users/chehodgins/www/scream.php on line 20

Warning: fsockopen() expects parameter 2 to be long, string given in /Users/chehodgins/www/scream.php on line 23
done.

che-hodginss-macbook-pro:www chehodgins$


It should be obvious by now that, in general, it is not advisable to use the silence operator. Most PHP programmers have been burnt by it a few times and tend to be much harsher towards those who think this feature is useful. I can recall spending lots of time bug hunting before finding an @ which led me to a simple error. It’s a painful experience; don’t do it.

As a programmer you may already steer clear of the silence operator, but much code is inherited. Because the silence operator is a single character, it can be hard to track down where it is used in your code. Try searching for ‘@’ in one of your projects: how many thousands of results do you get? That is one reason to install this on your dev box and find those tough bugs before they hit production.
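
A plain text search for ‘@’ also matches e-mail addresses, doc comments and strings. One way around that, sketched below, is to let PHP’s tokenizer do the work: token_get_all() reports the silence operator as a bare '@' token, while comments and strings come back as array tokens. The function name is hypothetical, and this is a starting point rather than a complete audit tool.

```php
<?php
/**
 * Hypothetical scanner: return the line numbers on which the silence
 * operator (@) is used in the given PHP source code.
 */
function find_silence_operators(string $source): array
{
    $lines = [];
    $line = 1;
    foreach (token_get_all($source) as $token) {
        if (is_array($token)) {
            // Array tokens carry their starting line in index 2; advance
            // past any newlines contained in the token text.
            $line = $token[2] + substr_count($token[1], "\n");
        } elseif ($token === '@') {
            // Single-character tokens have no line info, so use the tracked one.
            $lines[] = $line;
        }
    }

    return $lines;
}

// Usage: print_r(find_silence_operators(file_get_contents('scream.php')));
```

Running this over a project gives a list of real uses of the operator, which is far shorter than the raw search results.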

And just in case you are still not convinced, check out “Five reasons the shut-up operator (@) should be avoided” by Derick Rethans.
