Che Hodgins // Musings on Web Development

Monthly Archive for May, 2009

Free and Fast Geolocation in PHP

Geo* (as I call them) are the web technologies that provide a link between online content and Earth’s geography. Examples include Geocoding (finding latitude/longitude based on street addresses), Geotagging (tagging media with latitude/longitude coordinates), and Geolocation (finding latitude/longitude of a computer).

Geolocation is a particularly cool technique because it allows you to estimate a person’s geographic location, thus allowing you to provide a custom-tailored experience on your website, among other things. This can be as useful as it can be annoying. There are several methods of Geolocation, some as simple as asking the user where they are located. This article focuses on adding IP-based Geolocation to your PHP website for free while keeping it fast.

Problems

If IP addresses are to be used to determine a person’s physical location, then a few possible problems come to mind:

  • How accurate is the mapping between an IP address and a geographical location?
    • From maxmind.com’s Geolocation service: “99.8% accurate on a country level, 90% accurate on a state level, and 83% accurate for the US within a 25 mile radius.” From some research, the matching is done either using the address of the ISP that owns the IP [link], or by buying the data from websites that ask for users’ locations [link].
  • What about users behind proxies?
    • Some Geolocation databases flag the IPs of potential anonymous proxy servers.
    • Many proxy servers send X-Forwarded-For or Client-IP headers that you can use, although these headers can be forged.

This is not perfect, but in many cases the approximate geographical location of a user can be inferred.
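As a rough illustration of the proxy-header point above, here is a minimal sketch of picking a client IP from a $_SERVER-style array. The function name guess_client_ip() is hypothetical, and the choice to skip private and reserved ranges is an assumption, not something the original setup prescribes. Because these headers are supplied by the client side, the result should not be trusted for anything security-sensitive.

```php
<?php
// Hypothetical helper: pick the most plausible client IP from a $_SERVER-style array.
function guess_client_ip(array $server): ?string
{
    $candidates = [];

    if (!empty($server['HTTP_X_FORWARDED_FOR'])) {
        // X-Forwarded-For may hold a comma-separated chain: client, proxy1, proxy2
        foreach (explode(',', $server['HTTP_X_FORWARDED_FOR']) as $ip) {
            $candidates[] = trim($ip);
        }
    }
    if (!empty($server['HTTP_CLIENT_IP'])) {
        $candidates[] = trim($server['HTTP_CLIENT_IP']);
    }

    foreach ($candidates as $ip) {
        // Assumption: only syntactically valid, public addresses are useful for a lookup
        $flags = FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE;
        if (filter_var($ip, FILTER_VALIDATE_IP, $flags) !== false) {
            return $ip;
        }
    }

    // Fall back to the address of the directly connected peer
    $remote = $server['REMOTE_ADDR'] ?? '';
    return filter_var($remote, FILTER_VALIDATE_IP) !== false ? $remote : null;
}
```

In a web request this would be called as guess_client_ip($_SERVER), and the result (if not null) passed on to the lookup functions shown later.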

Demo time

This demo will use the free Geolocation database provided by Maxmind.com. I believe this is the ideal choice for normal (i.e. not Facebook) websites for several reasons:

  • It is free (there is a paid version with higher accuracy)
  • It is fast. They report up to 1 million queries per second on 1 machine.
  • It is extensible. The database can be upgraded to the paid version by just replacing the binary.
  • They like developers. They provide implementations in over 10 different programming languages, with benchmarks.
  • Their website is full of valuable information. They provide benchmarks, an explanation of how they collect their data, and more. I haven’t seen this with any other IP Geolocation services.

There are two options for us PHP developers: the pure PHP library or a PECL package that wraps the C library. For reasons that will be discussed below, the PECL package will be used. If you do not want to use a PECL package or are on a hosted server, then you can download the pure PHP classes here.

First, the GeoIP C library must be downloaded (link) and installed. Note that this can be installed on Windows as well. No special options are needed to install it:

mbpro:GeoIP-1.4.6 chehodgins$ sudo ./configure
mbpro:GeoIP-1.4.6 chehodgins$ sudo make
mbpro:GeoIP-1.4.6 chehodgins$ sudo make install



Then the PECL package can be installed:

mbpro:~ chehodgins$ sudo pecl install geoip
downloading geoip-1.0.7.tar ...
...
You should add "extension=geoip.so" to php.ini



Next, add this extension to php.ini (i.e. extension=geoip.so), restart Apache and check out phpinfo():

GeoIP in phpinfo()




The final step before writing code is to download the actual database. It is updated monthly, so remember to stay up to date. The directory that should contain the file is OS dependent, so create a quick PHP script to see where the directory is:

<?php
// Show every warning so the missing-database message is visible
ini_set('display_errors', true);
error_reporting(E_ALL | E_STRICT);
$result = geoip_record_by_name('72.30.81.165');
var_dump($result);



Gives us:

Determine the binary directory
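As an alternative to reading the path out of the warning, the extension can be asked directly. This is a minimal sketch, assuming the City edition of the database (the GEOIP_CITY_EDITION_REV0 constant) is the one being installed; describe_geoip_setup() is a hypothetical name. It uses geoip_db_filename() and geoip_db_avail(), and falls back to a plain message when the extension is not loaded.

```php
<?php
// Hypothetical helper: report where the geoip extension expects the City database.
function describe_geoip_setup(): string
{
    if (!extension_loaded('geoip')) {
        return 'geoip extension is not loaded';
    }

    // Assumption: the City edition is the database in use
    $path  = geoip_db_filename(GEOIP_CITY_EDITION_REV0);
    $state = geoip_db_avail(GEOIP_CITY_EDITION_REV0) ? 'found' : 'missing';

    return sprintf('City database expected at %s (%s)', $path, $state);
}

echo describe_geoip_setup(), PHP_EOL;
```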


Now save the binary to the directory mentioned in the PHP warning, reload your script, and the warning should disappear. Let’s try again with some more code:

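<?php
ini_set('display_errors', true);
error_reporting(E_ALL | E_STRICT);

// List every function the geoip extension exposes
$functions = get_extension_funcs('geoip');
var_dump($functions);

// Look up the full record for a single IP address
$result = geoip_record_by_name('72.30.81.165');
var_dump($result);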



Gives:

array
0 => string 'geoip_database_info' (length=19)
1 => string 'geoip_country_code_by_name' (length=26)
2 => string 'geoip_country_code3_by_name' (length=27)
3 => string 'geoip_country_name_by_name' (length=26)
4 => string 'geoip_continent_code_by_name' (length=28)
5 => string 'geoip_org_by_name' (length=17)
6 => string 'geoip_record_by_name' (length=20)
7 => string 'geoip_id_by_name' (length=16)
8 => string 'geoip_region_by_name' (length=20)
9 => string 'geoip_isp_by_name' (length=17)
10 => string 'geoip_db_avail' (length=14)
11 => string 'geoip_db_get_all_info' (length=21)
12 => string 'geoip_db_filename' (length=17)
13 => string 'geoip_region_name_by_code' (length=25)
14 => string 'geoip_time_zone_by_country_and_region' (length=37)

array
'continent_code' => string 'NA' (length=2)
'country_code' => string 'US' (length=2)
'country_code3' => string 'USA' (length=3)
'country_name' => string 'United States' (length=13)
'region' => string 'CA' (length=2)
'city' => string 'Sunnyvale' (length=9)
'postal_code' => string '94089' (length=5)
'latitude' => float 37.4249000549
'longitude' => float -122.007400513
'dma_code' => int 807
'area_code' => int 408



With only an IP address we can easily get the country, postal code, longitude and latitude, and even the area code of the user.
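To show how such a record might be used on a page, here is a small sketch that turns the array into a display label. The helper name location_label() is hypothetical, and the sample values are copied from the output above rather than from a live lookup. geoip_record_by_name() returns false when the lookup fails, so the helper handles that case too.

```php
<?php
// Hypothetical helper: turn a geoip_record_by_name() result into a short label.
function location_label($record): string
{
    if (!is_array($record)) {
        return 'Unknown location'; // lookup failed or the IP is not in the database
    }

    // Keep only the non-empty parts, from most to least specific
    $parts = array_filter([
        $record['city'] ?? '',
        $record['region'] ?? '',
        $record['country_name'] ?? '',
    ], 'strlen');

    return $parts ? implode(', ', $parts) : 'Unknown location';
}

// Sample values copied from the var_dump() output above
$sample = ['country_name' => 'United States', 'region' => 'CA', 'city' => 'Sunnyvale'];
echo location_label($sample), PHP_EOL; // Sunnyvale, CA, United States
```

With the extension installed, the same helper could be fed the result of geoip_record_by_name() for the visitor's IP.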

Performance

I initially thought that the PECL version would outperform the pure PHP version by a small percentage. I was wrong. The PECL version was much faster. Here are some informal benchmarks.

| Library    | Iterations | Total  | Avg per request | Notes                    |
|------------|------------|--------|-----------------|--------------------------|
| PECL GeoIP | 10,000     | 0.7s   | 0.07ms          |                          |
| Pure PHP   | 10,000     | 49.2s  | 4.92ms          |                          |
| PECL GeoIP | 1          | 0.08ms | 0.08ms          | Typical real world usage |
| Pure PHP   | 1          | 2.4ms  | 2.4ms           | Typical real world usage |

As a validation of my results I benchmarked the pure PHP library being used in a web application and had comparable results to my benchmarks (5.9ms per IP lookup versus the 2.4ms above).
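The benchmark script itself is not shown, so the following is only a sketch of how per-request averages like these could be measured: time a loop with microtime() and divide the total by the number of iterations. The function average_ms() is a hypothetical name, and the stand-in workload used when the geoip extension is missing is an assumption made so the script runs anywhere.

```php
<?php
// Hypothetical timing harness; not the script used for the numbers above.
function average_ms(callable $fn, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    $totalSeconds = microtime(true) - $start;

    // Average per call, converted from seconds to milliseconds
    return ($totalSeconds * 1000) / $iterations;
}

// Use the real lookup when the extension is available, otherwise a stand-in workload
$lookup = function_exists('geoip_record_by_name')
    ? function () { return @geoip_record_by_name('72.30.81.165'); }
    : function () { return ip2long('72.30.81.165'); };

printf("%.4f ms per call\n", average_ms($lookup, 10000));
```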

Conclusion

Because of the ease of implementation, the low cost, and the minimal performance overhead, there is much to be gained by adding IP Geolocation to your web application. The PECL package is the ideal configuration because it provides a faster experience with less code to maintain. The pure PHP library is nonetheless still relatively fast and thus still worth it.

This is still far from a perfect solution. False positives can occur, anonymous proxies mess everything up, and IP address assignments are constantly changing. Also, what about users who simply do not want to share their location? There are privacy issues. This is currently a hot topic: the W3C Geolocation API is being actively worked on, and Mozilla, Opera and others are working to improve location awareness on the web, something I am looking forward to.

More reading:

Interesting Geolocation presentation
GeoIP functions in the PHP manual
Cool Geo* stuff at Y!


Simplifying Data Filtering

This post doesn’t strictly follow my weekly PECL package series per se, but is related by the fact that the subject was briefly an experimental PECL package.

Reinventing the wheel. This is something that programmers do over and over and over again. I have come up with a few hypotheses as to why this is the case:

  • Doesn’t know any better. This is the type of programmer that wants to start writing code right away and doesn’t wonder if a solution already exists.
  • Doesn’t trust the “wheel”. This is the person who would rather write something from scratch because they don’t trust anyone else’s code.
  • Can’t find the “wheel”. This is when the person takes a quick look (e.g. 1 Google search) and decides they must write it themselves.

I know several people from each of the aforementioned categories. Some are simply clueless with regard to reusability and others are just hard-headed. When I first studied software engineering I loved just sitting down with a can of Coke and typing as much code as possible, as fast as possible. I remember doing pair programming and having my partner comment “You are typing too fast”, and responding with a clever smile. At university I was even forced to re-implement data structures, such as linked lists, to grasp the basics of how they work. In an educational context I think this is a good idea, but not when you are working for real, i.e. in a real company. In my case, I eventually slowed down my typing and thought through what I wanted to do first, did some research, and then proceeded.

Back to the subject at hand

How many times have you wanted to validate an e-mail address?

How many times have you wanted to sanitize input?

How many times have you wanted to validate anything really?

For the first question, my typical thought process would be to work out which characters are allowed and which are not, decide that a regular expression would be ideal for the situation, and Google it. I would then check whether any PEAR packages can help out, such as Validate. If I was using a framework, such as Zend Framework, I would check what it offers; ZF has many classes pertaining to validation and filtering. Meanwhile, our wheel reinventers would start writing a regex, invariably forgetting or simplifying the rules, or ending up with a page-long regular expression. Those who Google, say, “php email validation” are presented with over 1 million results containing spiffy regular expressions.

It gets simpler

Available since PHP 5.1 and bundled with PHP as of 5.2 (late 2006), the PHP Filter extension makes it way simpler. This extension provides several functions that allow you to do two types of filtering: validation and sanitization. Validation can be done on emails, IPs, URLs, regular expression matches, and more. Data can be sanitized with many filters, but most importantly it can be sanitized similarly to htmlentities().

<?php
var_dump(filter_var('test@test.com', FILTER_VALIDATE_EMAIL)); // returns the string test@test.com
var_dump(filter_var('test@test', FILTER_VALIDATE_EMAIL)); // returns false (echo would print nothing here)

How easy is that? It doesn’t get any easier than that. Actually, it can. Suppose we want to validate our data from HTTP GET or POST:

<?php
echo filter_input(INPUT_GET, 'email', FILTER_VALIDATE_EMAIL); // validates $_GET['email']
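The same pattern extends to the other validators mentioned earlier. The sketch below runs a few of them through filter_var(); the sample values, the age range, and the postal-code pattern are all made up for illustration. Each call returns the (possibly converted) value on success and false on failure.

```php
<?php
// Each call returns the (possibly converted) value on success, or false on failure.
$results = [
    'ip'     => filter_var('72.30.81.165', FILTER_VALIDATE_IP),
    'bad_ip' => filter_var('999.1.1.1', FILTER_VALIDATE_IP),
    'url'    => filter_var('http://example.com/a?b=1', FILTER_VALIDATE_URL),
    // Integer validation with a range; the string '42' comes back as int 42
    'age'    => filter_var('42', FILTER_VALIDATE_INT, [
        'options' => ['min_range' => 1, 'max_range' => 120],
    ]),
    // Hypothetical pattern for a Canadian-style postal code
    'postal' => filter_var('H0H 0H0', FILTER_VALIDATE_REGEXP, [
        'options' => ['regexp' => '/^[A-Z]\d[A-Z] \d[A-Z]\d$/'],
    ]),
];

var_dump($results);
```

When using filter_input() instead, keep in mind that it returns null (not false) when the requested variable is not set at all, so the two cases can be told apart with a strict comparison.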

This extension also supports sanitization of data, e.g. the removal of invalid characters. This is especially useful for preventing XSS attacks, and it handles character encoding issues well.

<?php
echo "Welcome, " . $_GET['name']; // Not good. Set $_GET['name'] = <script>alert('xss');</script>
echo "Welcome, " . filter_input(INPUT_GET, 'name', FILTER_SANITIZE_STRING); // Safe now :)
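One caveat for current PHP versions: FILTER_SANITIZE_STRING is deprecated as of PHP 8.1. A possible replacement, sketched here under the assumption that the value is only being echoed into HTML, is to escape it with htmlspecialchars() (or, with filter_input(), the FILTER_SANITIZE_FULL_SPECIAL_CHARS filter). Note that this escapes the markup rather than stripping it, so the behavior is not identical. The helper name safe_name() is hypothetical.

```php
<?php
// Hypothetical helper: escape a value for output inside HTML text.
function safe_name($raw): string
{
    // ENT_QUOTES escapes both quote styles; ENT_SUBSTITUTE replaces invalid UTF-8 sequences
    return htmlspecialchars((string) $raw, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

echo 'Welcome, ' . safe_name("<script>alert('xss');</script>"), PHP_EOL;
// Welcome, &lt;script&gt;alert(&#039;xss&#039;);&lt;/script&gt;
```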

Conclusion

The filter extension provides an easy way to sanitize and validate input. The way it works may not suit everyone’s needs: the default behavior is not what everyone wants and there are quirks. Still, most of the functions allow fine-grained option settings (even callbacks), making this extension easy to use yet customizable for specific needs. For some reason I don’t see many people use this extension, let alone know that it exists. For something that’s been bundled with PHP for several years this is unfortunate.
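To make the point about options and callbacks concrete, here is a minimal sketch using filter_var_array() with a per-field definition. The field names, the age limits, and the name-cleaning callback are invented for illustration.

```php
<?php
// Per-field filter definition: a plain filter, a filter with options, and a callback
$definition = [
    'email' => FILTER_VALIDATE_EMAIL,
    'age'   => [
        'filter'  => FILTER_VALIDATE_INT,
        'options' => ['min_range' => 18, 'max_range' => 120],
    ],
    'name'  => [
        'filter'  => FILTER_CALLBACK,
        // Hypothetical clean-up rule: trim whitespace and capitalize the first letter
        'options' => function ($value) {
            return ucfirst(trim($value));
        },
    ],
];

// Stand-in for $_POST or $_GET
$raw = ['email' => 'test@test.com', 'age' => '17', 'name' => '  che '];

$clean = filter_var_array($raw, $definition);
var_dump($clean); // email is kept, age is false (below min_range), name becomes "Che"
```

Passing the whole definition in one call keeps the rules for a form in a single place, which is one way to avoid scattering ad hoc checks through the code.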

Additional reading on the filter extension: here and here.
