Method for creating location database based on logged query information

Post by foobar12 » Sat, 29 Sep 2007 07:26:29


Here we have a web service that accepts as inputs the estimated
location of a web site visitor (such as the billing or shipping
address associated with a credit card) and the IP address of that
visitor. The web service returns the distance between the IP
address location and the entered location, among other outputs.
The IP address/location pairs are extracted from the log files of
this web service and used to construct a database associating IP
netblocks with estimated locations. First, possibly incorrect
IP/location pairs are removed, for instance if the location is
associated with a fake or fraudulent signup or transaction, or if
it is believed the billing address is not indicative of the true
location.
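
As a rough sketch of the two steps above (the distance output and the
cleaning of logged pairs), something like the following could work.
The record keys "fraud" and "address_unreliable" are hypothetical
flags invented for illustration, and the haversine formula is just one
common way to compute the distance; the actual web service may do it
differently.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # min() guards against rounding pushing sqrt(a) slightly above 1
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def clean_pairs(records):
    """Drop logged IP/location records flagged as suspect.

    Each record is a dict; 'fraud' and 'address_unreliable' are
    hypothetical boolean flags set by some upstream check.
    """
    return [r for r in records
            if not r.get("fraud") and not r.get("address_unreliable")]
```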

To construct the database, we group IP addresses that should
be near each other, for example by subnet (e.g. 99.99.99.0/24)
or by last router hop on the traceroute. For each group of IP
addresses, we calculate the median or average of the associated
locations, remove outlying IP address/location pairs, and
re-calculate the median or average. We then select the nearest
town or city to the median or average, possibly favoring a city
with a larger population or with more frequent occurrences in the
IP/location pairs. Besides estimating the most likely location
for an IP address, we can determine the expected error of the IP
resolution, the confidence and likelihood of an IP address being
at the reported location, as well as the estimated difference
between the actual location and the reported location, using
standard statistical methods.
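
A minimal sketch of the grouping and median steps, assuming grouping
by /24 subnet, a coordinate-wise median, and a fixed 500 km outlier
cutoff. The function names, the cutoff, and the city list format are
all assumptions for illustration; the coordinate-wise median also
ignores longitude wrap-around near +/-180 degrees, and the population
weighting mentioned above is left out.

```python
import ipaddress
import math
from collections import defaultdict
from statistics import median


def dist_km(a, b):
    """Haversine distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(a[0]), math.radians(b[0])
    dlam = math.radians(b[1] - a[1])
    h = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(h)))


def group_by_netblock(pairs, prefix_len=24):
    """Group (ip, lat, lon) tuples by the netblock containing the IP."""
    groups = defaultdict(list)
    for ip, lat, lon in pairs:
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        groups[str(net)].append((lat, lon))
    return dict(groups)


def robust_center(points, max_km=500.0):
    """Median location, recomputed after dropping points beyond max_km."""
    def center(pts):
        return (median(p[0] for p in pts), median(p[1] for p in pts))

    first = center(points)
    # fall back to all points if the cutoff would discard everything
    kept = [p for p in points if dist_km(first, p) <= max_km] or list(points)
    return center(kept), kept


def nearest_city(center, cities):
    """cities: list of (name, lat, lon); returns the closest city's name."""
    return min(cities, key=lambda c: dist_km(center, (c[1], c[2])))[0]


def spread_km(center, kept):
    """Median distance of kept points from the center, a crude error figure."""
    return median(dist_km(center, p) for p in kept)
```

The median distance returned by spread_km is only one possible stand-in
for the "expected error"; a percentile or a standard deviation would
serve the same purpose.
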
This can be extended to return a list of possible locations, each
with an associated probability, for any given IP address. For
example, an IP address that is cycled through 10 nearby towns
could have a 12% probability of being assigned to an end-user in
town A, an 11% probability for town B, and so on. Or an IP address
could have a 40% probability of being assigned to an end-user in
Germany, a 25% probability of being in the UK, a 20% probability
of being in France, and a 15% probability of being in Italy. These
probability distributions can be determined by running statistical
analysis on the IP address and estimated end-user location pairs,
grouped by nearby IP addresses. Adjustments can be made if the
sample data is skewed towards having more data from certain
locations. From this probability distribution, various indicators
such as confidence level and expected error can be derived.
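
To make the probability idea concrete, here is a small sketch that
turns the locations observed for one IP group into a distribution.
It assumes simple relative frequencies, with an optional
"sampling_weight" table (a hypothetical input) saying how
over-represented each location is in the logs. The expected error is
taken as the probability-weighted distance from the reported location,
which is one reasonable reading of the description above, not the
only one.

```python
from collections import Counter


def location_distribution(observed, sampling_weight=None):
    """Map each observed location label to its estimated probability.

    observed: list of location labels seen for one IP group.
    sampling_weight: optional dict, label -> relative over-representation
    of that location in the sample (hypothetical); counts are divided by
    it before normalising.
    """
    sampling_weight = sampling_weight or {}
    counts = Counter(observed)
    weights = {loc: n / sampling_weight.get(loc, 1.0)
               for loc, n in counts.items()}
    total = sum(weights.values())
    return {loc: w / total for loc, w in weights.items()}


def expected_error_km(distribution, distance_km):
    """Probability-weighted distance from the reported location.

    distance_km: dict, label -> distance in km between that location
    and the location the database would report.
    """
    return sum(p * distance_km[loc] for loc, p in distribution.items())


# The country example from the post, as 100 logged observations.
observed = ["DE"] * 40 + ["UK"] * 25 + ["FR"] * 20 + ["IT"] * 15
dist = location_distribution(observed)
best = max(dist, key=dist.get)   # most likely location
confidence = dist[best]          # its probability
```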
 
 
 


Post by foobar12 » Sat, 29 Sep 2007 07:41:31

In addition, the user-entered location data can be overlaid with data
from whois sources. The whois data can fill the gaps where the
user-entered location data is not available or not reliable.
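
One possible way to do the overlay, sketched under the assumption
that both sources have already been reduced to per-netblock
dictionaries. The "samples" count, the min_samples threshold, and the
"source" labels are made up for illustration; no real whois client is
called here.

```python
def overlay_with_whois(user_estimates, whois_locations, min_samples=5):
    """Prefer user-derived estimates; fall back to whois where they are thin.

    user_estimates: dict, netblock -> {"location": ..., "samples": int}
    whois_locations: dict, netblock -> location (already parsed from whois)
    """
    merged = {}
    for netblock in set(user_estimates) | set(whois_locations):
        est = user_estimates.get(netblock)
        if est is not None and est["samples"] >= min_samples:
            merged[netblock] = {"location": est["location"], "source": "user"}
        elif netblock in whois_locations:
            merged[netblock] = {"location": whois_locations[netblock],
                                "source": "whois"}
        else:
            # sparse user data and no whois record: keep it, but flag it
            merged[netblock] = {"location": est["location"],
                                "source": "user-sparse"}
    return merged
```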