Data Mining Emigration Rate
July 13, 2012
The Internet has enabled quite a few things. Foremost of these is a new paradigm for commerce, but buried among all the digital debris is a new investigational method called data mining. In data mining, large volumes of data are analyzed to discover unexpected correlations, or to quantify a suspected correlation.
Data mining has become an important specialty, and I've mentioned it in quite a few previous articles:
• Data, Data, Everywhere (February 7, 2007) » The Exchangeable Image File Format (EXIF) data of digital photographs on the Flickr photo-sharing web site were used to rank the most popular cameras. Similar data were used to determine whether people preferred sunrise or sunset (sunset is slightly more popular than sunrise). It was also possible to extract from these data reasonable curves of the times for sunrise and sunset for the narrow band of latitudes in which the majority of Earth's population resides.
• Basic Research (October 22, 2010) » I repeated a data-mining exercise, first done by Roger Pielke, Jr., of the Center for Science and Technology Policy Research of the University of Colorado at Boulder. I looked for the occurrence of the phrase, "basic research," in The New York Times as a function of time. Interest in basic research peaked quite suddenly after it was realized that technology derived from what had been very basic research had won World War II. There was an additional surge after Sputnik was launched. Today, basic research seems to be in retreat.
• Culturomics (January 13, 2011) » The Ngram Viewer from Google is a way to examine the frequency of occurrence of words as a function of time for the many books and other publications that Google has indexed. The Ngram concept has spawned a web site, www.culturomics.org, with analyses of cultural trends. Such analysis found that 8,500 new words enter the English language annually, but many of these aren't found in dictionaries. It was also discovered that the rate of mention of individual inventions was about twice as fast at the end of the nineteenth century as at its start.
• Hedonometrics (January 31, 2011) » Data mining of Twitter accounts answered a question that I've had since childhood; namely, when do most families have dinner? Six PM is the time most people have dinner, although any time from 5:00 PM to 7:00 PM is nearly as likely. A research team has tracked words expressing happiness from 50 million Twitter accounts to create a "Hedonometer" that shows the instantaneous happiness index of Twitter users and, by inference, the world at large.
• Numb3rs (June 6, 2012) » Remote sensing of the Earth via satellite has revealed an interesting correlation between the density of trees and the income level of urban neighborhoods. Each percent increase in per capita income correlates with an increased tree cover of 1.76 percent.
People have been especially mobile in the past few centuries. The United States was populated by people from other countries, and there is still considerable migration between countries today.
Countries are more likely to keep track of those who enter than those who leave. Data on which people enter a country, as shown in the following figure, are often easy to find, but data on how many people leave specific countries are not tabulated.
Distribution of emigration of Poles (More red = greater numbers). There are many people of Polish origin in the United States. My maternal grandparents were both Polish immigrants to the US. (Via Wikimedia Commons)
To tackle the problem of accurately determining emigration rates, scientists from the Max Planck Institute for Demographic Research (MPIDR, Rostock, Germany) and Yahoo! Research used a data mining technique involving email.[1-3] The technique allowed assemblage of emigration statistics for nearly every country of the world. These mined data included the emigrant's gender and age, something that's rarely possible with official statistics for emigration.
Emilio Zagheni of MPIDR, one of the authors of the study presented at a meeting of the Association for Computing Machinery (ACM),[3] summarized the problems with official records: the data are outdated and inconsistent; official records are difficult to use; emigrants tend not to leave a trail; and there is no clear definition of who should be called a migrant.[1-2]
The data mining idea is simple: you are from where you email. Zagheni and co-author Ingmar Weber of Yahoo! Research used the IP addresses of email messages sent from 43 million anonymous Yahoo! accounts between September 2009 and June 2011 for geolocation.[1-2] These accounts contained the self-reported birthdate and gender of the sender.
When an account-holder started to send emails exclusively from a different location, it was presumed that the person had moved. The subject and content of a message were not accessed, and the account-holders were kept anonymous, a feature of this study that pleases me and other Internet privacy advocates.[4]
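The "moved when the emails move" rule can be illustrated with a short Python sketch. This is not the code or the exact criterion used in the study; the `Message` record, the `infer_move` function, and the `min_run` threshold are all hypothetical names introduced here, and the sample log is invented. The sketch assumes that each message has already been geolocated to a country, and it treats an account as having moved when its most recent messages come exclusively from a country other than the one that dominated its earlier messages.

```python
from collections import Counter, namedtuple
from datetime import date

# Hypothetical record: the date a message was sent and the country
# obtained by geolocating its IP address. No subject or content is kept.
Message = namedtuple("Message", ["sent_on", "country"])

def infer_move(messages, min_run=3):
    """Return (origin, destination, first_date_at_destination), or None.

    Assumption for this sketch: a move is presumed when the final
    `min_run` or more messages all come from one country, and that
    country differs from the most common country of the earlier messages.
    """
    msgs = sorted(messages, key=lambda m: m.sent_on)
    if not msgs:
        return None
    destination = msgs[-1].country
    # Walk backwards to find where the exclusive final run begins.
    start = len(msgs)
    while start > 0 and msgs[start - 1].country == destination:
        start -= 1
    earlier, run = msgs[:start], msgs[start:]
    if not earlier or len(run) < min_run:
        return None
    origin = Counter(m.country for m in earlier).most_common(1)[0][0]
    if origin == destination:
        return None  # looks like a trip abroad, not a move
    return origin, destination, run[0].sent_on

# Invented example: one visit to the US, then a permanent switch.
log = [
    Message(date(2010, 1, 5), "MX"),
    Message(date(2010, 2, 1), "MX"),
    Message(date(2010, 3, 1), "US"),
    Message(date(2010, 3, 10), "MX"),
    Message(date(2010, 6, 1), "US"),
    Message(date(2010, 7, 1), "US"),
    Message(date(2010, 8, 1), "US"),
]
print(infer_move(log))
```

The earlier messages are allowed to be mixed, so a short visit to the destination before the final switch does not prevent the move from being detected.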
The study produced the first data on US emigration by age and gender, as shown in the figure. Said Zagheni, "In the U.S., many statistics are collected about people who move into the country, but there is no system that keeps track of people who move out."[1]
US emigration rate, 2009-2011, as determined by the data mining technique of ref. 3. (Graph rendered by the author using Gnumeric)
The study also addresses the interesting example of mobility across the US-Mexican border. Emigrants from Mexico to the US generally spent time in the US before their move, or visited Mexico shortly after their move to the US. People in their thirties were more likely to emigrate from Mexico to the US than people in their fifties or older.[3]
There was, of course, considerable processing of the data to remove spammers and the like, and to adjust for the fact that older people don't email as often as younger people.[1-2] The algorithms for this were well considered, so these data appear reliable.
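To see why an age adjustment matters, here is a minimal Python sketch of one standard approach, post-stratification, in which each age group's observed rate is weighted by that group's share of the actual population rather than by its share of email users. This is an illustration under my own assumptions and not the correction used in the study; the function names, the age groups, and every number below are invented.

```python
def raw_rate(movers, users):
    """Overall rate computed directly from the email sample."""
    return sum(movers.values()) / sum(users.values())

def poststratified_rate(movers, users, population):
    """Overall rate with each age group's observed rate reweighted
    by that group's share of the real population."""
    total_population = sum(population.values())
    return sum(
        (movers[group] / users[group]) * (population[group] / total_population)
        for group in population
    )

# Invented numbers: young people dominate the email sample,
# but both groups are the same size in the real population.
movers = {"20-29": 30, "60+": 1}            # accounts presumed to have moved
users = {"20-29": 1000, "60+": 100}         # accounts observed
population = {"20-29": 5000, "60+": 5000}   # people in each age group

print(raw_rate(movers, users))
print(poststratified_rate(movers, users, population))
```

In this invented example, the raw rate is pulled towards the rate for the younger group, since that group supplies most of the accounts. The reweighted rate gives both groups equal influence, matching their equal share of the population.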
References:
1. You are where you e-mail: Global migration trends discovered in email data, Max Planck Institute for Demographic Research Press Release, June 25, 2012.
2. You are where you e-mail: Global migration trends discovered in email data, Max Planck Institute Press Release, June 25, 2012.
3. Emilio Zagheni and Ingmar Weber, "You are where you E-mail: Using E-mail Data to Estimate International Migration Rates," ACM Web Science Conference Proceedings, June 25, 2012 (PDF File).
4. Tikalon has been a member of the Electronic Frontier Foundation for quite a few years.
5. Internet use by age and sex, United Nations Economic Commission for Europe Web Site.