The theory of anonymous big data

May 15, 2013 2:42am

ITEM: Researchers claim to have developed a model for crunching the Big Data generated from cellular networks without revealing the identities of mobile users.

The research team from AT&T, Rutgers University, Princeton, and Loyola University has built a “mobility model” of Los Angeles and New York City, using location data points from mobile voice calls and text messages on AT&T’s network in those two cities. The model aggregates the data, produces representative “synthetic call records”, then mathematically obscures any data that could identify people, reports Technology Review:

The new approach starts by aggregating traces of real human movements, then identifying common locations that might indicate home, work, or school. Next, it creates a set of transportation models. These models generate route tracks of people that the researchers call “synthetic,” because they are merely representative of the aggregate data, and not of actual people.

But the third part is the key. Even these supposedly synthetic records can closely match real ones (especially when the underlying aggregate sample is small). So an algorithm, using an emerging technique known as differential privacy, calculates exactly how high this risk is, and how to reduce it by altering the data.

In other words, you can inject “noise” into the model, such as changing the aggregated home and work locations or call times to reduce reliance on the data of a single user.
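To make the noise-injection idea concrete, here is a minimal sketch of the Laplace mechanism, a standard building block of differential privacy, applied to aggregated location counts. It is not the researchers' actual algorithm: the cell names, the `privatize_counts` helper, and the assumption that each user contributes to at most one count (sensitivity of 1) are all hypothetical, chosen only for illustration.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential draws with mean `scale`
    # follows a Laplace distribution centred on 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize_counts(counts, epsilon, sensitivity=1.0, seed=None):
    """Return noisy copies of aggregated counts (Laplace mechanism).

    Assumption: one user changes any single count by at most
    `sensitivity`. Smaller epsilon means more noise and more privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    # Clamping at zero is post-processing and does not weaken the guarantee.
    return {
        cell: max(0.0, n + laplace_noise(scale, rng))
        for cell, n in counts.items()
    }


# Hypothetical aggregate: number of users whose inferred home is in each cell.
home_counts = {"cell_A": 120, "cell_B": 3, "cell_C": 47}
print(privatize_counts(home_counts, epsilon=0.5, seed=1))
```

The small count for `cell_B` is the kind of case the quoted passage warns about: when only a handful of users sit behind an aggregate, the noise is large relative to the count, which is what keeps any single user's data from dominating the result.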

That matters because other research has already demonstrated that supposedly anonymous mobile user data can be traced back to individuals. In March, researchers at MIT and the Université Catholique de Louvain in Belgium analyzed data from a million and a half mobile users and found that just four location reference points were enough to uniquely identify 95% of them.
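The re-identification risk described above can be illustrated with a toy uniqueness test: pick a few points from one user's trace and count how many traces in the dataset contain all of them. This is only a sketch of the general idea, not the method used in the MIT/Louvain study; the trace format (sets of `(cell, hour)` pairs), the `fraction_unique` helper, and the sample data are hypothetical.

```python
import random


def fraction_unique(traces, k, seed=None):
    """Share of users singled out by k points drawn from their own trace.

    `traces` maps a pseudonymous user id to a set of (cell, hour) points.
    Users with fewer than k points are skipped.
    """
    rng = random.Random(seed)
    eligible = 0
    unique = 0
    for points in traces.values():
        if len(points) < k:
            continue
        eligible += 1
        # Sort first so sampling from a set is reproducible for a given seed.
        sample = set(rng.sample(sorted(points), k))
        # Count every trace (including this user's) that contains the sample.
        matches = sum(1 for other in traces.values() if sample <= other)
        if matches == 1:
            unique += 1
    return unique / eligible if eligible else 0.0


# Hypothetical traces: users "a" and "b" share two of their three points.
traces = {
    "a": {("cell1", 8), ("cell2", 9), ("cell3", 18)},
    "b": {("cell1", 8), ("cell2", 9), ("cell4", 18)},
    "c": {("cell5", 8), ("cell6", 9), ("cell7", 18)},
}
print(fraction_unique(traces, k=1, seed=0))
print(fraction_unique(traces, k=3, seed=0))  # 1.0: full traces are all distinct
```

Even in this tiny example, adding points makes users easier to single out, which is why removing names from the records does not by itself make the data anonymous.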

The question, of course, is to what extent the above model will work in the real world, or how long it will take for someone to find a way around it.

For that matter, there’s also the question of how closely cellcos will follow that model, which will depend on factors such as the local regulatory environment and whether they can monetize anonymous data as effectively as identifiable data. While anonymous data is theoretically useful for street traffic planning and for mapping ethnic divides, malaria outbreaks and poverty levels, there’s not necessarily a commercial business model in it.

Also, there’s arguably competitive pressure from OTT internet players like Google and Facebook who are already gathering tons of user data for the benefit of their advertising customers (and, sometimes, whatever government agencies might want access to it). If they aren’t keeping all their data anonymous, why should cellcos be expected to do so?

Meanwhile, since we’re on the subject of Big Data, its commercial value and the level of anonymity it provides, you might want to check out this piece from Kate Crawford of the MIT Center for Civic Media, which addresses five myths about Big Data, including this one: “Big Data Is Anonymous, so It Doesn’t Invade Our Privacy.”

Crawford’s verdict: “Flat-out wrong.”