MID/DLE: Distributed Privacy

The Problem.

As data collection becomes more intensive (e.g. through projects like Wear-IT), the problem of privacy becomes more pressing.

Good scientific practice suggests that we should always share data, but our responsibility to our participants suggests that we should keep it private.

Two paths, and a MIDDLE way.

Traditionally, the argument about privacy suggests only two options:

  • Give up privacy and let the data be used.
  • Keep privacy, and lose out on services, science, etc.

We suggest that there’s a third way.  The MIDDLE way.

Maintained Individual Data with Distributed Likelihood Evaluation.

The MIDDLE project starts with a simple question:

If this is data about you, why should I even have it?

The simple answer to this question is that I shouldn’t.  You should control all the data there is about you; end of story.  That’s Maintained Individual Data, the MID of MIDDLE.  You keep your own data.

The usual response to this is that it’s not possible for me to use that data (e.g. to provide you with customized search results or targeted ads, or to do science) unless you give me access to it. But that’s only kind of true.

Distributed Likelihood Evaluation is a way of answering questions about data without actually having direct access to the data.  Essentially, if I have a theory about how the world works, I can formalize that with a model, for example a Structural Equation Model in OpenMx.  Rather than collecting a lot of data in one place to see how well it fits my model, I can pass the model around to you.  You evaluate how well the model fits your own data, and only that measure of fit (a likelihood) comes back to me.

The DLE of MIDDLE is a set of protocols for doing this model passing.  The result is that I can tell how well my model fits everybody without having to know anything about you specifically.
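To make the idea concrete, here is a minimal sketch of model passing in Python.  It is not the MIDDLE protocol itself: the model is a simple normal distribution rather than a Structural Equation Model, and the names Participant, total_log_likelihood, and fit_mean are hypothetical, chosen only for illustration.  The point it shows is that the researcher only ever sees one number per person.

```python
import math


class Participant:
    """Holds one person's observations locally (hypothetical helper)."""

    def __init__(self, observations):
        self._observations = list(observations)  # never sent anywhere

    def log_likelihood(self, mean, variance):
        """Evaluate the researcher's normal model on this person's own data.

        Only the resulting scalar is returned to the researcher.
        """
        return sum(
            -0.5 * (math.log(2 * math.pi * variance) + (x - mean) ** 2 / variance)
            for x in self._observations
        )


def total_log_likelihood(participants, mean, variance):
    """Researcher side: add up the per-person log-likelihoods.

    Assuming people are independent, this sum equals the log-likelihood
    the researcher would have computed on the pooled data.
    """
    return sum(p.log_likelihood(mean, variance) for p in participants)


def fit_mean(participants, variance, candidates):
    """Pick the candidate mean that fits everybody best (simple grid search)."""
    return max(
        candidates,
        key=lambda m: total_log_likelihood(participants, m, variance),
    )
```

A researcher could then call fit_mean(people, 1.0, [0.0, 1.0, 2.0, 3.0]) to compare candidate values without ever holding the observations.  A real optimizer would replace the grid search, but the exchange stays the same: parameters go out, likelihoods come back.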

The Firewall Problem.

One of the harder problems facing MIDDLE is what we call the Firewall Problem.  That is, sometimes all the data you need isn’t in the same place.  In some cases (e.g. if we have a bunch of randomly-selected people who participate in our experiment), that’s OK: it’s easily handled by the MIDDLE protocol.

But there are harder cases.  What about cases where we need a person’s medical records from the hospital, data about their service in the armed forces from the VA, educational records from their university, and data about where they live from the census?  Each of those records is tightly protected by the respective institution, and that protection is enforced by regulation and law, so it’s very difficult to access all of them at once.

Our solution (now in press!) uses privacy-preserving mathematics to solve the Firewall Problem for arbitrary arrangements of data in the SEM framework.  There’s more work to be done, but we have a working prototype and are busy finding out where it breaks.
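To give a flavor of what privacy-preserving mathematics can look like, here is a sketch of one standard building block, additive secret sharing, used to add up numbers held by different institutions.  It is an illustration only and should not be read as the method in the in-press paper.  The names MODULUS, make_shares, and secure_sum are hypothetical, and the sketch assumes each institution's contribution has already been scaled to a non-negative integer.

```python
import random

# Arithmetic is done modulo a large prime so that a single share looks random.
MODULUS = 2**61 - 1


def make_shares(value, n_parties, rng):
    """Split an integer into n_parties random-looking pieces.

    The pieces add up to `value` modulo MODULUS, but any subset smaller
    than the full set reveals nothing about it.
    """
    shares = [rng.randrange(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_values, rng):
    """Add up one private value per institution without exposing any of them.

    Each institution splits its value and hands one piece to every
    institution.  Each then publishes only the sum of the pieces it
    received, and those published sums add up to the true total.
    """
    n = len(private_values)
    all_shares = [make_shares(v, n, rng) for v in private_values]
    published = [
        sum(all_shares[sender][receiver] for sender in range(n)) % MODULUS
        for receiver in range(n)
    ]
    return sum(published) % MODULUS
```

For example, if the hospital, the VA, the university, and the census each hold one piece of a quantity the model needs, secure_sum recovers the total while each firewall stays closed.  This simulation runs all parties in one process; a real deployment would pass the pieces over the network and use a cryptographically secure random source.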

Some day soon, we’re going to need a solution like this for everybody.  And MIDDLE is a first step toward having one when that happens.