upPIR is a protocol by which one can privately retrieve information from a set of mirrors. The idea is that the vendor produces information, but neither the vendor nor the mirrors know which specific information is requested by the client. They only learn that the client requested some piece of information on the mirror.
To see why this is useful, let's go through an example. For example, suppose an administrator for an Apache web server finds out there is a security update they should install. For many distributions, they will contact a mirror to retrieve this update. However, mirrors are run by external, untrusted parties by many distributions. Thus, when the administrator issues the request to get a patch for Apache, they are notifying an untrusted party that they have an unpatched security flaw! Ideally, there would be some way for the client to request the update from the mirror without the mirror knowing which update is being requested.
This problem crops up in many other domains. People may be uncomfortable with their company or ISP knowing which information they are looking up on WebMD. Similarly, individuals may want to obtain information that would otherwise give away their sexual orientation, religious beliefs, political beliefs, or other sensitive data. By packaging this information with other information that could be potentially be requested, the user's preference can be masked.
How could this possibly work?
One question many people have when hearing of the problem, is whether providing this property is even possible. First, it's is clear there is a trivial solution --- just download the entire mirror's contents. However, this is very inefficient. Can we do better?
Suppose a client wants to privately retrieve an update from two mirrors so that neither mirror knows which update the client is retrieving. Let’s suppose that a vendor has produced a release containing 4 updates of the same size called A, B, C, and D. This release is currently being served by the two mirrors. The client will generate a random string of bits of length 4 (one bit for each update in the release) and send this to the first mirror. The mirror receives the random string and then will XOR together a response containing each update where there is a 1 value in that position. For example, the string 1010 would contain A XOR C. Then the client takes the random string they sent to the first mirror and flips the bit of the update they want. Let’s suppose the client wanted update D, if the first string was 1010, they would send 1011 to the second mirror. The second mirror sends back a response that contains all of the updates with a 1 XORed together (or A XOR C XOR D in our example). The client can then XOR the retrieved strings together to obtain the update D.
In our simple example, we dealt with a client that is retrieving an update from two mirrors. However, this only provides protection if those two mirrors do not collude. If a client wishes to protect against N-1 mirrors, the client can generate N-1 random bit strings to send to the first N-1 mirrors. The N th mirror should be sent the first N-1 strings XORed together with the bit for the desired update flipped. The client can XOR all N results together to obtain the result.
Each mirror receives a random string of bits (or a random string with an unknown bit flipped) and so gains no information about what update is being retrieved. Only by sharing the bit strings could the mirrors discover which update the client is retrieving. However, the client’s ISP or access point could simply read all of the client’s bit strings and then decode which update is desired. To protect against an adversary who can observe all network traffic, the client can communicate with each mirror over an encrypted channel.
Is this fast?
That really depends on the size of the data the mirror is storing. If that data fits in memory and the update size is chosen reasonably (between 512KB and 4MB for most systems), the client perceived latency is similar to FTP even when privately requesting content from three mirrors.
The throughput for a mirror is also comparable to popular protocols. With our C-based datastore implementation, a mirror running on two-year old hardware can serve a 2MB security update from Ubuntu (1.4GB) at better than T3 speed. More results to come.
Where can I get more information?
In the near future, we will post a paper here that explains our system in more detail. Please check back or email justinc _AT_ cs DOT washington PERIOD edu to request a copy.
How can I try out your implementation?
We are currently in the process of re-implementing our solution in preparation for public release. I will move the code to SVN, build a Trac instance, set up a ticket system, etc. when this a little closer to release. A copy of the current pre-alpha code is attached to this page. USE AT YOUR OWN RISK.
Important notes / TODOs:
I might remove the use of session in the code. I'm not sure this is really essential, but I clearly could do this over HTTP or another protocol. This might make it slightly easier to code, but probably isn't a major concern.
The client / mirror communications are currently in the clear. This will be fixed in a future release. I'm leaning toward using HTTPS for authentication and confidentiality and the RSA implementation in GPG for non-repudiation.
While the code is written to be extensible, I'm still trying to decide whether to have an explicit plug-in model. As such, the exact extension interfaces in the code may change.
26 June 2011: JSON is used now as the data serialization format.
25 May 2011: This code now uses a C-based implementation of the XOR datastore. This is not yet performance optimized, so expect a moderate performance boost in the future.
Thanks for checking out our code! Let me know if you have any questions or comments.
Justin Cappos (jcappos _AT_ poly DOT edu)