Question
I hope you won't close the question. Even though there are no computers involved, it is still about information and security, and I think that security experts are the ones who will be able to help best.
I want to do some user research. I need people to fill out a questionnaire, and then fill out another questionnaire two months later. I need to guarantee them anonymity, but it would still be very useful if I could match the answers from a person in round A to the answers from the same person in round B.
Even if this is done online, I don't think that I can let a computer system find out something about them so it can do the matching for me. In theory, I could ask them for a name and store its MD5 hash. In practice, if I tell the participants they will be anonymous and then ask them for a name, I will lose their trust. And the beginning of a questionnaire is not a good place to educate random people about what MD5 is. But to make this even harder, I think that I will do my next survey using pen and paper, for logistical reasons.
If I gave people tokens, I think they would lose them during the two months. So the best solution I can think of is some sort of manual hash. For example, I could ask them "please fill in the second and fifth letter of your surname and the day of the month you were born". So my question is, how do I come up with a good function of this kind?
Which data points about a person can be used? They must be guaranteed to exist (my above example breaks down if the person has a four-letter surname), highly individual (but not 100% unique), and the person must know them without having to look them up somewhere.
Is there some convenient way to calculate how many digits/letters I need to ask for to ensure a collision chance below X% in a group of Y people?
Are some of the possible data more problematic than others? For example, could it be that people would be more reluctant to write down the first letter of their surname than the second, because they think it would be easy for someone to try to look them up in a "brute force attack" and find out who they are?
How do I find out the highest level of complexity beyond which people either don't play along or start making mistakes?
Explanation / Answer
You could ask for the last three digits of their cellphone number.
Just be clear about why you are doing so, and explain why they won't be traceable this way. Otherwise, they'll tell you numbers (or letters, or anything) at random, and the purpose will be defeated.
As for the probability: suppose the distribution is flat and the "token" can take N values (in this example that would be 998, since I think "000" might not be a valid ending in some countries), and you are asking M people. Then the probability of having at least one collision is 1 - (998/998)(997/998)(996/998)...((998-M+1)/998), with one factor per person.
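A minimal Python sketch of that product, assuming a flat distribution as stated above. The function name `collision_probability` and its parameters are made up for illustration; `n_values` plays the role of N and `n_people` the role of M.

```python
def collision_probability(n_values: int, n_people: int) -> float:
    """Probability that at least two of n_people share the same token,
    assuming each token is drawn uniformly from n_values possibilities."""
    if n_people > n_values:
        # More people than tokens: a collision is certain.
        return 1.0
    p_all_distinct = 1.0
    # One factor per person: N/N, (N-1)/N, ..., (N-M+1)/N
    for k in range(n_people):
        p_all_distinct *= (n_values - k) / n_values
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    for n_values, n_people in [(998, 200), (10_000, 200), (10_000, 500)]:
        p = collision_probability(n_values, n_people)
        print(f"N={n_values}, M={n_people}: P(at least one collision) = {p:.6f}")
```

Note that this is the chance of at least one collision anywhere in the group, which is close to 1 for all the group sizes discussed here. What matters in practice is how many collisions to expect.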
With 200 people, the most likely outcome is around 18-19 collisions, and it is very unlikely that you'd get fewer than 10 or more than 30. Since each collision makes at least two people indistinguishable, you'll be able to "recognize" roughly 165 people out of 200.
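These figures can be approximated with a short Python sketch. It assumes uniformly distributed tokens and counts a "collision" as each person whose token was already used by someone else in the group (that is, people minus distinct tokens); this counting convention is an assumption about how the figures were obtained. The function names are hypothetical.

```python
def expected_collisions(n_values: int, n_people: int) -> float:
    """Expected number of people whose token duplicates one already seen,
    i.e. n_people minus the expected number of distinct tokens."""
    p_value_unused = (1 - 1 / n_values) ** n_people
    expected_distinct = n_values * (1 - p_value_unused)
    return n_people - expected_distinct


def expected_unique_people(n_values: int, n_people: int) -> float:
    """Expected number of people whose token is shared with nobody else."""
    p_nobody_else_matches = (1 - 1 / n_values) ** (n_people - 1)
    return n_people * p_nobody_else_matches


if __name__ == "__main__":
    n_values, n_people = 998, 200
    print(f"expected collisions:    {expected_collisions(n_values, n_people):.1f}")
    print(f"expected unique people: {expected_unique_people(n_values, n_people):.1f}")
```

Under these assumptions, 998 values and 200 people give an expectation of between 18 and 19 collisions and a little over 160 uniquely identifiable people.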
With four digits and 200 people, you can expect 1-2 collisions; the chance of getting more than 8-9 collisions is negligible.
With 500 people and four digits (or anything else that can take around ten thousand randomly distributed values: you can get one digit from the car's license plate, one from the last digit of the street address, one from the last digit of the year of birth, and so on) you can expect 12-13 collisions, and again there is no real chance of getting fewer than 2 collisions or more than 22.
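To go the other way and ask how many digits are needed for a given group size, one option is to increase the number of digits until the chance of any collision drops below a chosen threshold. The Python sketch below assumes uniformly distributed decimal digits; `digits_needed`, `max_probability` and `max_digits` are hypothetical names introduced for illustration.

```python
def collision_probability(n_values: int, n_people: int) -> float:
    """Probability of at least one shared token among n_people (uniform tokens)."""
    if n_people > n_values:
        return 1.0
    p_all_distinct = 1.0
    for k in range(n_people):
        p_all_distinct *= (n_values - k) / n_values
    return 1.0 - p_all_distinct


def digits_needed(n_people: int, max_probability: float, max_digits: int = 12) -> int:
    """Smallest number of decimal digits whose 10**d possible values keep
    the chance of any collision below max_probability."""
    for digits in range(1, max_digits + 1):
        if collision_probability(10 ** digits, n_people) < max_probability:
            return digits
    raise ValueError("no digit count up to max_digits meets the threshold")


if __name__ == "__main__":
    # Example: 200 participants, at most a 5% chance of any collision at all.
    print(digits_needed(200, 0.05))
```

Requiring almost no chance of any collision is a strict criterion and quickly demands more digits than people will comfortably fill in; tolerating a handful of unmatched participants, as in the estimates above, needs far fewer.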