I'm using the sha256 algorithm to generate a number in one of my programming pr
Question
I'm using the sha256 algorithm to generate a number in one of my programming projects. A sha256 hash is calculated from three strings: one the user can edit, one the user can view but not edit, and one the user cannot see until it has been changed to a new string.
From the hash, the first two characters are looked at to determine the number. Each of the 256 combinations of two letters/numbers that can appear in a sha256 hash is arbitrarily assigned a value with the following pattern: aa=1, ab=2, ac=3...fd=84, fe=85, ff=86...97=254, 98=255, 99=256. In the project a number from 1-100 is needed, so all of the combinations of two characters are looked at from left to right until a number in that range is found.
After testing this by generating thousands of hashes from seemingly random inputs (in PHP, each input was md5(uniqid(rand()))), and repeating the test multiple times, the average number from 1-100 was around 51.
Any idea what is causing the number to be weighted more towards combinations from "dc" to "0d" (51-100) compared to "aa" to "da" (1-49)? Is this a flawed way to determine a random number even if the input is random? What if the player has control over one of the strings added to the input like described in the first paragraph?
Explanation / Answer
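You may not have observed a real deviation at all. The mean of a uniform distribution over the integers in [1,100] is 50.5, and a single draw has a standard deviation of about 28.9, so the average of N draws has a standard error of about 28.9/sqrt(N) (roughly 0.9 for N = 1000). An average of around 51 over a few thousand samples can therefore be within sampling noise. However, you say that all combinations of two characters are considered, which I assume means that you first look at characters 0 and 1, then characters 1 and 2, and so on. If so, by the time you look at characters 1 and 2 you have already rejected characters 0 and 1, so the distribution of character 1 is necessarily biased: 156 of the 256 combinations lead to rejection, and 16 does not divide 156 evenly. With your ordering, 9 of the rejected pairs end in each of a, b, c and d, but 10 end in each of the other twelve characters, so the next pair is slightly less likely to start with a through d, which skews the accepted values upward.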
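The bias can be checked by exhaustive enumeration. The following is a minimal Python sketch (Python rather than the asker's PHP), assuming the overlapping-window reading of the scheme and assuming every hex character is independent and uniform. The names ALPHABET and pair_value are hypothetical and only serve the illustration.

```python
from fractions import Fraction
from itertools import product

# The asker's character order: letters first, then digits (aa=1 ... 99=256).
ALPHABET = "abcdef0123456789"

def pair_value(first, second):
    """Map a two-character pair to 1..256 under the asker's ordering."""
    return ALPHABET.index(first) * 16 + ALPHABET.index(second) + 1

# For every rejected first pair (value > 100), count which character
# sits in position 1, because that character starts the next window.
rejected_counts = {ch: 0 for ch in ALPHABET}
for c0, c1 in product(ALPHABET, repeat=2):
    if pair_value(c0, c1) > 100:
        rejected_counts[c1] += 1
total_rejected = sum(rejected_counts.values())

# Enumerate all equally likely triples (c0, c1, c2) and keep the value of
# the overlapping second window when the first window was rejected and
# the second one is accepted.
accepted = [
    pair_value(c1, c2)
    for c0, c1, c2 in product(ALPHABET, repeat=3)
    if pair_value(c0, c1) > 100 and pair_value(c1, c2) <= 100
]
mean_second_window = Fraction(sum(accepted), len(accepted))

print(total_rejected, rejected_counts)
print(float(mean_second_window))
```

Under these assumptions the enumeration gives 156 rejected pairs and a conditional mean of about 51.7 for numbers accepted from the second window, compared with 50.5 for an unbiased draw. This only covers the second window; later windows are biased in a similar way.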
To obtain an unbiased sample, you should not reuse any characters. A SHA-256 hex string has 64 characters, which gives 32 non-overlapping pairs. If you used your same scheme but didn't reuse any characters, you would almost always generate an unbiased sample, but with probability (156/256)^32 (about 1.3e-7, or roughly 1 in 8 million) you would run out of characters before accepting. To reduce your chance of running out of characters, you can accept numbers 1 to 200 (but then divide by 2, rounding up). This reduces your chance of running out of characters to (56/256)^32 (about 7.56e-22). You can improve on this by considering blocks of 8 characters (32 bits) or 16 characters (64 bits) at a time.
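As a rough illustration of the non-overlapping variant, here is a Python sketch that keeps the asker's character ordering, steps through the hash two characters at a time, and accepts values 1 to 200 before halving. The function names and the input string are placeholders, not part of the original project.

```python
import hashlib

# The asker's character order: letters first, then digits (aa=1 ... 99=256).
ALPHABET = "abcdef0123456789"

def pair_value(pair):
    """Map a two-character pair to 1..256 under the asker's ordering."""
    return ALPHABET.index(pair[0]) * 16 + ALPHABET.index(pair[1]) + 1

def number_from_hex(hex_digest):
    """Return a number in 1..100, or None if every pair is rejected."""
    # Step by 2 so that no character is ever reused.
    for i in range(0, len(hex_digest) - 1, 2):
        value = pair_value(hex_digest[i:i + 2])
        if value <= 200:
            return (value + 1) // 2  # divide by 2, rounding up
    return None

# Chance that all 32 pairs of a 64-character digest are rejected.
p_fail_accept_100 = (156 / 256) ** 32  # accepting only 1..100
p_fail_accept_200 = (56 / 256) ** 32   # accepting 1..200, then halving

# Placeholder input; the real project hashes three strings.
hex_digest = hashlib.sha256(b"example input").hexdigest()
print(number_from_hex(hex_digest), p_fail_accept_100, p_fail_accept_200)
```

The caller still has to decide what to do in the extremely unlikely None case, for example by hashing again with a counter appended to the input.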
You can also simplify your method by using the usual hexadecimal convention and relying on an existing hex-string-to-integer conversion or, better yet, avoid the conversion completely, since the SHA256 hash is always computed as a sequence of bytes and only then converted to a hex string.
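One possible shape for the byte-based approach, sketched in Python: take the raw digest, read it in 4-byte (32-bit) blocks, and reject only the few block values at the top of the range that would make the reduction modulo 100 uneven. The function name number_from_digest, its block_size parameter, and the three input strings are illustrative assumptions.

```python
import hashlib

def number_from_digest(digest, block_size=4):
    """Return a number in 1..100 from raw digest bytes, or None."""
    span = 256 ** block_size   # number of possible block values
    limit = span - span % 100  # largest multiple of 100 that fits
    # Walk the digest in non-overlapping blocks.
    for i in range(0, len(digest) - block_size + 1, block_size):
        x = int.from_bytes(digest[i:i + block_size], "big")
        if x < limit:
            # Every residue 0..99 is equally likely below the limit.
            return x % 100 + 1
    return None

# Placeholder strings standing in for the three inputs from the question.
editable, visible, hidden = b"editable", b"visible", b"hidden"
digest = hashlib.sha256(editable + visible + hidden).digest()
print(number_from_digest(digest))
```

With 4-byte blocks only 96 of the 2^32 possible values are rejected per block, so running out of blocks is far less likely than with two-character pairs.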