Identifying hash algorithms
The Endeavour 2024-09-30
Given a hash value, an you determine what algorithm produced it? Or what algorithm probably produced it?
Obviously if a hash value is 128 bits long, then a 128-bit algorithm produced it. Such a hash value might have been produced by MD5, but not by SHA-1, because the former produces 128-bit hashes and the latter a 160-bit hash.
The more interesting question is this: given an n-bit hash, can you tell which n-bit hash algorithm it probably came from? You will find contradictory answers online. Some people confidently say no, that a hash value is for practical purposes a random selection from its range. Others say yes, and have point to software that reports which algorithm the hash probably came from. Which is it?
Some hashing software has a custom output format, such as sticking a colon at a fixed position inside the hash value. Software such as hashid
uses regular expressions to spot these custom formats.
But assuming you have a hash value that is simply a number, you cannot know which hash algorithm produced it, other than narrowing it to a list of known algorithms that produce a number of that length. If you could know, this would indicate a flaw in the hashing algorithm.
So, for example, a 160-bit hash value could come from SHA-1, or it could come from RIPEMD-160, Haval-160, Tiger-160, or any other hash function that produces 160-bit output.
To say which algorithm probably produced the hash, you need context, some sort of modeling assumption. In general SHA-1 is the most popular 160-bit hash algorithm, so if you have nothing else to go on, your best guess for the origin of a 160-bit hash value would be SHA-1. But RIPEMD is part of the Bitcoin standard, and so if you find a 160-bit hash value in the context of Bitcoin, it’s more likely to be RIPEMD. There must be contexts in which Haval-160 and Tiger-160 are more likely, though I don’t know what those contexts would be.
Barring telltale formatting, software that tells you which hash functions most likely produced a hash value is simply reporting the most popular hash functions for the given length.
For example, I produced a 160-bit hash of “hello world” using RIMEMD-160
echo -n "hello world" | openssl dgst -rmd160
then asked hashid
where it came from.
hashid '98c615784ccb5fe5936fbc0cbe9dfdb408d92f0f' Analyzing '98c615784ccb5fe5936fbc0cbe9dfdb408d92f0f' [+] SHA-1 [+] Double SHA-1 [+] RIPEMD-160 [+] Haval-160 [+] Tiger-160 [+] HAS-160 [+] LinkedIn [+] Skein-256(160) [+] Skein-512(160)
I got exactly the same output when I hashed “Gran Torino” and “Henry V” because the output doesn’t depend on the hashes per se, only their length.
Whether software can tell you where a hash probably came from depends on your notion of “probably.” If you find a 160-bit hash in the wild, it’s more likely to have come from SHA-1 than RIPEMD. But if you were to write a program to generate random text, hash it with either SHA-1 or RIPEMD, it would likely fail badly.