Data Encoding for Training in Neural Network

2019-09-22 12:35发布

问题:

I have converted 349,900 words from a dictionary file to md5 hash. Sample are below:

74b87337454200d4d33f80c4663dc5e5
594f803b380a41396ed63dca39503542
0b4e7a0e5fe84ad35fb5f95b9ceeac79
5d793fc5b00a2348c3fb9ab59e5ca98a
3dbe00a167653a1aaee01d93e77e730e
ffc32e9606a34d09fca5d82e3448f71f
2fa9f0700f68f32d2d520302906e65ce
1c9b32ff1b53bd892b87578a11cbd333
26a10043bba821303408ebce568a2746
c3c32ff3481e9745e10defa7ce5b511e 

I want to train a neural network to decrypt a hash using just simple architecture like MultiLayer Perceptron. Since all hash value is of length 32, I was thingking that the number of input nodes is 32, but the problem here is the number of output nodes. Since the output are words in the dictionary, it doesn't have any specific length. It could be of various length. That is the reason why Im confused on how many number of output nodes shall I have.

How will I encode my data, so that I can have specific number of output nodes?

I have found a paper here in this link that actually decrypt a hash using neural network. The paper said

The input to the neural network is the encrypted text that is to be decoded. This is fed into the neural network either in bipolar or binary format. This then traverses through the hidden layer to the final output layer which is also in the bipolar or binary format (as given in the input). This is then converted back to the plain text for further process.

How will I implement what is being said in the paper. I am thinking to limit the number of characters to decrypt. Initially , I can limit it up to 4 characters only(just for test purposes).

My input nodes will be 32 nodes representing every character of the hash. Each input node will have the (ASCII value of the each_hash_character/256). My output node will have 32 nodes also representing binary format. Since 8 bits/8 nodes represent one character, my network will have the capability of decrypting characters up to 4 characters only because (32/8) = 4. (I can increase it if I want to. ) Im planning to use 33 nodes. Is my network architecture feasible? 32 x 33 x 32? If no, why? Please guide me.

回答1:

You could map the word in the dictionary in a vectorial space (e.g. bag of words, word2vec,..). In that case the words are encoded with a fix length. The number of neurons in the output layer will match that length.



回答2:

There's a great discussion about the possibility of cracking SHA256 hashes using neural networks in another Stack Exchange forum: https://security.stackexchange.com/questions/135211/can-a-neural-network-crack-hashing-algorithms

The accepted answer was that:

No.

Neural networks are pattern matchers. They're very good pattern matchers, but pattern matchers just the same. No more advanced than the biological brains they are intended to mimic. More thorough, more tireless, but not more sophisticated.

The patterns have to be there to be found. There has to be a bias in the data to tease out. But cryptographic hashes are explicitly and extremely carefully designed to eliminate any bias in the output. No one bit is more likely than any other, no one output is more likely to correlate to any given input. If such a correlation were possible, the hash would be considered "broken" and a new algorithm would take its place.

Flaws in hash functions have been found before, but never with the aid of a neural network. Instead it's been with the careful application of certain mathematical principles.

The following answer also makes a funny comparison:

SHA256 has an output space of 2^256, and an input space that's essentially infinite. For reference, the time since the big bang is estimated to be 5 billion years, which is about 1.577 x 10^27 nanoseconds, which is about 2^90 ns. So assuming each training iteration takes 1 ns, you would need 2^166 ages of the universe to train your neural net.