I would like to use scrypt to create a hash for my users' passwords and salts. I have found two references, but there are things I don't understand about them.
They use the scrypt encrypt and decrypt functions. One encrypts a random string and the other encrypts the salt (which looks wrong since only the password and not the salt is used for decryption). It looks like the decrypt function is being used to validate the password/salt as a side effect of the decryption.
Based on the little I understand, what I want is a key derivation function (KDF) rather than encryption/decryption, and the KDF is likely generated and used internally by scrypt for encryption/decryption. The actual KDF is used behind the scenes and I am concerned that blindly following these examples will lead to a mistake. If the scrypt encrypt/decrypt functions are used to generate and verify the password, I don't understand the role of the string being encrypted. Does its content or length matter?
You're correct - the scrypt functions those two links are playing with belong to the scrypt file encryption utility, not the underlying KDF. I've been slowly working on creating a standalone scrypt-based password hash for Python, and ran into this issue myself.
The scrypt file utility does the following: picks scrypt's n/r/p parameters specific to your system & the "min time" parameter. It then generates a 32 byte salt, and then calls
scrypt(n,r,p,salt,pwd)
to create a 64-byte key. The binary string the tool returns is composed of: 1) a header containing the n, r, p values and the salt, encoded in binary; 2) a sha256 checksum of the header; and 3) an hmac-sha256 signed copy of the checksum, using the first 32 bytes of the key. Following that, it uses the remaining 32 bytes of the key to AES encrypt the input data.
There are a couple of implications of this that I can see:
1. The input data is meaningless, since it doesn't actually affect the salt being used, and encrypt() generates a new salt each time.
2. You can't configure the n,r,p workload manually, or in any other way than through the min-time parameter. This isn't insecure, but it is a rather awkward way to control the work factor.
3. After the decrypt call regenerates the key and compares it against the hmac, it will reject everything right there if your password is wrong - but if it's right, it'll proceed to also decrypt the data package. This is a lot of extra work the attacker won't have to perform - they don't even have to derive 64 bytes, just the 32 needed to check the signature. This issue doesn't make it insecure exactly, but doing work your attacker doesn't is never desirable.
4. There is no way to configure the salt size, derived key size, etc. The current values aren't that bad, but still, it's not ideal.
5. The decrypt utility's "max time" limitation is wrong for password hashing - each time decrypt is called, it estimates your system's speed, and does some "guessing" as to whether it can calculate the key within max time - which is more overhead your attacker doesn't have to do (see #3), but it also means decrypt could start rejecting passwords under heavy system load.
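The flow described above can be sketched roughly as follows. This is only an illustration of that description, not the tool's real byte layout: it assumes Python's standard hashlib.scrypt (Python 3.6+) as a stand-in for the private KDF, and make_header, check_password and the packed layout are hypothetical names and choices made for this sketch. It shows why the encrypted payload plays no part in checking the password: only the header and the first half of the key are involved.

```python
import hashlib
import hmac
import os
import struct

PARAM_FMT = ">BII"  # logN, r, p (illustrative layout, not the real file format)
PARAM_LEN = struct.calcsize(PARAM_FMT) + 32  # packed parameters + 32-byte salt


def derive_key(password, salt, log_n, r, p):
    """scrypt(n, r, p, salt, pwd) -> 64-byte key."""
    n = 1 << log_n
    # Allow slightly more than the roughly 128 * n * r bytes scrypt needs.
    maxmem = 128 * r * (n + p + 2) + (1 << 20)
    return hashlib.scrypt(password, salt=salt, n=n, r=r, p=p,
                          maxmem=maxmem, dklen=64)


def make_header(password, log_n=14, r=8, p=1):
    """Build parameters + salt, their sha256 checksum, and an HMAC of it."""
    salt = os.urandom(32)
    key = derive_key(password, salt, log_n, r, p)
    params = struct.pack(PARAM_FMT, log_n, r, p) + salt
    checksum = hashlib.sha256(params).digest()
    # Assumption taken from the description: the first 32 key bytes sign
    # the header; the remaining 32 would be the AES key for the payload.
    signature = hmac.new(key[:32], checksum, hashlib.sha256).digest()
    return params + checksum + signature


def check_password(header, password):
    """Validate a password against the header alone."""
    params = header[:PARAM_LEN]
    checksum = header[PARAM_LEN:PARAM_LEN + 32]
    signature = header[PARAM_LEN + 32:PARAM_LEN + 64]
    if not hmac.compare_digest(hashlib.sha256(params).digest(), checksum):
        return False  # corrupted header
    log_n, r, p = struct.unpack(PARAM_FMT, params[:struct.calcsize(PARAM_FMT)])
    salt = params[struct.calcsize(PARAM_FMT):]
    key = derive_key(password, salt, log_n, r, p)
    expected = hmac.new(key[:32], checksum, hashlib.sha256).digest()
    return hmac.compare_digest(expected, signature)
```

Calling check_password(make_header(b"secret"), b"secret") goes through the same derive-and-compare steps a decrypt call would, minus the AES work on the data.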
I'm not sure why Colin Percival didn't make the kdf & parameter-choosing code part of the public API, but it's in fact explicitly marked "private" inside the source code - not even exported for linking. This makes me hesitant to just access it straight without a lot more study.
All in all, what is needed is a nice hash format that can store scrypt, and an implementation that exposes the underlying kdf and parameter-choosing algorithm. I'm currently working on this myself for passlib, but it hasn't seen much attention :(
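As a rough idea of what such a format could look like, here is a minimal sketch. It assumes Python's standard hashlib.scrypt (Python 3.6+) in place of the private KDF, and the "$scrypt$ln=...,r=...,p=...$salt$digest" layout, hash_password and verify_password are hypothetical choices for illustration, not an existing library's API.

```python
import base64
import hashlib
import hmac
import os


def _scrypt(password, salt, log_n, r, p, dklen):
    n = 1 << log_n
    # Allow slightly more than the roughly 128 * n * r bytes scrypt needs.
    maxmem = 128 * r * (n + p + 2) + (1 << 20)
    return hashlib.scrypt(password.encode("utf-8"), salt=salt, n=n, r=r, p=p,
                          maxmem=maxmem, dklen=dklen)


def _b64(raw):
    return base64.b64encode(raw).decode("ascii")


def hash_password(password, log_n=14, r=8, p=1, salt_len=16, dklen=32):
    """Return a self-describing string holding parameters, salt and digest."""
    salt = os.urandom(salt_len)
    digest = _scrypt(password, salt, log_n, r, p, dklen)
    return "$scrypt$ln={},r={},p={}${}${}".format(
        log_n, r, p, _b64(salt), _b64(digest))


def verify_password(password, stored):
    """Recompute the digest with the stored parameters and compare."""
    _, scheme, params, salt_b64, digest_b64 = stored.split("$")
    if scheme != "scrypt":
        raise ValueError("not an scrypt hash")
    opts = dict(item.split("=") for item in params.split(","))
    salt = base64.b64decode(salt_b64)
    expected = base64.b64decode(digest_b64)
    actual = _scrypt(password, salt, int(opts["ln"]), int(opts["r"]),
                     int(opts["p"]), len(expected))
    # Constant-time comparison, so timing does not leak matching prefixes.
    return hmac.compare_digest(actual, expected)
```

Because the work factor, salt and key size all travel inside the string, they can be changed for new hashes without breaking verification of old ones.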
Just to bottom line things though - those sites' instructions are 'ok', I'd just use an empty string as the file content, and be aware of the extra overhead and issues.
Both of those references got it completely wrong. Don't muck with encrypt and decrypt: just use hash.
The KDF is not directly exposed, but hash is close enough. (In fact, it appears to me to be even better, because it mixes the filling of a PBKDF2 sandwich.)
This example code works with both python2.7 and python3.2. It uses PyCrypto, passlib, and py-scrypt, but only needs py-scrypt.
You will want to use a constant-time comparison function like passlib.utils.consteq to mitigate timing attacks.
You will also want to choose the parameters carefully. The defaults logN=14,r=8,p=1 mean 1 "round" using 16 MiB of memory. On a server, you probably want something more like logN=10,r=8,p=8 -- less RAM, more CPU. You should time it on your hardware under your expected load.
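To compare candidate parameters, something along these lines may help. It is a minimal sketch that assumes Python's standard hashlib.scrypt (Python 3.6+) rather than py-scrypt, and scrypt_memory_bytes and time_scrypt are hypothetical helper names. The memory figure uses the usual approximation of 128 * N * r bytes, which is where the 16 MiB for logN=14,r=8 comes from.

```python
import hashlib
import time


def scrypt_memory_bytes(log_n, r):
    """Approximate RAM for one scrypt call: 128 * N * r bytes."""
    return 128 * (1 << log_n) * r


def time_scrypt(log_n, r, p, repeats=3):
    """Return the best wall-clock time in seconds over a few runs."""
    n = 1 << log_n
    # Allow slightly more than the roughly 128 * n * r bytes scrypt needs.
    maxmem = 128 * r * (n + p + 2) + (1 << 20)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        hashlib.scrypt(b"benchmark password", salt=b"0123456789abcdef",
                       n=n, r=r, p=p, maxmem=maxmem, dklen=64)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    for log_n, r, p in [(14, 8, 1), (10, 8, 8)]:
        mib = scrypt_memory_bytes(log_n, r) / (1024.0 * 1024.0)
        seconds = time_scrypt(log_n, r, p)
        print("logN={} r={} p={}: ~{:.0f} MiB, {:.3f} s".format(
            log_n, r, p, mib, seconds))
```

Timings from an idle development machine will be optimistic, so it is worth running this on the server itself while it is under a realistic load.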