OK, so my issue is that I have the string '\222\222\223\225', which is stored as Latin-1 in the DB. What I get from Django (by printing it) is the string 'ââââ¢', which I assume is the UTF conversion of it. Now I need to pass the string into a function that does this operation:
strdecryptedPassword + chr(ord(c) - 3 - intCounter - 30)
I get this error:
chr() arg not in range(256)
If I try to encode the string as latin-1 first I get this error:
'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256)
I have read a bunch on how character encoding works, and there is something I am missing because I just don't get it!
Well, it's because it's been encrypted with some terrible scheme that just changes the ord() of each character by some offset, so the string coming out of the database has been encrypted and this decrypts it. What you supplied above does not seem to work. In the database it is Latin-1, and Django converts it to unicode. I cannot pass it to the function as unicode, but when I try to encode it to Latin-1 I see that error.
Your first error, 'chr() arg not in range(256)', probably means you have underflowed the value, because chr() cannot take negative numbers. I don't know what the encryption algorithm is supposed to do when intCounter + 33 is greater than the character's ord() value; you'll have to check what to do in that case.
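To make the underflow concrete, here is a minimal sketch. The helper names shifted_code and try_chr are hypothetical, and the expression is the one quoted in the question. It is written for Python 3, where chr() accepts values up to 0x10FFFF, so only the negative case is rejected; on Python 2 anything above 255 is rejected as well.

```python
def shifted_code(c, int_counter):
    """Value the question's expression would pass to chr()."""
    return ord(c) - 3 - int_counter - 30


def try_chr(value):
    """Return chr(value), or None if chr() rejects the value."""
    try:
        return chr(value)
    except ValueError:
        return None


# ord("A") is 65, so with a counter of 0 the shifted value is 32 (a space).
print(repr(try_chr(shifted_code("A", 0))))

# With a counter of 40 the shifted value is 65 - 33 - 40 = -8: chr() refuses it.
print(shifted_code("A", 40), try_chr(shifted_code("A", 40)))
```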
About the second error: you must decode(), not encode(), a regular string object to get a proper representation of your data. encode() takes a unicode object (those starting with u') and generates a regular string to be output or written to a file. decode() takes a string object and generates a unicode object with the corresponding code points. This is what the unicode() call does when given a string object; you could also call a.decode('latin-1') instead.
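The direction of the two calls can be sketched as follows. This uses Python 3 spelling, where the byte string is bytes and the unicode object is str; on Python 2 the same calls apply to str and unicode. The contents of suspect are an assumption: one set of code points above 255 that would reproduce the reported encode error, not necessarily what Django actually returned.

```python
raw = b"\222\222\223\225"             # the four Latin-1 bytes from the question

text = raw.decode("latin-1")          # bytes -> text: one code point per byte
print([hex(ord(ch)) for ch in text])

assert text.encode("latin-1") == raw  # text -> bytes: the round trip is lossless

# Assumption: the string holds code points above 255 (here U+2019, U+201C
# and U+2022), which Latin-1 cannot represent.
suspect = "\u2019\u2019\u201c\u2022"
try:
    suspect.encode("latin-1")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True
print(encode_failed)
```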
As Vinko notes, Latin-1 or ISO 8859-1 doesn't have printable characters for the octal string you quote. According to my notes for 8859-1, "C1 Controls (0x80 - 0x9F) are from ISO/IEC 6429:1992. It does not define names for 80, 81, or 99". The code point names are as Vinko lists them:
The correct UTF-8 encoding of those is (Unicode, binary, hex):
The LATIN SMALL LETTER A WITH CIRCUMFLEX is ISO 8859-1 code 0xE2 and hence Unicode U+00E2; in UTF-8, that is %11000011 %10100010 or 0xC3 0xA2.
The CENT SIGN is ISO 8859-1 code 0xA2 and hence Unicode U+00A2; in UTF-8, that is %11000010 %10100010 or 0xC2 0xA2.
So, whatever else you are seeing, you do not seem to be seeing a UTF-8 encoding of ISO 8859-1. All else apart, you are seeing but 5 bytes where you would have to see 8.
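The byte counts are easy to check. The sketch below (Python 3, variable names are illustrative) decodes the four bytes from the question as ISO 8859-1 and re-encodes them as UTF-8, then does the same for the two printable characters discussed above.

```python
import unicodedata

latin1_bytes = b"\222\222\223\225"

# Decode as ISO 8859-1, then re-encode the resulting code points as UTF-8.
as_utf8 = latin1_bytes.decode("latin-1").encode("utf-8")
print(" ".join("%02x" % b for b in as_utf8))
print(len(as_utf8))  # each code point in U+0080..U+00FF takes two bytes

# The two printable characters, with their Unicode names and UTF-8 bytes.
for ch in "\u00e2\u00a2":
    encoded = ch.encode("utf-8")
    print(unicodedata.name(ch), " ".join("%02x" % b for b in encoded))
```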
Added: The previous part of the answer addresses the 'UTF-8 encoding' claim, but ignores the rest of the question, which says that 'chr(ord(c) - 3 - intCounter - 30)' fails with 'chr() arg not in range(256)'.
You don't actually show us how intCounter is defined, but if it increments gently per character, sooner or later 'ord(c) - 3 - intCounter - 30' is going to be negative (and, by the way, why not combine the constants and use 'ord(c) - intCounter - 33'?), at which point chr() is likely to complain. You would need to add 256 if the value is negative, or use a modulus operation to ensure you have a value between 0 and 255 to pass to chr(). Since we can't see how intCounter is incremented, we can't tell if it cycles from 0 to 255 or whether it increases monotonically. If the latter, then you need an expression such as 'chr(mod(ord(c) - mod(intCounter, 256) + 479, 256))', where 256 - 33 = 223, of course, and 479 = 256 + 223. This guarantees that the value passed to chr() is in the range 0..255 for any input character c in that range and any non-negative value of intCounter (and, because the mod() function never gets a negative argument, it also works regardless of how mod() behaves when its arguments are negative).
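In Python the wrap-around can be sketched with the % operator. The function name decrypt_char is hypothetical, and the sketch assumes that ord(c) is in 0..255, that intCounter is non-negative, and that wrapping modulo 256 is what the original scheme intends, none of which the question confirms.

```python
def decrypt_char(c, int_counter):
    """Shift c down by int_counter + 33, wrapping the result into 0..255.

    479 = 256 + 223 and 223 = 256 - 33, so adding 479 is the same as
    subtracting 33 modulo 256 while keeping the intermediate value positive.
    """
    return chr((ord(c) - (int_counter % 256) + 479) % 256)


# Exhaustive check over every byte value and a range of counters.
for code in range(256):
    for counter in range(1024):
        result = ord(decrypt_char(chr(code), counter))
        assert 0 <= result <= 255
        assert result == (code - counter - 33) % 256
```

How intCounter advances between characters is still up to the original scheme; this only covers the per-character arithmetic.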