decoding algorithm wanted

2020-04-11 11:15发布

I receive encoded PDF files regularly. The encoding works like this:

  • the PDFs can be displayed correctly in Acrobat Reader
  • select all and copy the test via Acrobat Reader
  • and paste in a text editor
  • will show that the content are encoded

so, examples are:

13579 -> 3579;
hello -> jgnnq

it's basically an offset (maybe swap) of ASCII characters.

The question is how can I find the offset automatically when I have access to only a few samples. I cannot be sure whether the encoding offset is changed. All I know is some text will usually (if not always) show up, e.g. "Name:", "Summary:", "Total:", inside the PDF.

Thank you!

edit: thanks for the feedback. I'd try to break the question into smaller questions:

Part 1: How to detect identical part(s) inside string?

5条回答
贼婆χ
2楼-- · 2020-04-11 11:27

Do the encoded files open correctly in PDF readers other than Acrobat Reader? If so, you could just use a PDF library (e.g. PDF Clown) and use it to programmatically extract the text you need.

查看更多
3楼-- · 2020-04-11 11:33

You need to brute-force it.

If those patterns are simple like +2 character code like in your examples (which is +2 char codes)

h i j
e f g
l m n
l m n
o p q

1 2 3
3 4 5
5 6 7
7 8 9
9 : ;

You could easily implement like this to check against knowns words

>>> text='jgnnq'
>>> knowns=['hello', '13579']
>>>
>>> for i in range(-5,+5): #check -5 to +5 char code range
...     rot=''.join(chr(ord(j)+i) for j in text)
...     for x in knowns:
...         if x in rot:
...             print rot
...
hello
查看更多
4楼-- · 2020-04-11 11:42

It's only possible then you have a lot of examples (examples count stops then: possible to get all the combinations or just an linear values dependency or idea of the scenario).

also this question : How would I reverse engineer a cryptographic algorithm? have some advices.

查看更多
Summer. ? 凉城
5楼-- · 2020-04-11 11:46

Hmmm, thats a tough one.

The only thing I can suggest is using a dictionary (along with some substitution cipher algorithms) may help in decoding some of the text.

But I cannot see a solution that will decode everything for you with the scenario you describe.

Why don't you paste some sample input and we can have ago at decoding it.

查看更多
▲ chillily
6楼-- · 2020-04-11 11:51

Is the PDF going to contain symbolic (like math or proofs) or natural language text (English, French, etc)?

If the latter, you can use a frequency chart for letters (digraphs, trigraphs and a small dictionary of words if you want to go the distance). I think there are probably a few of these online. Here's a start. And more specifically letter frequencies.

Then, if you're sure it's a Caesar shift, you can grab the first 1000 characters or so and shift them forward by increasing amounts up to (I would guess) 127 or so. Take the resulting texts and calculate how close the frequencies match the average ones you found above. Here is information on that.

The linked letter frequencies page on Wikipedia shows only letters, so you may want to exclude them in your calculation, or better find a chart with them in it. You may also want to transform the entire resulting text into lowercase or uppercase (your preference) to treat letters the same regardless of case.

Edit - saw comment about character swapping

In this case, it's a substitution cipher, which can still be broken automatically, though this time you will probably want to have a digraph chart handy to do extra analysis. This is useful because there will quite possibly be a substitution that is "closer" to average language in terms of letter analysis than the correct one, but comparing digraph frequencies will let you rule it out.

Also, I suggested shifting the characters, then seeing how close the frequencies matched the average language frequencies. You can actually just calculate the frequencies in your ciphertext first, then try to line them up with the good values. I'm not sure which is better.

查看更多
登录 后发表回答