decoding algorithm wanted

I receive encoded PDF files regularly. The encoding works like this:

the PDFs can be displayed correctly in Acrobat Reader
select all and copy the test via Acrobat Reader
and paste in a text editor
will show that the content are encoded

so, examples are:

13579 -> 3579;
hello -> jgnnq

it's basically an offset (maybe swap) of ASCII characters.

The question is how can I find the offset automatically when I have access to only a few samples. I cannot be sure whether the encoding offset is changed. All I know is some text will usually (if not always) show up, e.g. "Name:", "Summary:", "Total:", inside the PDF.

Thank you!

edit: thanks for the feedback. I'd try to break the question into smaller questions:

Part 1: How to detect identical part(s) inside string?

标签： algorithm encryption decode

5条回答

贼婆χ

2楼-- · 2020-04-11 11:27

Do the encoded files open correctly in PDF readers other than Acrobat Reader? If so, you could just use a PDF library (e.g. PDF Clown) and use it to programmatically extract the text you need.

0人赞添加讨论(0) 举报

傲

3楼-- · 2020-04-11 11:33

You need to brute-force it.

If those patterns are simple like +2 character code like in your examples (which is +2 char codes)

h i j
e f g
l m n
l m n
o p q

1 2 3
3 4 5
5 6 7
7 8 9
9 : ;

You could easily implement like this to check against knowns words

>>> text='jgnnq'
>>> knowns=['hello', '13579']
>>>
>>> for i in range(-5,+5): #check -5 to +5 char code range
...     rot=''.join(chr(ord(j)+i) for j in text)
...     for x in knowns:
...         if x in rot:
...             print rot
...
hello

0人赞添加讨论(0) 举报

傲

4楼-- · 2020-04-11 11:42

It's only possible then you have a lot of examples (examples count stops then: possible to get all the combinations or just an linear values dependency or idea of the scenario).

also this question : How would I reverse engineer a cryptographic algorithm? have some advices.

0人赞添加讨论(0) 举报

Summer. ? 凉城

5楼-- · 2020-04-11 11:46

Hmmm, thats a tough one.

The only thing I can suggest is using a dictionary (along with some substitution cipher algorithms) may help in decoding some of the text.

But I cannot see a solution that will decode everything for you with the scenario you describe.

Why don't you paste some sample input and we can have ago at decoding it.

0人赞添加讨论(0) 举报

▲ chillily

6楼-- · 2020-04-11 11:51

Is the PDF going to contain symbolic (like math or proofs) or natural language text (English, French, etc)?

If the latter, you can use a frequency chart for letters (digraphs, trigraphs and a small dictionary of words if you want to go the distance). I think there are probably a few of these online. Here's a start. And more specifically letter frequencies.

Then, if you're sure it's a Caesar shift, you can grab the first 1000 characters or so and shift them forward by increasing amounts up to (I would guess) 127 or so. Take the resulting texts and calculate how close the frequencies match the average ones you found above. Here is information on that.

The linked letter frequencies page on Wikipedia shows only letters, so you may want to exclude them in your calculation, or better find a chart with them in it. You may also want to transform the entire resulting text into lowercase or uppercase (your preference) to treat letters the same regardless of case.

Edit - saw comment about character swapping

In this case, it's a substitution cipher, which can still be broken automatically, though this time you will probably want to have a digraph chart handy to do extra analysis. This is useful because there will quite possibly be a substitution that is "closer" to average language in terms of letter analysis than the correct one, but comparing digraph frequencies will let you rule it out.

Also, I suggested shifting the characters, then seeing how close the frequencies matched the average language frequencies. You can actually just calculate the frequencies in your ciphertext first, then try to line them up with the good values. I'm not sure which is better.

0人赞添加讨论(0) 举报

decoding algorithm wanted

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间