I have several large files (3-6 GB) of ASCII 1 and 0 characters and I would like to convert them to plain binary files. Newlines are not important and should be discarded.
test.bin below is 568 bytes; I would like a file of 560 bits (70 bytes).
0111000110000000101000100000100100011111010010101000001001010000111000
1001100011010100001101110000100010000010000000000001011000010011111100
0100001000010000010000010111011101011111000111111000111001100010100011
0011101000100001111111000001111110111111101101100000011000010101100001
0000000110110001000000000001000011110100000101101000001000010001010011
1101101111010101011110001110000010011001100101101101000111111101110101
1000001100101101010111110111110101100000000011001000100000000011001110
0101101001110010011110000100101001001111010011100100001001111111100110
...
I've found several solutions going the other way, converting a binary file into ASCII, but none in this direction.
Ideally I'm looking for a simple Linux / bash solution, but I could live with a Python solution.
=================== Edit ==================
To make this less confusing, consider converting any two ASCII characters into a binary file.
test_XY_encoded.txt
XYYYXXXYYXXXXXXXYXYXXXYXXXXXYXXYXXXYYYYYXYXXYXYXYXXXXXYXXYXYXXXXYYYXXX
YXXYYXXXYYXYXYXXXXYYXYYYXXXXYXXXYXXXXXYXXXXXXXXXXXXYXYYXXXXYXXYYYYYYXX
XYXXXXYXXXXYXXXXXYXXXXXYXYYYXYYYXYXYYYYYXXXYYYYYYXXXYYYXXYYXXXYXYXXXYY
XXYYYXYXXXYXXXXYYYYYYYXXXXXYYYYYYXYYYYYYYXYYXYYXXXXXXYYXXXXYXYXYYXXXXY
XXXXXXXYYXYYXXXYXXXXXXXXXXXYXXXXYYYYXYXXXXXYXYYXYXXXXXYXXXXYXXXYXYXXYY
YYXYYXYYYYXYXYXYXYYYYXXXYYYXXXXXYXXYYXXYYXXYXYYXYYXYXXXYYYYYYYXYYYXYXY
YXXXXXYYXXYXYYXYXYXYYYYYXYYYYYXYXYYXXXXXXXXXYYXXYXXXYXXXXXXXXXYYXXYYYX
XYXYYXYXXYYYXXYXXYYYYXXXXYXXYXYXXYXXYYYYXYXXYYYXXYXXXXYXXYYYYYYYYXXYYX
Where X represents the binary 0 and Y represents the binary 1.
How about this bash command?
cat test.bin | tr -d '\n' | perl -pe '$_=pack"B*",$_' > true_binary.txt
tr deletes all newline characters, and perl's pack with the B* template packs the string of 0s and 1s into bytes, eight bits per byte, most significant bit first.
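For readers who don't use Perl, here is a rough Python sketch of what pack "B*" does with a string of bits: each group of eight characters becomes one byte, most significant bit first. The helper name pack_bits is made up for illustration, and the zero-padding of a trailing partial group reflects my understanding of Perl's behaviour rather than anything stated above.

```python
def pack_bits(bits: str) -> bytes:
    """Pack a string of '0'/'1' characters into bytes, most significant bit first."""
    # Assumption: a trailing partial group is padded with zeros on the right.
    padded = bits + "0" * (-len(bits) % 8)
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))


if __name__ == "__main__":
    # First 16 bits of test.bin
    print(pack_bits("0111000110000000").hex())  # prints 7180
```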
I don't know if this would solve the question, but how about this:
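with open('ascii.txt', 'r') as file_ascii, open('binary.txt', 'wb') as file_bin:
    bits = ''.join(file_ascii.read().split())  # drop newlines and other whitespace
    # Pack 8 bits per byte, most significant bit first (assumes a multiple of 8 bits).
    file_bin.write(int(bits, 2).to_bytes(len(bits) // 8, 'big'))
Or, to overwrite the file:
with open('ascii.txt', 'r') as f:
    bits = ''.join(f.read().split())
binary = int(bits, 2).to_bytes(len(bits) // 8, 'big')
with open('ascii.txt', 'wb') as f:
    f.write(binary)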
Short, but should do the trick.
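With 3-6 GB inputs, reading the whole file into memory may not be practical. Below is a minimal sketch of a chunked variant, assuming the input holds only 0, 1 and characters to discard. The names convert_stream and chunk_size are invented here, and padding a trailing partial byte with zero bits is an assumption, not something the question specifies.

```python
import io


def convert_stream(src, dst, chunk_size=1 << 20):
    """Copy ASCII '0'/'1' characters from src to dst as packed bytes.

    src and dst are binary file objects; anything that is not '0' or '1'
    (newlines, for instance) is dropped.
    """
    # Every byte value except ord('0') and ord('1') is deleted by translate().
    discard = bytes(b for b in range(256) if b not in b"01")
    pending = b""  # bits left over from the previous chunk (fewer than 8)
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        bits = pending + chunk.translate(None, discard)
        usable = len(bits) - len(bits) % 8
        if usable:
            dst.write(int(bits[:usable], 2).to_bytes(usable // 8, "big"))
        pending = bits[usable:]
    if pending:
        # Assumption: pad the final partial byte with zero bits on the right.
        dst.write(int(pending.ljust(8, b"0"), 2).to_bytes(1, "big"))


if __name__ == "__main__":
    out = io.BytesIO()
    # A tiny chunk size forces bits to carry over between reads.
    convert_stream(io.BytesIO(b"01110001\n1000\n0000\n"), out, chunk_size=5)
    print(out.getvalue().hex())  # prints 7180
```

For real files, the two arguments would be something like open('ascii.txt', 'rb') and open('binary.txt', 'wb').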
We could build a shell-only solution.
First, we transform the 1s and 0s into a stream of 8-character lines:
$ { cat test.bin | tr -cd '01' | fold -b8; echo; }
01110001
10000000
10100010
00001001
00011111
…
…
10011110
00010010
10010011
11010011
10010000
10011111
11100110
That's 560/8 = 70 lines, which should translate to 70 bytes.
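The tr -cd '01' | fold -b8 stage amounts to dropping everything except the two digits and cutting what is left into groups of eight. A small Python sketch of that step, where fold8 is a hypothetical helper name:

```python
def fold8(text: str) -> list:
    """Keep only '0' and '1', then split into 8-character groups."""
    bits = "".join(ch for ch in text if ch in "01")
    return [bits[i:i + 8] for i in range(0, len(bits), 8)]


if __name__ == "__main__":
    # Line breaks in the input do not have to fall on byte boundaries.
    print(fold8("0111000110\n000000101000\n10"))
    # 560 bits give 560 // 8 == 70 groups, one per output byte.
    print(len(fold8("01" * 280)))
```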
Note that these bytes are not all ASCII characters: values above decimal 127 (hex 7f) are outside ASCII. I am interpreting each one as a byte value (an unsigned number from 0 to 255).
Then we can read each line and translate it first to decimal with "$((2#$a))", so the shell understands it, then to a hex escape with printf '\\x%x', so that the final printf '%b' "…" can turn that escape into a byte:
$ { cat infile | tr -cd '01' | fold -b8; echo; } |
while read a; do printf '%b' "$(printf '\\x%x' "$((2#$a))")"; done
q�� J�P�cP�XO�!u���(Έ�큅a���OoU�f[G�X2���Ȁ3����Ӑ��
Of course, the characters printed are (most probably) an incorrect interpretation of the byte values in whatever locale the user has set. A hex dump may be more useful, depending on your needs:
$ { cat infile | tr -cd '01' | fold -b8; echo; } |
while read a; do printf '%b' "$(printf '\\x%x' "$((2#$a))")"; done |
od -vAn -tx1c
71 80 a2 09 1f 4a 82 50 e2 63 50 dc 22 08 00 58
q 200 242 \t 037 J 202 P 342 c P 334 " \b \0 X
4f c4 21 04 17 75 f1 f8 e6 28 ce 88 7f 07 ef ed
O 304 ! 004 027 u 361 370 346 ( 316 210 177 \a 357 355
81 85 61 01 b1 00 10 f4 16 82 11 4f 6f 55 e3 82
201 205 a 001 261 \0 020 364 026 202 021 O o U 343 202
66 5b 47 f7 58 32 d5 f7 d6 00 c8 80 33 96 9c 9e
f [ G 367 X 2 325 367 326 \0 310 200 3 226 234 236
12 93 d3 90 9f e6
022 223 323 220 237 346
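The per-line translation in the loop goes through three stages: binary text to an integer, the integer to a \xNN escape, and the escape to a raw byte. Here are the same stages in Python, as a sketch (line_to_byte is an illustrative name, not part of the shell answer), checked against the first four bytes of the dump:

```python
def line_to_byte(line: str) -> bytes:
    """Turn one 8-character line of '0'/'1' into a single raw byte."""
    value = int(line, 2)          # like "$((2#$a))": binary text -> integer
    escape = f"\\x{value:02x}"    # like printf '\\x%x' (zero-padded here)
    # like printf '%b': interpret the two hex digits of the escape as one byte
    return bytes.fromhex(escape[2:])


if __name__ == "__main__":
    lines = ["01110001", "10000000", "10100010", "00001001"]
    print(b"".join(line_to_byte(l) for l in lines).hex(" "))  # prints 71 80 a2 09
```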
Note that the same pipeline works for the file test_XY_encoded.txt once a tr 'XY' '01' stage is added at the front:
$ { cat infile | tr 'XY' '01' | tr -cd '01' | fold -b8; echo; } |
while read a; do printf '%b' "$(printf '\\x%x' "$((2#$a))")"; done |
od -vAn -tx1c
71 80 a2 09 1f 4a 82 50 e2 63 50 dc 22 08 00 58
q 200 242 \t 037 J 202 P 342 c P 334 " \b \0 X
4f c4 21 04 17 75 f1 f8 e6 28 ce 88 7f 07 ef ed
O 304 ! 004 027 u 361 370 346 ( 316 210 177 \a 357 355
81 85 61 01 b1 00 10 f4 16 82 11 4f 6f 55 e3 82
201 205 a 001 261 \0 020 364 026 202 021 O o U 343 202
66 5b 47 f7 58 32 d5 f7 d6 00 c8 80 33 96 9c 9e
f [ G 367 X 2 325 367 326 \0 310 200 3 226 234 236
12 93 d3 90 9f e6
022 223 323 220 237 346
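The tr 'XY' '01' step is a plain character substitution, so any two symbols can stand in for the bits. A Python sketch of the same mapping, where symbols_to_bits and its zero/one parameters are hypothetical names chosen for this example:

```python
def symbols_to_bits(text: str, zero: str = "X", one: str = "Y") -> str:
    """Map the two chosen symbols to '0' and '1', then drop everything else."""
    table = str.maketrans(zero + one, "01")
    # Mirrors tr 'XY' '01' | tr -cd '01': substitute first, then filter.
    return "".join(ch for ch in text.translate(table) if ch in "01")


if __name__ == "__main__":
    # First 16 symbols of test_XY_encoded.txt, split over two lines
    print(symbols_to_bits("XYYYXXXY\nYXXXXXXX"))  # prints 0111000110000000
```

The resulting string can then be packed into bytes the same way as the original 0/1 input.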