How to remove weird encoding from a txt file

Posted 2019-10-21 23:49

I would like to process text files like this one:

http://www.sec.gov/Archives/edgar/data/789019/000119312514289961/0001193125-14-289961.txt

If you scroll through the file, you will see sections like the following:

</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>EXCEL
<SEQUENCE>21
<FILENAME>Financial_Report.xlsx
<DESCRIPTION>IDEA: XBRL DOCUMENT
<TEXT>
begin 644 Financial_Report.xlsx
M4$L#!!0`!@`(````(0!):[_C#0,``+!)```3``@"6T-O;G1E;G1?5'EP97-=
M+GAM;""B!`(HH``"````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M``````````````````````````````````````#,W,M.VT`4QO%]I;Z#Y6V5
M>([OK@@L>EFV2*4/,+4GQ,(W>08*;]^)N0BA%(2*U/^&B,2>\\6+G[+YSM')
M==\%5V:V[3AL0EFK,#!#/3;M<+X)?YY]795A8)T>&MV-@]F$-\:&)\?OWQV=
MW4S&!O[NP6["G7/3QRBR]<[TVJ['R0S^D^TX]]KY?^?S:-+UA3XW4:Q4'M7C
MX,S@5FY_1GA\]-EL]67G@B_7_NW;)+/I;!A\NKUP/VL3ZFGJVEH[GS2Z&IHG
M4U9W$];^SN4:NVLG^\''"*.#$_:?_'W`W7W?_:.9V\8$IWIVWW3O8T377?1[
MG"]^C>/%^OE##J0<M]NV-LU87_;^":SM-!O=V)TQKN_6R^NZU^UPG_N9^<O%
M-EI>Y(V#[+_?<O`K<\20'`DD1PK)D4%RY)`<!21'"<E107*(H@2AB"H44H5B
MJE!0%8JJ0F%5**X*!5:AR!I39(TILL8466.*K#%%UI@B:TR1-:;(&E-DC2FR
M)A19$XJL"476A")K0I$UH<B:4&1-*+(F%%D3BJPI1=:4(FM*D36ER)I29$TI
MLJ8465.*K"E%UI0B:T:1-:/(FE%DS2BR9A19,XJL&476C")K1I$UH\B:4V3-
M*;+F%%ESBJPY1=:<(FM.D36GR)I39,TILA8460N*K`5%UH(B:T&1M:#(6E!D
M+2BR%A19"XJL)476DB)K29&UI,A:4F0M*;*6%%E+BJPE1=:2(FM%D;6BR%I1
M9*THLE8462N*K!5%UHHB:T61M:+(*HI"JRB*K:(HN(JBZ"J*PJLHBJ^B*,"*
MH@@KBD*L*(RQH#H6QEA.(8O3R.)4LCB=+$XIB]/*XM2R,+TLP12S!-/,$DPU
M2S#=+,&4LP33SA),/4LP_2S!%+0$T]"2_U;1<GX?CHF6O__^`W8YYH6%+-;=
M=,:^\1*%VT-?FKS3LVE^N-EO#GKS`(_/?BZ'WZMS.H^3]1N&9O/ZIW"_0FA_
M]VKR!YG9M>9AB="A93P/$_UVHM</?+(-R.SW'S6F.3`[6O8M'?\!``#__P,`
M4$L#!!0`!@`(````(0"U53`C]0```$P"```+``@"7W)E;',O+G)E;',@H@0"

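Is this an Excel file? Or an XBRL document? What is it, and how do I get rid of it (or "process" it somehow)? It goes on for thousands of lines, so I assume it is some kind of encoding of an attached file. Any ideas on how to deal with it?

I tried BeautifulSoup in Python: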

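from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed.
with open("textWithHtml.txt", encoding="utf-8", errors="replace") as markup:
    soup = BeautifulSoup(markup.read(), "html.parser")

# get_text() returns str; let the file object handle the UTF-8 encoding.
with open("processedText.txt", "w", encoding="utf-8") as f:
    f.write(soup.get_text())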

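But not everything gets removed, and I noticed that in some cases not even all of the HTML tags are stripped. Sometimes running the code a second time removes more than the first BeautifulSoup pass did.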

Answer 1:

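The encoding you are looking at is uuencode. In Python you can decode such a blob with the uu module, or simply call string.decode('uu') on the data. Note that string.decode('uu') only works in Python 2; in Python 3 the equivalent is codecs.decode(data, 'uu') on a bytes object. The uu module itself was deprecated in Python 3.11 and removed in 3.13, while binascii.a2b_uu can still decode the data one line at a time.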
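To make the decoding step concrete, here is a minimal sketch that uses only `binascii.a2b_uu` from the standard library. The helper name `uudecode_lines` is made up for this illustration, and the sketch assumes the lines of a single block are passed in (the `begin` and `end` lines are tolerated and skipped).

```python
import binascii


def uudecode_lines(lines):
    """Decode the lines of one uuencoded block into bytes.

    Hypothetical helper: 'begin ...' and 'end' lines are skipped,
    every other non-empty line is decoded with binascii.a2b_uu.
    """
    out = bytearray()
    for line in lines:
        line = line.rstrip("\r\n")  # keep trailing spaces, they can be data
        if not line or line.startswith("begin ") or line == "end":
            continue
        out += binascii.a2b_uu(line.encode("ascii"))
    return bytes(out)
```

In the sample from the question, the `begin` line names `Financial_Report.xlsx`, and the first body line decodes to bytes that start with `PK\x03\x04`, the ZIP signature. That is consistent with an .xlsx attachment, since .xlsx files are ZIP containers.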

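uuencode is a legacy format that was originally used to embed binary files in email, back when only 7-bit US-ASCII was allowed. The format also makes some concessions to interoperability with the big-iron systems of the day, which used their own bewildering character encodings. These days you would expect to see base64 in this role.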

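I posted an answer to the follow-up question that shows how to remove uuencode blobs while reading from a file handle or iterating over a bunch of text lines.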
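The follow-up answer is not reproduced here. As a rough sketch of the same idea (not necessarily the code from that answer), a generator can drop everything from a `begin <mode> <name>` line through the matching `end` line while passing other lines through. The names `strip_uuencoded`, `strip_file` and `BEGIN_RE` are assumptions made for this example.

```python
import re

# A uuencode header looks like "begin 644 Financial_Report.xlsx".
BEGIN_RE = re.compile(r"^begin [0-7]{3,4} \S")


def strip_uuencoded(lines):
    """Yield the given lines, skipping uuencoded blocks (begin ... end)."""
    inside = False
    for line in lines:
        if inside:
            # Stay silent until the closing 'end' line, then resume.
            if line.strip() == "end":
                inside = False
            continue
        if BEGIN_RE.match(line):
            inside = True
            continue
        yield line


def strip_file(src_path, dst_path):
    """Copy src_path to dst_path without the uuencoded blocks."""
    with open(src_path, encoding="utf-8", errors="replace") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(strip_uuencoded(src))
```

Running a filter like this before handing the text to BeautifulSoup keeps the thousands of encoded lines away from the HTML parser. Requiring an octal mode after `begin` keeps ordinary sentences that start with the word "begin" from being mistaken for a header.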



Answer 2:

This problem can be solved efficiently with the sed command provided here: sed command - apply to all text files (.txt) in a folder
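The linked sed command is not shown in this answer. For readers who prefer to stay in Python, the same kind of batch clean-up (delete every begin-to-end range in each .txt file of a folder) might look like the sketch below. The names `UU_BLOCK_RE`, `strip_uu_blocks` and `clean_folder` are hypothetical, and the files are rewritten in place, so try it on a copy first.

```python
import re
from pathlib import Path

# Matches from a "begin <mode> <name>" line through the following "end" line.
# Working on bytes keeps every other byte of the file unchanged.
UU_BLOCK_RE = re.compile(
    rb"^begin [0-7]{3,4} [^\n]*\n.*?^end[ \t]*(?:\r?\n|\Z)",
    re.MULTILINE | re.DOTALL,
)


def strip_uu_blocks(data):
    """Return data (bytes) with all uuencoded blocks removed."""
    return UU_BLOCK_RE.sub(b"", data)


def clean_folder(folder, pattern="*.txt"):
    """Rewrite matching files in place; return the names of changed files."""
    changed = []
    for path in sorted(Path(folder).glob(pattern)):
        data = path.read_bytes()
        cleaned = strip_uu_blocks(data)
        if cleaned != data:
            path.write_bytes(cleaned)
            changed.append(path.name)
    return changed
```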



Source: How to remove weird encoding from txt file