extract data at specific columns in a line if ther

Posted 2019-09-20 04:12

Question:

I have a file with lines of data like the ones below. I need to pull out the characters at columns 74-79 and 122-124. Some lines will not have any characters at 74-79, and I want to skip those lines.

import re

def main():
    file = open("CCDATA.TXT", "r")
    lines = file.readlines()
    file.close()

    for line in lines:
        # collapse runs of spaces into a single space
        collapsed = re.sub(r" +", " ", line)
        print(collapsed)


main()

CF214L214L1671310491084111159          Customer Name                     46081                 171638440 0000320800000000HCCCIUAW    0612170609170609170300000000003135                                                              
CF214L214L1671310491107111509          Customer Name                     46144                 171639547 0000421200000000DRNRIUAW    0612170613170613170300000000003135                                                              
CF214L214L1671380999999900002000007420                                                                                                                                                                                           
CF214L214L1671310491084111159          Customer Name                     46081                 171638440 0000320800000000DRCSIU      0612170609170609170300000000003135                                                              
CF214L214L1671380999999900001000003208                                                                                                                                                                                           
CF214L214L1671510446646410055          Customer Name                     46436                 171677320 0000027200000272AA          0616170623170623170300000050003001                                                              
CF214L214L1671510126566110169          Customer Name                     46450                 171677321 0000117900001179AA          0616170623170623170300000250003001                                                              
CF214L214L1671510063942910172          Customer Name                     46413                 171677322 0000159300001593AA          0616170623170623170300000150003001                                                              
CF214L214L1671510808861010253          Customer Name                     46448                 171677323 0000298600002986AA          0616170623170623170300000350003001                                                              
CF214L214L1671510077309510502          Customer Name                     46434                 171677324 0000294300002943AA          0616170622170622170300000150003001                                                              
CF214L214L1671580999999900029000077728                                                                                                                                                                                           
CF214L214L1671610049631611165          Customer Name                     46221                 171677648 0000178700000000            0616170619170619170300000000003000                                                              
CF214L214L1671610895609911978          Customer Name                     46433                 171677348 0000011800000118AC          0616170622170622170300000150003041                                                              
CF214L214L1671680999999900002000001905 

Answer 1:

Short answer:

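Just take line[74:79] and so on, as Roelant suggested. The lines in your input are padded to 230 characters, and slicing past the end of a string returns a shorter or empty string rather than raising, so there will never be an IndexError. You need to check instead whether the result is empty or all whitespace. str.isspace() covers the all-whitespace case but returns False for an empty string, so strip() is the safer test:

field = line[74:79]
# <...>
if not field.strip(): continue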
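
To make the slicing behaviour concrete, here is a minimal sketch. The function name `extract_columns` and the synthetic test lines are illustrative only, and the end offset 125 for the second field is an assumption (the question says 122-124, which may be inclusive).

```python
def extract_columns(line, first=(74, 79), second=(122, 125)):
    """Return the two column slices, or None when the first one is blank."""
    a = line[first[0]:first[1]]
    b = line[second[0]:second[1]]
    # A slice past the end of a short line is '' rather than an error,
    # and ''.isspace() is False, so test with strip() instead.
    if not a.strip():
        return None
    return a, b


# Synthetic lines (not taken from the real file): data placed at known offsets.
full = " " * 74 + "46081" + " " * 43 + "UAW"
short = "CF214L214L1671380999999900002000007420"

print(extract_columns(full))   # ('46081', 'UAW')
print(extract_columns(short))  # None
```

Adjust the offsets if the column numbers in your format spec are 1-based.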

A more robust approach that would also validate input (check if you're required to do so) is to parse the entire line and use a specific element from the result.

One way is a regex, as per "Parse a text file and extract a specific column" and "Tips for reading in a complex file - Python", with an example at "get the path in a file inside {} by python".

But your specific format appears to be an archaic, punchcard-derived one, with the column number defining the datum's meaning, so it can probably be expressed more conveniently as a sequence of column ranges associated with field names (you never told us what they mean, so I'm using generic names):

fields=[
    ("id1",(0,39)),
    ("cname_text",(40,73)),
    ("num2",(74,79)),
    ("num3",(96,105)),
    #whether to introduce a separate field at [122:125]
    # or parse "id4" further after getting it is up to you.
    # I'd suggest you follow the official format spec.
    ("id4",(106,130)),
    ("num5",(134,168))
]
line_end=230

And parsed like this:

def parse_line(line,fields,end):
    result={}
    #for whitespace validation
    # prev_ecol=0
    for fname,(scol,ecol) in fields:
        #optionally validate delimiting whitespace
        # assert prev_ecol==scol or line[prev_ecol:scol].isspace()
        #slicing never raises IndexError; a line shorter than `end' just yields a shorter or empty field
        field=line[scol:ecol]
        #optionally do conversion and such, this is completely up to you
        field=field.rstrip(' ')
        if not field: field=None
        result[fname]=field
        #for whitespace validation
        # prev_ecol=ecol
    #optionally validate line end
    # assert ecol==end or line[ecol:end].isspace()
    return result

All that remains is to skip lines where the field is empty:

for line in lines:
    data = parse_line(line,fields,line_end)
    if any(data[fname] is None for fname in ('num2','id4')): continue

    #handle the data  
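
The optional whitespace checks that are commented out in parse_line could be sketched roughly as follows. This is only an illustration under assumptions: the name `parse_fixed_width`, the tiny `DEMO_FIELDS` layout and `DEMO_END` are hypothetical and do not come from the real format spec, and raising ValueError is one possible way to report a malformed line.

```python
def parse_fixed_width(line, fields, end):
    """Parse one fixed-width line into {name: text or None}.

    Raises ValueError if anything other than whitespace appears
    between fields or after the last field.
    """
    line = line.rstrip("\n")
    if len(line) > end:
        raise ValueError("line is longer than %d characters" % end)
    line = line.ljust(end)  # pad short lines so every slice has full width
    result = {}
    prev_end = 0
    for name, (start, stop) in fields:
        gap = line[prev_end:start]
        if gap.strip():
            raise ValueError("unexpected text %r before field %r" % (gap, name))
        value = line[start:stop].strip()
        result[name] = value or None  # blank fields become None
        prev_end = stop
    if line[prev_end:end].strip():
        raise ValueError("unexpected text after the last field")
    return result


# Hypothetical three-field layout, used only to exercise the function.
DEMO_FIELDS = [("id", (0, 5)), ("name", (6, 16)), ("num", (17, 22))]
DEMO_END = 30

good = "AB123" + " " + "Customer".ljust(10) + " " + "46081"
print(parse_fixed_width(good, DEMO_FIELDS, DEMO_END))
```

Whether a malformed line should raise, be logged, or be skipped silently depends on how much you trust the input.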


Answer 2:

def read_all_lines(filename='CCDATA.TXT'):
    with open(filename,"r") as file:
        for line in file:
            try:
                first = line[74:79]
                second = line[122:124]
            except IndexError:
                continue  # skip line
            else:
                do_something_with(first, second)

Edit: Thanks for the comments. Slicing never raises an IndexError, so the check should have been:

for line in file:
    first = line[74:79]
    second = line[122:124]
    # strip() returns '' for blank or missing columns, which is falsy
    if first.strip() and second.strip():
        do_something_with(first, second)
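
If you would rather stream the file than call a handler inside the loop, a generator is one option. The sketch below is an assumption-laden illustration: `iter_columns` is a made-up name, the sample lines are synthetic, and it skips a line only when the first field is blank, which is what the question asks for.

```python
import io


def iter_columns(lines, first=(74, 79), second=(122, 124)):
    """Yield (line_number, first_field, second_field) for lines whose first field is non-blank."""
    for number, line in enumerate(lines, start=1):
        a = line[first[0]:first[1]]
        b = line[second[0]:second[1]]
        if a.strip():
            yield number, a, b


# io.StringIO stands in for an open file; both are iterated line by line.
sample = io.StringIO(
    " " * 74 + "46081" + " " * 43 + "UA" + "\n"
    + "CF214L214L1671380999999900002000007420\n"
)
for number, a, b in iter_columns(sample):
    print(number, a, b)
```

With a real file this would be used as `with open("CCDATA.TXT") as f: for number, a, b in iter_columns(f): ...`, so only one line is held in memory at a time.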