Python: chain a list from a TSV file

Posted 2019-03-01 12:59

Question:

I have a TSV file containing some paths of links that I want to use. Within each path, the links are separated by a ';'.

In the example below you can see that the text in the file is separated into columns. I only want to read the last column, which is a path starting with '14th'.

6a3701d319fc3754    1297740409  166    14th_century;15th_century;16th_century;Pacific_Ocean;Atlantic_Ocean;Accra;Africa;Atlantic_slave_trade;African_slave_trade    NULL
3824310e536af032    1344753412  88     14th_century;Europe;Africa;Atlantic_slave_trade;African_slave_trade  3
415612e93584d30e    1349298640  138    14th_century;Niger;Nigeria;British_Empire;Slavery;Africa;Atlantic_slave_trade;African_slave_trade

I want to somehow split the path into a chain like this:

['14th_century', 'Niger', 'Nigeria'....] 

How do I read the file and remove the first 3 columns so that I only have the last one?

UPDATE:

I have now tried this:

import re
with open('test.tsv') as f:
    lines = f.readlines()
for line in lines[22:len(lines)]:
    re.sub(r"^\s+", " ", line, flags = re.MULTILINE)
    e_line = line.split(' ')
    real_line = e_line[0]
    print real_line.split(';')

But the problem is that it is not removing the first 3 columns.

Answer 1:

If the separator between columns is only a single space, and not a run of spaces or a tab, you could do this:

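with open('file_name') as f:
    lines = f.readlines()
for line in lines:
    e_line = line.split(' ')  # only correct if columns are separated by exactly one space
    real_line = e_line[3].strip()  # 4th column; strip() drops a trailing newline
    print(real_line.split(';'))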
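If the columns may instead be separated by tabs or by several spaces, one option is to call split() with no argument, which splits on any run of whitespace. The following is a minimal sketch for Python 3; the row is copied from the question, and `sample_line` is only an illustrative name.

```python
# Sample row copied from the question; columns are separated by runs of spaces.
sample_line = (
    "3824310e536af032    1344753412  88     "
    "14th_century;Europe;Africa;Atlantic_slave_trade;African_slave_trade  3\n"
)

# split() with no argument splits on any run of whitespace
# (spaces, tabs, the trailing newline) and never yields empty strings.
columns = sample_line.split()

# The path is the 4th column (index 3); split it into a chain of links.
chain = columns[3].split(';')
print(chain)
```

This relies on the links themselves containing no whitespace, which holds for the rows shown in the question.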


Answer 2:

Answer to your updated question.

But the problem is that it is not removing the first 3 columns.

There are several mistakes.

Your code:

import re
with open('test.tsv') as f:
    lines = f.readlines()
for line in lines[22:len(lines)]:
    re.sub(r"^\s+", " ", line, flags = re.MULTILINE)
    e_line = line.split(' ')
    real_line = e_line[0]
    print real_line.split(';')

This line does nothing...

re.sub(r"^\s+", " ", line, flags = re.MULTILINE)

This is because re.sub does not modify your line variable; it returns a new string with the replacements applied. So you may want to assign the result back, as below.

line = re.sub(r"^\s+", " ", line, flags = re.MULTILINE)
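A small sketch of the difference, assuming Python 3 and an illustrative string with leading spaces (the names `line` and `unchanged` are only for demonstration):

```python
import re

line = "   14th_century;Niger"

# The return value is discarded here, so `line` still holds the old string.
re.sub(r"^\s+", " ", line)
unchanged = line

# Assigning the result back is what actually updates `line`.
line = re.sub(r"^\s+", " ", line)

print(repr(unchanged))
print(repr(line))
```

Strings in Python are immutable, so no function can change `line` in place; the result always has to be assigned.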

Also, your regexp ^\s+ only matches whitespace (spaces or tabs) at the very start of the string, because of the ^. But I think you just want to replace consecutive spaces or tabs with one space. In that case, just remove the ^ from the regexp, and the code above becomes:

line = re.sub(r"\s+", " ", line, flags = re.MULTILINE)

Now the columns in line are separated by exactly one space, so line.split(' ') will work as you want.
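To make the effect of the ^ concrete, here is a minimal sketch (Python 3) that applies both patterns to a shortened row from the question; `row`, `anchored` and `collapsed` are illustrative names:

```python
import re

# Shortened row from the question, columns separated by runs of spaces.
row = "415612e93584d30e    1349298640  138    14th_century;Niger;Nigeria\n"

# With ^ the pattern can only match at the start of the string.
# This row does not start with whitespace, so nothing is replaced.
anchored = re.sub(r"^\s+", " ", row)

# Without ^ every run of whitespace collapses into a single space.
collapsed = re.sub(r"\s+", " ", row)

print(anchored == row)
print(collapsed.split(' '))
```

Note that \s also matches the trailing newline, so it becomes a space too and the split list ends with an empty string. That does not affect index 3.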

Next, e_line[0] returns the first element of e_line, which is the 1st column of the line. But you want to skip the first 3 columns and get the 4th column. You can do it like this:

e_line = line.split(' ')
real_line = e_line[3]

OK. Now the entire code looks like this.

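import re
with open('test.tsv') as f:
    lines = f.readlines()
for line in lines:  # changed: there is no need to skip the first 22 lines in your example
    line = re.sub(r"\s+", " ", line)  # collapse each run of whitespace into one space
    e_line = line.split(' ')
    real_line = e_line[3]  # 4th column: the path
    print(real_line)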

output:

14th_century;15th_century;16th_century;Pacific_Ocean;Atlantic_Ocean;Accra;Africa;Atlantic_slave_trade;African_slave_trade
14th_century;Europe;Africa;Atlantic_slave_trade;African_slave_trade
14th_century;Niger;Nigeria;British_Empire;Slavery;Africa;Atlantic_slave_trade;African_slave_trade

P.S:

This line can be made more Pythonic.

before:

for line in lines[22:len(lines)]:

after:

for line in lines[22:]:

Also, you don't need flags = re.MULTILINE, because line holds a single line inside the for-loop. The flag only changes how ^ and $ match.
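For reference, a minimal sketch (Python 3) of what re.MULTILINE does change, using an illustrative two-line string rather than data from the question:

```python
import re

text = "  a\n  b"

# Without the flag, ^ matches only at the very start of the string.
start_only = re.sub(r"^\s+", "", text)

# With re.MULTILINE, ^ also matches right after each newline.
every_line = re.sub(r"^\s+", "", text, flags=re.MULTILINE)

print(repr(start_only))
print(repr(every_line))
```

Since the pattern \s+ contains neither ^ nor $, the flag makes no difference to it.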



Answer 3:

You don't need to use regex for this. The csv module can handle tab-separated files too:

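import csv

# Python 3: open in text mode with newline='' (Python 2 used mode 'rb' instead)
with open('test.tsv', newline='') as f:
    filereader = csv.reader(f, delimiter='\t')
    # row[3] is the 4th column, i.e. the path
    path_list = [row[3].split(';') for row in filereader]

print(path_list)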
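To try this without a file on disk, here is a self-contained sketch for Python 3. It assumes the columns really are tab-separated: the two rows are taken from the question but rewritten with explicit tabs, and `sample` stands in for the contents of test.tsv.

```python
import csv
import io

# Stand-in for test.tsv: rows from the question, with tabs as separators (assumption).
sample = (
    "3824310e536af032\t1344753412\t88\t"
    "14th_century;Europe;Africa;Atlantic_slave_trade;African_slave_trade\t3\n"
    "415612e93584d30e\t1349298640\t138\t"
    "14th_century;Niger;Nigeria;British_Empire;Slavery;Africa;"
    "Atlantic_slave_trade;African_slave_trade\t\n"
)

reader = csv.reader(io.StringIO(sample), delimiter='\t')

# Keep only rows that actually have a 4th column, so blank or short
# lines are skipped instead of raising IndexError.
path_list = [row[3].split(';') for row in reader if len(row) > 3]

for chain in path_list:
    print(chain)
```

The `len(row) > 3` guard is an addition for robustness and is not part of the answer above. If the file turns out to use spaces rather than tabs, csv.reader with delimiter='\t' will return each line as a single column and every row will be skipped.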