I have a TSV file containing paths of links; within each path the links are separated by a ';'.
In the example below you can see how the text in the file is separated,
and I only want to read the last column, which is a path starting with '14th':
6a3701d319fc3754 1297740409 166 14th_century;15th_century;16th_century;Pacific_Ocean;Atlantic_Ocean;Accra;Africa;Atlantic_slave_trade;African_slave_trade NULL
3824310e536af032 1344753412 88 14th_century;Europe;Africa;Atlantic_slave_trade;African_slave_trade 3
415612e93584d30e 1349298640 138 14th_century;Niger;Nigeria;British_Empire;Slavery;Africa;Atlantic_slave_trade;African_slave_trade
I want to somehow split the path into a chain like this:
['14th_century', 'Niger', 'Nigeria'....]
How do I read the file and remove the first 3 columns so that I only get the last one?
UPDATE:
I have tried this now:
import re

with open('test.tsv') as f:
    lines = f.readlines()

for line in lines[22:len(lines)]:
    re.sub(r"^\s+", " ", line, flags = re.MULTILINE)
    e_line = line.split(' ')
    real_line = e_line[0]
    print real_line.split(';')
But the problem is that it is not deleting the first 3 columns?
If the separator between the columns is only a single space, and not a series of spaces or a tab, you could do this:
with open('file_name') as f:
    lines = f.readlines()

for line in lines:
    e_line = line.split(' ')
    real_line = e_line[3]  # 4th column: the path
    print(real_line.split(';'))
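If the columns may instead be separated by tabs or by runs of spaces, one possible sketch is to call split() with no argument, which splits on any run of whitespace. The sample line is taken from the question with tabs assumed between the columns; the helper name path_from_line is only illustrative, and the sketch assumes that no column contains whitespace itself.

```python
def path_from_line(line):
    """Return the 4th whitespace-separated column of a line as a list of links."""
    columns = line.split()  # no argument: split on any run of spaces or tabs
    return columns[3].split(';')


# Sample row from the question; tab separators are an assumption.
sample = "3824310e536af032\t1344753412\t88\t14th_century;Europe;Africa;Atlantic_slave_trade;African_slave_trade\t3\n"
print(path_from_line(sample))
```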
Answer to your updated question.
But the problem is that it is not deleting the first 3 columns?
There are several mistakes.
Your code:
import re

with open('test.tsv') as f:
    lines = f.readlines()

for line in lines[22:len(lines)]:
    re.sub(r"^\s+", " ", line, flags = re.MULTILINE)
    e_line = line.split(' ')
    real_line = e_line[0]
    print real_line.split(';')
This line does nothing...
re.sub(r"^\s+", " ", line, flags = re.MULTILINE)
That is because the re.sub function doesn't change your line variable; it returns the replaced string.
So you may want to do the following:
line = re.sub(r"^\s+", " ", line, flags = re.MULTILINE)
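A quick way to check this, as a small sketch with a made-up sample string: Python strings are immutable, so re.sub can only hand back a new string, and the result has to be assigned to a name.

```python
import re

line = "a \t b"
re.sub(r"\s+", " ", line)         # return value discarded; line is unchanged
unchanged = line
line = re.sub(r"\s+", " ", line)  # rebind the name to the returned string
print(repr(unchanged), repr(line))
```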
Also, your regexp ^\s+ matches only whitespace (spaces or tabs) at the start of the string, because you use ^.
But I think you just want to replace consecutive spaces or tabs with one space.
So the code above becomes the following (just remove ^ from the regexp):
line = re.sub(r"\s+", " ", line, flags = re.MULTILINE)
Now the fields in line are separated by just one space, so line.split(' ') will work as you want.
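The difference between the anchored and the unanchored pattern can be checked on a sample string. The string below is invented for illustration and is not from your file:

```python
import re

line = "  id1\t\t123   path"
anchored = re.sub(r"^\s+", " ", line)   # only the leading whitespace is replaced
collapsed = re.sub(r"\s+", " ", line)   # every run of whitespace is replaced
print(repr(anchored))
print(repr(collapsed))
print(collapsed.split(' '))
```

Note that if a line starts with whitespace, the collapsed string starts with a space and split(' ') yields an empty first element, so you may want to call strip() on such lines first.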
Next, e_line[0] returns the first element of e_line, which is the 1st column of the line.
But you want to skip the first 3 columns and get the 4th column. You can do it like this:
e_line = line.split(' ')
real_line = e_line[3]
OK. Now the entire code looks like this.
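import re

with open('test.tsv') as f:
    lines = f.readlines()

for line in lines:  # <--- I also changed this, because there is no need to skip the first 22 lines in your example.
    line = re.sub(r"\s+", " ", line)  # collapse every run of whitespace into one space
    e_line = line.split(' ')
    real_line = e_line[3]  # 4th column: the path
    print(real_line)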
output:
14th_century;15th_century;16th_century;Pacific_Ocean;Atlantic_Ocean;Accra;Africa;Atlantic_slave_trade;African_slave_trade
14th_century;Europe;Africa;Atlantic_slave_trade;African_slave_trade
14th_century;Niger;Nigeria;British_Empire;Slavery;Africa;Atlantic_slave_trade;African_slave_trade
P.S:
This line can be made more Pythonic.
before:
for line in lines[22:len(lines)]:
after:
for line in lines[22:]:
Also, you don't need to use flags = re.MULTILINE, because line is a single line inside the for loop.
You don't need to use regex for this. The csv module can handle tab-separated files too:
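import csv

# Python 3: open in text mode with newline='' (on Python 2, open with 'rb' instead).
with open('test.tsv', newline='') as f:
    filereader = csv.reader(f, delimiter='\t')
    path_list = [row[3].split(';') for row in filereader]
print(path_list)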
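To try this without a file on disk, here is a self-contained sketch (Python 3) that feeds two rows from the question to csv.reader through io.StringIO. It assumes the columns really are tab-separated, as the .tsv extension suggests; if they are separated by spaces, the delimiter would need to change.

```python
import csv
import io

# Two rows from the question; tab separators are an assumption.
sample = (
    "3824310e536af032\t1344753412\t88\t"
    "14th_century;Europe;Africa;Atlantic_slave_trade;African_slave_trade\t3\n"
    "415612e93584d30e\t1349298640\t138\t"
    "14th_century;Niger;Nigeria;British_Empire;Slavery;Africa;Atlantic_slave_trade;African_slave_trade\n"
)

reader = csv.reader(io.StringIO(sample), delimiter='\t')
path_list = [row[3].split(';') for row in reader]  # 4th column, split into links
for path in path_list:
    print(path)
```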