I have a txt file which looks like below including 4 rows as an example and each row strings are separated by a ,
.
"India1,India2,myIndia "
"Where,Here,Here "
"Here,Where,India,uyete"
"AFD,TTT"
https://gist.github.com/anonymous/cee79db7029a7d4e46cc4a7e92c59c50
the file can be downloaded from here
I want to extract all unique cells across all , the output2
India1
India2
myIndia
Where
Here
India
uyete
AFD
TTT
I tried to read line by line and print it ìf i call my data as df`
myfile = open("df.txt")
lines = myfile.readlines()
for line in lines:
print lines
Option 1: .csv
, .txt
Files
Native Python is unable to read .xls
files. If you convert your file(s) to .csv
or .txt
, you can use the csv
module within the Standard Library:
# `csv` module, Standard Library
import csv
filepath = "./test.csv"
with open(filepath, "r") as f:
reader = csv.reader(f, delimiter=',')
header = next(reader) # skip 'A', 'B'
items = set()
for line in reader:
line = [word.replace(" ", "") for word in line if word]
line = filter(str.strip, line)
items.update(line)
print(list(items))
# ['uyete', 'NHYG', 'QHD', 'SGDH', 'AFD', 'DNGS', 'lkd', 'TTT']
Option 2: .xls
, .xlsx
Files
If you want to retain the original .xls
format, you have to install a third-party module to handle Excel files.
Install xlrd
from the command prompt:
pip install xlrd
In Python:
# `xlrd` module, third-party
import itertools
import xlrd
filepath = "./test.xls"
with xlrd.open_workbook(filepath) as workbook:
worksheet = workbook.sheet_by_index(0) # assumes first sheet
rows = (worksheet.row_values(i) for i in range(1, worksheet.nrows))
cells = itertools.chain.from_iterable(rows)
items = list({val.replace(" ", "") for val in cells if val})
print(list(items))
# ['uyete', 'NHYG', 'QHD', 'SGDH', 'AFD', 'DNGS', 'lkd', 'TTT']
Option 3: DataFrames
You can handle csv and text files with pandas DataFrames. See documentation for other formats.
import pandas as pd
import numpy as np
# Using data from gist.github.com/anonymous/a822647a00087abc12de3053c700b9a8
filepath = "./test2.txt"
# Determines columns from the first line, so add commas in text file, else may throw an error
df = pd.read_csv(filepath, sep=",", header=None, error_bad_lines=False)
df = df.replace(r"[^A-Za-z0-9]+", np.nan, regex=True) # remove special chars
stack = df.stack()
clean_df = pd.Series(stack.unique())
clean_df
DataFrame Output
0 India1
1 India2
2 myIndia
3 Where
4 Here
5 India
6 uyete
7 AFD
8 TTT
dtype: object
Save as Files
# Save as .txt or .csv without index, optional
# target = "./output.csv"
target = "./output.txt"
clean_df.to_csv(target, index=False)
Note: Results from options 1 & 2 can be converted to unordered, pandas columnar objects too with pd.Series(list(items))
.
Finally: As a Script
Save any of the three options above in a function (stack
) within a file (named restack.py
). Save this script to a directory.
# restack.py
import pandas as pd
import numpy as np
def stack(filepath, save=False, target="./output.txt"):
# Using data from gist.github.com/anonymous/a822647a00087abc12de3053c700b9a8
# Determines columns from the first line, so add commas in text file, else may throw an error
df = pd.read_csv(filepath, sep=",", header=None, error_bad_lines=False)
df = df.replace(r"[^A-Za-z0-9]+", np.nan, regex=True) # remove special chars
stack = df.stack()
clean_df = pd.Series(stack.unique())
if save:
clean_df.to_csv(target, index=False)
print("Your results have been saved to '{}'".format(target))
return clean_df
if __name__ == "__main__":
# Set up input prompts
msg1 = "Enter path to input file e.g. ./test.txt: "
msg2 = "Save results to a file? y/[n]: "
try:
# Python 2
fp = raw_input(msg1)
result = raw_input(msg2)
except NameError:
# Python 3
fp = input(msg1)
result = input(msg2)
if result.startswith("y"):
save = True
else:
save = False
print(stack(fp, save=save))
From its working directory, run the script via commandline. Answer the prompts:
> python restack.py
Enter path to input file e.g. ./test.txt: ./@data/test2.txt
Save results to a file? y/[n]: y
Your results have been saved to './output.txt'
Your results should print in you console and optionally save to a file output.txt
. Adjust any parameters to suit your interests.
If your stack.txt
file looks like this (i.e. it's saved as a .txt
file):
"India1,India2,myIndia "
"Where,Here,Here "
"Here,Where,India,uyete"
"AFD,TTT"
The solution:
from collections import OrderedDict
with open("stack.txt", "r") as f:
# read your data in and strip off any new-line characters
data = [eval(line).strip() for line in f.readlines()]
# get individual words into a list
individual_elements = [word for row in data for word in row.split(",")]
# remove duplicates and preserve order
uniques = OrderedDict.fromkeys(individual_elements)
# convert from OrderedDict object to plain list
final = [word for word in uniques]
To get your desired columnar output:
print("\n".join(final))
Which yields:
India1
India2
myIndia
Where
Here
India
uyete
AFD
TTT
I won't give you the whole code, but I'll give you some ideas.
First, you need to read all the lines of the file:
lines = open("file.txt").readlines()
Then, extract the data from each line:
lines = [line.split(",") for line in lines]
You can generate combinations with itertools.combinations
. For each line, print the combinations of the line's elements.
You can get the unique elements with set
if you don't care about the order of the elements. Before using set
, you should flatten the list lines
first, maybe using itertools.chain.from_iterable
.
Your code for reading the text file line by line is fine. So you still need to
- Split each line into "cells"
- Remove duplicates
You can split each line into cells using split
line.split(',')
And you want to remove white space, so I would strip
each cell:
[value.strip() for elem in line.split(',')]
And you can remove duplicates with set
set(cells)
Lastly, I think it's better to use with
(a context manager) when reading files. Putting it all together:
with open('df.txt', 'r') as f:
cells = []
for line in f:
cells += [value.strip() for value in line.split(',')]
cells = list(set(cells))
If you want to be more compact, you can do it in a single list comprehension:
with open('df.txt', 'r') as f:
cells = list(set([value.strip() for line in f for value in line.split(',']))