How can I make a unique list of cells?

Published 2019-09-06 08:50

Question:

I have a txt file that looks like the example below, with 4 rows; the strings in each row are separated by a `,`.

"India1,India2,myIndia     "
"Where,Here,Here   "
"Here,Where,India,uyete"
"AFD,TTT"

The file can be downloaded from here:

https://gist.github.com/anonymous/cee79db7029a7d4e46cc4a7e92c59c50

I want to extract all unique cells across all rows. The desired output is:

   India1
   India2
   myIndia
   Where
   Here
   India
   uyete
   AFD 
   TTT

I tried to read the file line by line and print it (I call my data `df`):

myfile = open("df.txt")
lines = myfile.readlines()
for line in lines:
   print(line)

Answer 1:

Option 1: .csv, .txt Files

Native Python is unable to read .xls files. If you convert your file(s) to .csv or .txt, you can use the csv module within the Standard Library:

# `csv` module, Standard Library
import csv

filepath = "./test.csv"

with open(filepath, "r") as f:
    reader = csv.reader(f, delimiter=',')
    header = next(reader)                                  # skip 'A', 'B'
    items = set()
    for line in reader:
        line = [word.replace(" ", "") for word in line if word]
        line = filter(str.strip, line)
        items.update(line)

print(list(items))
# ['uyete', 'NHYG', 'QHD', 'SGDH', 'AFD', 'DNGS', 'lkd', 'TTT']

Option 2: .xls, .xlsx Files

If you want to retain the original .xls format, you have to install a third-party module to handle Excel files.

Install xlrd from the command prompt:

pip install xlrd

In Python:

# `xlrd` module, third-party
import itertools
import xlrd

filepath = "./test.xls"

with xlrd.open_workbook(filepath) as workbook:
    worksheet = workbook.sheet_by_index(0)                 # assumes first sheet
    rows = (worksheet.row_values(i) for i in range(1, worksheet.nrows))
    cells = itertools.chain.from_iterable(rows)
    items = list({val.replace(" ", "") for val in cells if val})

print(list(items))
# ['uyete', 'NHYG', 'QHD', 'SGDH', 'AFD', 'DNGS', 'lkd', 'TTT']

Option 3: DataFrames

You can handle csv and text files with pandas DataFrames. See documentation for other formats.

import pandas as pd
import numpy as np

# Using data from gist.github.com/anonymous/a822647a00087abc12de3053c700b9a8
filepath = "./test2.txt"

# pandas infers the number of columns from the first line, so pad short rows
# with commas in the text file; `error_bad_lines=False` skips rows with too many fields
df = pd.read_csv(filepath, sep=",", header=None, error_bad_lines=False)
df = df.replace(r"[^A-Za-z0-9]+", np.nan, regex=True)      # remove special chars    
stack = df.stack()
clean_df = pd.Series(stack.unique())
clean_df

DataFrame Output

0     India1
1     India2
2    myIndia
3      Where
4       Here
5      India
6      uyete
7        AFD
8        TTT
dtype: object

Save as Files

# Save as .txt or .csv without index, optional

# target = "./output.csv"
target = "./output.txt"
clean_df.to_csv(target, index=False)

Note: Results from options 1 & 2 can be converted to unordered, pandas columnar objects too with pd.Series(list(items)).
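For illustration, a minimal sketch of that conversion (the `items` set below is a hypothetical subset of the option 1/2 results):

```python
import pandas as pd

# hypothetical subset of the `items` set produced by option 1 or 2
items = {"AFD", "TTT", "uyete"}

# sort for a stable order; pd.Series(list(items)) keeps it unordered
s = pd.Series(sorted(items))
print(s.tolist())
```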

Finally: As a Script

Wrap any of the three options above in a function (here stack) inside a script file (here restack.py), and save the script to a directory.

# restack.py
import pandas as pd
import numpy as np

def stack(filepath, save=False, target="./output.txt"):
    # Using data from gist.github.com/anonymous/a822647a00087abc12de3053c700b9a8

    # pandas infers the number of columns from the first line, so pad short rows
    # with commas in the text file; `error_bad_lines=False` skips rows with too many fields
    df = pd.read_csv(filepath, sep=",", header=None, error_bad_lines=False)
    df = df.replace(r"[^A-Za-z0-9]+", np.nan, regex=True)      # remove special chars    
    stack = df.stack()
    clean_df = pd.Series(stack.unique())

    if save:
        clean_df.to_csv(target, index=False)
        print("Your results have been saved to '{}'".format(target))

    return clean_df

if __name__ == "__main__":
    # Set up input prompts
    msg1 = "Enter path to input file e.g. ./test.txt: "
    msg2 = "Save results to a file? y/[n]: "

    try:
        # Python 2
        fp = raw_input(msg1)
        result = raw_input(msg2)
    except NameError:
        # Python 3
        fp = input(msg1)
        result = input(msg2)

    if result.startswith("y"):
        save = True
    else:
        save = False

    print(stack(fp, save=save))

From the script's directory, run it on the command line and answer the prompts:

> python restack.py 

Enter path to input file e.g. ./test.txt: ./@data/test2.txt
Save results to a file? y/[n]: y
Your results have been saved to './output.txt'

Your results should print in your console and, optionally, be saved to a file output.txt. Adjust any parameters to suit your interests.



Answer 2:

If your stack.txt file looks like this (i.e. it's saved as a .txt file):

"India1,India2,myIndia     "
"Where,Here,Here   "
"Here,Where,India,uyete"
"AFD,TTT"

The solution:

import ast
from collections import OrderedDict

with open("stack.txt", "r") as f:
    # safely parse each quoted line into a string and strip surrounding whitespace
    # (ast.literal_eval avoids the security risk of eval on file input)
    data = [ast.literal_eval(line).strip() for line in f]
    # get individual words into a list
    individual_elements = [word for row in data for word in row.split(",")]
    # remove duplicates and preserve order
    uniques = OrderedDict.fromkeys(individual_elements)   
    # convert from OrderedDict object to plain list
    final = [word for word in uniques]

To get your desired columnar output:

print("\n".join(final))

Which yields:

India1
India2
myIndia
Where
Here
India
uyete
AFD
TTT


Answer 3:

I won't give you the whole code, but I'll give you some ideas.

First, you need to read all the lines of the file:

lines = open("file.txt").readlines()

Then, extract the data from each line:

lines = [line.split(",") for line in lines]

You can generate combinations with itertools.combinations. For each line, print the combinations of the line's elements.

You can get the unique elements with set if you don't care about the order of the elements. Before using set, you should flatten the list lines first, maybe using itertools.chain.from_iterable.
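A minimal sketch combining those ideas (the sample file is recreated here so the snippet is self-contained; assuming, as the sample data suggests, that the surrounding quotes and whitespace should be stripped):

```python
import itertools

# recreate the sample data from the question in file.txt
with open("file.txt", "w") as f:
    f.write('"India1,India2,myIndia     "\n"Where,Here,Here   "\n'
            '"Here,Where,India,uyete"\n"AFD,TTT"\n')

with open("file.txt") as f:
    # strip the newline and surrounding quotes, then split each line into cells
    lines = [line.strip().strip('"').split(",") for line in f]

# flatten the nested lists and de-duplicate with a set (order is not preserved)
cells = itertools.chain.from_iterable(lines)
unique_cells = {cell.strip() for cell in cells}
print(unique_cells)
```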



Answer 4:

Your code for reading the text file line by line is fine. So you still need to

  1. Split each line into "cells"
  2. Remove duplicates

You can split each line into cells using split:

line.split(',')

And you want to remove whitespace, so I would strip each cell:

[value.strip() for value in line.split(',')]

And you can remove duplicates with set

set(cells)

Lastly, I think it's better to use with (a context manager) when reading files. Putting it all together:

with open('df.txt', 'r') as f:
    cells = []
    for line in f:
        cells += [value.strip() for value in line.split(',')]

cells = list(set(cells))

If you want to be more compact, you can do it in a single list comprehension:

with open('df.txt', 'r') as f:
    cells = list(set([value.strip() for line in f for value in line.split(',')]))