Find and replace strings in Excel (.xlsx) using Py

2019-02-20 04:36发布

问题:

I am trying to replace a bunch of strings in an .xlsx sheet (~70k rows, 38 columns). I have a list of the strings to be searched and replaced in a file, formatted as below:-

bird produk - bird product
pig - pork
ayam - chicken
...
kuda - horse

The word to be searched is on the left, and the replacement is on the right (find 'bird produk', replace with 'bird product'. My .xlsx sheet looks something like this:-

name     type of animal     ID
ali      pig                3483
abu      kuda               3940
ahmad    bird produk        0399
...
ahchong  pig                2311

I am looking for the fastest solution for this, since I have around 200 words in the list to be searched, and the .xlsx file is quite large. I need to use Python for this, but I am open to any other faster solutions.

Edit:- added sheet example

Edit2:- tried some python codes to read the cells, took quite a long time to read. Any pointers?

from xlrd import open_workbook
wb = open_workbook('test.xlsx')

for s in wb.sheets():
    print ('Sheet:',s.name)
    for row in range(s.nrows):
        values = []
        for col in range(s.ncols):
            print(s.cell(row,col).value)

Thank you!

Edit3:- I finally figured it out. Both VBA module and Python codes work. I resorted to .csv instead to make things easier. Thank you! Here is my version of the Python code:-

import csv

###### our dictionary with our key:values. ######
reps = {
    'JUALAN (PRODUK SHJ)' : 'SALE( PRODUCT)',
    'PAMERAN' : 'EXHIBITION',
    'PEMBIAKAN' : 'BREEDING',
    'UNGGAS' : 'POULTRY'}


def replace_all(text, dic):
    for i, j in reps.items():
        text = text.replace(i, j)
    return text

with open('test.csv','r') as f:
    text=f.read()
    text=replace_all(text,reps)

with open('file2.csv','w') as w:
    w.write(text)

回答1:

I would copy the contents of your text file into a new worksheet in the excel file and name that sheet "Lookup." Then use text to columns to get the data in the first two columns of this new sheet starting in the first row.

Paste the following code into a module in Excel and run it:

Sub Replacer()
    Dim w1 As Worksheet
    Dim w2 As Worksheet

    'The sheet with the words from the text file:
    Set w1 = ThisWorkbook.Sheets("Lookup")
    'The sheet with all of the data:
    Set w2 = ThisWorkbook.Sheets("Data")

    For i = 1 To w1.Range("A1").CurrentRegion.Rows.Count
        w2.Cells.Replace What:=w1.Cells(i, 1), Replacement:=w1.Cells(i, 2), LookAt:=xlPart, _
        SearchOrder:=xlByRows, MatchCase:=False, SearchFormat:=False, _
        ReplaceFormat:=False
    Next i

End Sub


回答2:

Make 2 arrays A[bird produk, pig, ayam, kuda] //words to be changed B[bird product, pork, chicken, horse] //result after changing the word

Now check each row of your excel and compare it with every element of A. If i matches then replace it with corresponding element of B.

for example // not actual code something like pseudocode

for (i=1 to no. of rows.)
{
for(j=1 to 200)
{
if(contents of row[i] == A[j])
then contents of row[i]=B[j] ;
break;
}
}

To make it fast you have to stop the current iteration as soon as the word is replaced and check the next row.



回答3:

Similar idea to @coder_A 's, but use a dictionary to do the "translation" for you, where the keys are the original words and the value for each key is what it gets translated to.



回答4:

For reading and writing xls with Python, use xlrd and xlwt, see http://www.python-excel.org/

A simple xlrd example:

from xlrd import open_workbook
wb = open_workbook('simple.xls')

for s in wb.sheets():
    print 'Sheet:',s.name
    for row in range(s.nrows):
        values = []
        for col in range(s.ncols):
            print(s.cell(row,col).value)

and for replacing target text, use a dict

replace = {
    'bird produk': 'bird product',
    'pig': 'pork',
    'ayam': 'chicken'
    ...
    'kuda': 'horse'
}

Dict will give you O(1)(most of the time, if keys don't collide) time complexity when checking membership using 'text' in replace. there's no way to get better performance than that.

Since I don't know what your bunch of strings look like, this answer may be inaccurate or incomplete.