Approximate String Matching - Machine Learning [cl

Posted 2020-05-10 07:21

Question:

I have a requirement where my source data is in HDFS, and one field contains the skills of each user. The source file has all kinds of skills attributed to a user, for example: MANAGEMENT, JAVA, HADOOP, PIG, SQL, PERFORMANCE TUNING, C, BUSINESS CONSULTING, SALES, etc.

I need to build a machine learning algorithm to detect spelling mistakes in these skills. For example, the column might have "sals" instead of "sales", or "hadoop" might be misspelt as "hadup". I want to standardise these values.

How can I go about doing this? I don't know machine learning, but I am willing to learn and code it. I am comfortable working in Python.

Any suggestions on how to approach this? It would be great if you could pitch in with ideas!

Answer 1:

There are typically two parts to such a problem: figuring out which items are likely in error, and then fixing those.

If you assume that the majority of items are spelled correctly, then finding the likely errors is pretty easy. Fixing the errors is a lot harder to automate, and it's probably impossible to do it 100% correctly in any reasonable length of time. But you might find that if you do a good job finding the errors, fixing them manually is no big deal.

To find the errors I would suggest that you make a list of each of the skills and a count of how many times each skill is referenced in the entire data set. When you're done you'll have a list like:

MANAGEMENT, 22
JAVA, 298
HADOOP, 12
HADUP, 1
SALES, 200
SALS, 1

etc. Each skill is listed along with the number of users who possess that skill.

Now, sort those by frequency and choose a threshold. Say you choose to examine more closely anything that has a frequency of 3 or less. The idea is that items that are used a very small number of times in relation to other items are probably misspellings.
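The counting and threshold step can be sketched in a few lines of Python. This is only an illustration: it assumes each record's skills field is a comma-separated string, and the name `rare_skills` and the sample records are made up for the example.

```python
from collections import Counter

def rare_skills(records, threshold=3):
    """Return (skill, count) pairs whose count is at or below threshold."""
    counts = Counter()
    for record in records:
        for skill in record.split(","):
            # Normalise whitespace and case so "hadoop " and "HADOOP" count together.
            skill = skill.strip().upper()
            if skill:
                counts[skill] += 1
    # Rarest first; ties broken alphabetically for stable output.
    return sorted(
        ((s, c) for s, c in counts.items() if c <= threshold),
        key=lambda item: (item[1], item[0]),
    )

# Hypothetical sample data, one string per user.
records = [
    "MANAGEMENT, JAVA, HADOOP",
    "JAVA, SALES",
    "JAVA, SALS",
    "HADUP, SALES",
    "JAVA, SALES, HADOOP",
    "JAVA, SALES",
]
suspects = rare_skills(records, threshold=1)
print(suspects)
```

On this tiny sample, MANAGEMENT is flagged alongside HADUP and SALS, because a rare term is not necessarily a misspelled one. That is why the flagged list still needs a human look.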

Once you've identified the terms you want to examine more closely, you can determine if you'd like to automate the change or if you will do it manually. When I had to do this, I got my list of likely misspellings and manually created a file that had the misspelling and the correction. For example:

SALS,SALES
HADUP,HADOOP
PREFORMANCE,PERFORMANCE

There were a couple hundred, but manually creating the file was a whole lot faster than writing a program to figure out what the correct spelling should be.

Then I loaded that file and went through my user records, making the replacements as required.
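Loading the corrections file and applying it might look roughly like this. It is a sketch under assumptions: the file has one `MISSPELLING,CORRECTION` pair per line as shown above, the skills field is comma-separated, and the function names are invented for the example. An in-memory `io.StringIO` stands in for the real file.

```python
import csv
import io

def load_corrections(lines):
    """Build a {misspelling: correction} dict from 'WRONG,RIGHT' rows."""
    corrections = {}
    for row in csv.reader(lines):
        if len(row) == 2:  # skip blank or malformed rows
            corrections[row[0].strip().upper()] = row[1].strip().upper()
    return corrections

def fix_skills(record, corrections):
    """Replace each known misspelling in one comma-separated skills field."""
    skills = [s.strip().upper() for s in record.split(",") if s.strip()]
    # Terms without an entry in the corrections dict pass through unchanged.
    return ", ".join(corrections.get(s, s) for s in skills)

# Stand-in for open("corrections.csv"); the file name would be your own.
corrections_file = io.StringIO("SALS,SALES\nHADUP,HADOOP\nPREFORMANCE,PERFORMANCE\n")
corrections = load_corrections(corrections_file)
fixed = fix_skills("JAVA, HADUP ,SALS", corrections)
print(fixed)
```

Note that this version matches whole skill terms only, so a multi-word skill such as PREFORMANCE TUNING would need its own row in the file, or the lookup would have to be done word by word.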

The big time saver is finding the likely candidates for replacement. After that, fixing them is almost an afterthought.

That is, unless you really want to spend months on a research project. Then you can knock yourself out playing with edit distance algorithms, phonetic algorithms, and other stuff that might figure out that "edicit" and "etiquette" are supposed to be the same word.
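If you do want to try automated suggestions without the research project, Python's standard library offers a cheap starting point. The sketch below uses `difflib.get_close_matches`, which ranks candidates by a similarity ratio rather than a true edit distance. The `suggest` helper, the list of known skills, and the 0.7 cutoff are assumptions for illustration and would need tuning on real data.

```python
import difflib

def suggest(term, known_skills, cutoff=0.7):
    """Return the most similar known skill, or None if nothing is close enough."""
    matches = difflib.get_close_matches(term.upper(), known_skills, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Hypothetical list of frequent (presumed correct) skills.
known_skills = ["MANAGEMENT", "JAVA", "HADOOP", "SALES"]

for term in ["SALS", "HADUP", "XYZZY"]:
    print(term, "->", suggest(term, known_skills))
```

A reasonable way to use this is to generate a draft of the corrections file for the rare terms and then review it by hand, since similarity scores alone will produce some wrong matches.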



Answer 2:

Something that works very nicely for this in the machine learning paradigm is string matching kernels (string kernels). Since these are actual kernel functions, they are very convenient if you want to formulate the learning problem as an SVM.
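To make the idea concrete, here is a minimal sketch of one simple string kernel, the n-gram spectrum kernel, which is the inner product of two strings' character n-gram counts. The answer does not say which kernel it has in mind, so the choice of kernel, the function names, and n=2 are all assumptions made for this example.

```python
from collections import Counter
from math import sqrt

def ngrams(text, n=2):
    """Count the overlapping character n-grams of a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def spectrum_kernel(a, b, n=2):
    """Inner product of the n-gram count vectors of a and b."""
    counts_a, counts_b = ngrams(a, n), ngrams(b, n)
    return sum(count * counts_b[gram] for gram, count in counts_a.items())

def normalized_kernel(a, b, n=2):
    """Cosine-normalised spectrum kernel, in the range [0, 1]."""
    denom = sqrt(spectrum_kernel(a, a, n) * spectrum_kernel(b, b, n))
    return spectrum_kernel(a, b, n) / denom if denom else 0.0

for other in ["HADOOP", "SALES"]:
    print("HADUP", other, round(normalized_kernel("HADUP", other), 3))
```

Here HADUP scores higher against HADOOP than against SALES because it shares the bigrams HA and AD with the former and none with the latter. A matrix of such kernel values between all pairs of terms is the kind of input an SVM with a precomputed kernel expects.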