Better way to implement a Machine Learning algorithm

Posted 2019-07-20 16:33

Question:

I am doing some web scraping. One of the problems I have run into is that the column headings of the tables I am scraping sometimes differ in their wording just enough that they do not match exactly, so I am using fuzzywuzzy to check their 'nearness'.

My program starts with a list of labels. These labels are all of the column headings from the tables I have scraped off of the web. It also requires that I assign 'normalized column headings' to at least some of them; these serve as the basis for the 'learning'.

matched_labels_dict = {label_1: value_1, label_2: value_2, label_3: value_1, ...}

The dictionary shows that label_1 and label_3 are synonyms; label_2 is not a synonym of those, but it could be a synonym of some other label in the dictionary.
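For concreteness, the dictionary looks something like the sketch below; the heading strings and normalized values are made up purely for illustration:

# Illustrative only: hypothetical scraped headings mapped to hypothetical
# normalized headings. Two keys that share a value are treated as synonyms.
matched_labels_dict = {
    'Total Revenue':   'revenue',    # label_1 -> value_1
    'Employees':       'headcount',  # label_2 -> value_2
    'Revenue (Total)': 'revenue',    # label_3 -> value_1, a synonym of 'Total Revenue'
}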

I also have a list of un_matched_labels

un_matched_labels = [label_324, label_325, label_326, ...]

The number suffixes are just placeholders.

I have a function that uses fuzzywuzzy to generate a score comparing each label in un_matched_labels to the labels in the matched_labels_dict. If the maximum score of the matches is greater than some predetermined level (let's say 90), then the label being tested is added to the matched_labels_dict and assigned the same value as the label it matched. So suppose I am testing label_424 from un_matched_labels and the maximum match score of 94 occurs when it is compared to label_3; I then update the matched_labels_dict:

matched_labels_dict = {label_1: value_1, label_2: value_2, label_3: value_1, label_424: value_1, ...}
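A minimal sketch of that per-label test could look like this (try_match is just an illustrative name, not my actual function, which appears further down; the cutoff of 90 is the example value from above). It uses fuzzywuzzy's process.extractOne helper to pick the best-scoring match:

from fuzzywuzzy import fuzz, process

def try_match(test_label, matched_labels_dict, cutoff=90):
    # Compare test_label against every label already in the dictionary and
    # return (best_label, score), or None if nothing clears the cutoff.
    return process.extractOne(test_label,
                              list(matched_labels_dict.keys()),
                              scorer=fuzz.token_sort_ratio,
                              score_cutoff=cutoff)

hit = try_match('label_424', matched_labels_dict)
if hit is not None:
    best_label, score = hit
    # Adopt the normalized value of the best-scoring match (value_1 from label_3 in this example).
    matched_labels_dict['label_424'] = matched_labels_dict[best_label]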

Now the machine learning comes into play. Suppose label_324 has a match score of 91 with label_424, but its match scores against all the other labels that have value_1 as their value (label_1 and label_3) are below my cutoff (in this case 90).

label_324 cannot be matched until label_424 is in the matched_labels_dict. Since the labels are tested sequentially, label_324 is not added, because at the moment it is tested label_424 is not yet in the matched_labels_dict.

To handle that I rerun the matching function (called do_machine_learning in the code block below).

Here is the do_machine_learning function. all_labels is a list of labels; matched_labels_dict is the dictionary that holds the known label-value matches and is in the form shown above.

from fuzzywuzzy import fuzz

def do_machine_learning(all_labels, matched_labels_dict):
    for test_label in all_labels:
        if test_label not in matched_labels_dict:
            # Score the unmatched label against every label already matched.
            temp_fuzzy_dict = {label: fuzz.token_sort_ratio(label.upper(), test_label.upper())
                               for label in matched_labels_dict}
            # Keep only the scores above the cutoff.
            fuzzy_dict = {key: score for key, score in temp_fuzzy_dict.items() if score > 91}
            try:
                max_value = max(fuzzy_dict.values())
                for label in fuzzy_dict:
                    if fuzzy_dict[label] == max_value:
                        # Adopt the normalized value of the best-scoring match.
                        matched_labels_dict[test_label] = matched_labels_dict[label]
                        break
            except ValueError:
                # fuzzy_dict was empty: nothing cleared the cutoff.
                pass
    return matched_labels_dict

I want to rerun the matching function (which will then add label_324 to the dictionary because of its match score with label_424) until the matched_labels_dict remains constant in size between two iterations. It would remain constant in size because no more matches were found.

Here is how I am doing this; I arbitrarily set the limit of cycles at 100:

len_matched_labels = len(matched_labels_dict)   # size before the first pass
for number in range(1, 100):
    print('cycle', number, 'number_of_matches', len(matched_labels_dict))
    x = do_machine_learning(all_labels, matched_labels_dict)
    if len(x) == len_matched_labels:
        break   # no new matches were added on this pass
    else:
        len_matched_labels = len(x)

The do_machine_learning function is where the unmatched labels are compared and scored against the matched labels. Once the unmatched labels have been run through it, the matched_labels_dict is returned and the program compares the number of matched labels with the number from the previous iteration. If the number has increased, the labels are sent back again to see if new matches can be made. If a pass completes without new matches being made, the program breaks out of the loop. I was asked to put up my do_machine_learning function, but I think it is irrelevant, since my problem is how to cycle through the loop above more pythonically.

So the question is how do I more cleanly set up this iterative process?
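One conventional way to express 'repeat until nothing changes' is a loop that compares the dictionary's size before and after each pass, with no arbitrary 100-cycle cap. This is only a rough sketch, reusing the do_machine_learning function from above:

# Keep re-running the matcher until a full pass adds no new entries.
prev_size = -1
while len(matched_labels_dict) != prev_size:
    prev_size = len(matched_labels_dict)
    matched_labels_dict = do_machine_learning(all_labels, matched_labels_dict)
print('number_of_matches', len(matched_labels_dict))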

Well, the question was closed and I don't really understand why, but I think I figured out a better, cleaner way to handle this. At least it worked for me: I call the function from within itself until the dictionary's size remains constant.

def do_machine_learning(all_labels, matched_labels_dict, min_score):
    initial_size = len(matched_labels_dict)  # added this assignment
    for test_label in all_labels:
        if test_label not in matched_labels_dict:
            # Score the unmatched label against every label already matched.
            temp_fuzzy_dict = {label: fuzz.token_sort_ratio(label.upper(), test_label.upper())
                               for label in matched_labels_dict}
            # Keep only the scores above the cutoff.
            fuzzy_dict = {key: score for key, score in temp_fuzzy_dict.items() if score > min_score}
            try:
                max_value = max(fuzzy_dict.values())
                for label in fuzzy_dict:
                    if fuzzy_dict[label] == max_value:
                        # Store the normalized label plus some bookkeeping about the match.
                        matched_labels_dict[test_label] = {
                            'NEW_LABEL': matched_labels_dict[label]['NEW_LABEL'],
                            'FUZZ_SCORE': max_value,
                            'BEST_MATCH': label,
                        }
                        break
            except ValueError:
                # fuzzy_dict was empty: nothing cleared the cutoff.
                pass
    if len(matched_labels_dict) != initial_size:  # added this check
        # The dictionary grew, so run another pass. The recursive call mutates
        # matched_labels_dict in place, which is why its return value can be ignored here.
        do_machine_learning(all_labels, matched_labels_dict, min_score)
    return matched_labels_dict

With those minor changes I can call the function with:

new_matched_labels = do_machine_learning(all_labels, matched_labels_dict, min_score)

Those changes completely eliminate the need for the loop that starts with

for number in range(1,100):