Data Cleanup, post conversion from ALLCAPS to Titl

2019-08-22 17:36发布

问题:

Converting a database of people and addresses from ALL CAPS to Title Case will create a number of improperly capitalized words/names, some examples follow:

MacDonald, PhD, CPA, III

Does anyone know of an existing script that will cleanup all the common problem words? Certainly, it will still leave some mistakes behind (less common names with CamelCase-like spellings, i.e. "MacDonalz").

I don't think it matters much, but the data currently resides in MSSQL. Since this is a one-time job, I'd export out to text if a solution requires it.

There is a thread that posed a related question, sometimes touching on this problem, but not addressing this problem specifically. You can see it here:

SQL Server: Make all UPPER case to Proper Case/Title Case

回答1:

Don't know if this is of any help

private static function ucNames($surname) {
//  ( O\' | \- | Ma?c | Fitz )  # attempt to match Irish, Scottish and double-barrelled surnames
    $replaceValue = ucwords($surname);
    return preg_replace('/
                        (?: ^ | \\b )       # assertion: beginning of string or a word boundary
                        ( O\' | \- | Ma?c | Fitz )  # attempt to match Irish, Scottish and double-barrelled surnames
                        ( [^\W\d_] )        # match next char; we exclude digits and _ from \w
                        /xe',
                        "'\$1' . strtoupper('\$2')",
                        $replaceValue);
}

It's a simple PHP function that I use to set surnames to correct case that works for names like O'Connor, McDonald and MacBeth, FitzPatrick, and double-barrelled names like Hedley-Smythe



回答2:

Here is the answer I was looking for:

There is a data company, Melissa Data, who publishes some API and applications for database cleanup -- geared mostly around the direct marketing industry.

I was able to use two applications to solve my problem.

  1. StyleList: this app, among other things, converts ALL CAPS to mixed case and in the process it does not dirty up the data, leaving titles such as CPA, MD, III, etc. in tact; as well as natural, common camel-case names such as McDonalds.
  2. Personator: I used personator to break the Full Name fields into Prefix, First Name, Middle Name, Last Name, and Suffix. To be honest, it was far from perfect, but the data I gave it was pretty challenging (often no space separating a middle name and a suffix). This app does a number of other usefult things as well, including assigning gender to most names. It's available as an API you can call, too.

Here is a link to the solutions offered by Melissa Data:

http://www.melissadata.com/dqt/index.htm

For me, the Melissa Data apps did much of the heavy lifting and the remaining dirty data was identifiable and fixable in SQL by reporting on LEFT x or RIGHT x counts -- the dirt typically has the least uniqueness, patterns easily discovered and fixed.



标签: sql text grep