Background
I am looking to automate the creation of Domains in JasperServer. A Domain is a "view" of data used to build ad hoc reports, and the names of its columns must be presented to the user in a human-readable fashion.
Problem
There are over 2,000 possible pieces of data that the organization could theoretically want to include on a report. The data carry non-human-friendly names such as:
payperiodmatchcode
labordistributioncodedesc
dependentrelationship
actionendoption
actionendoptiondesc
addresstype
addresstypedesc
historytype
psaddresstype
rolename
bankaccountstatus
bankaccountstatusdesc
bankaccounttype
bankaccounttypedesc
beneficiaryamount
beneficiaryclass
beneficiarypercent
benefitsubclass
beneficiaryclass
beneficiaryclassdesc
benefitactioncode
benefitactioncodedesc
benefitagecontrol
benefitagecontroldesc
ageconrolagelimit
ageconrolnoticeperiod
Question
How would you automatically change such names to:
- pay period match code
- labor distribution code desc
- dependent relationship
Ideas
Use Google's "Did you mean" engine; however, I think doing so would violate their TOS:
lynx -dump «url» | grep "Did you mean" | awk ...
Languages
Any language is fine, but a language suited to text processing, such as Perl, would probably be a good fit. (The column names are English-only.)
Unnecessary Perfection
The goal is not 100% perfection in breaking words apart; the following outcomes, including the occasional mis-split, are acceptable:
- enrollmenteffectivedate -> Enrollment Effective Date
- enrollmentenddate -> Enroll Men Tend Date
- enrollmentrequirementset -> Enrollment Requirement Set
No matter what, a human will need to double-check the results and correct many of them. Whittling a set of 2,000 results down to 600 edits would still be a dramatic time savings. Fixating on the cases that have multiple valid splits (e.g., therapistname) misses the point altogether.
Sometimes, brute force is acceptable:
#!/usr/bin/perl
use strict; use warnings;
use File::Slurp;
my $dict_file = '/usr/share/dict/words';
my @identifiers = qw(
payperiodmatchcode labordistributioncodedesc dependentrelationship
actionendoption actionendoptiondesc addresstype addresstypedesc
historytype psaddresstype rolename bankaccountstatus
bankaccountstatusdesc bankaccounttype bankaccounttypedesc
beneficiaryamount beneficiaryclass beneficiarypercent benefitsubclass
beneficiaryclass beneficiaryclassdesc benefitactioncode
benefitactioncodedesc benefitagecontrol benefitagecontroldesc
ageconrolagelimit ageconrolnoticeperiod
);
# Supplement the dictionary with domain-specific terms it lacks.
my @mydict = qw( desc );

# Build one big alternation of every word of three or more letters,
# longest first (alternation tries its alternatives in order).
my $pat = join('|',
    map quotemeta,
    sort { length $b <=> length $a || $a cmp $b }
    grep { 2 < length }
    (@mydict, map { chomp; $_ } read_file $dict_file)
);
my $re = qr/$pat/;

for my $identifier ( @identifiers ) {
    my @stack;
    print "$identifier : ";
    # Repeatedly strip the longest word that matches at the end of the name.
    while ( $identifier =~ s/($re)\z// ) {
        unshift @stack, $1;
    }
    # Mark suspicious cases: whatever remains could not be matched.
    unshift @stack, '*', $identifier if length $identifier;
    print "@stack\n";
}
Output:
payperiodmatchcode : pay period match code
labordistributioncodedesc : labor distribution code desc
dependentrelationship : dependent relationship
actionendoption : action end option
actionendoptiondesc : action end option desc
addresstype : address type
addresstypedesc : address type desc
historytype : history type
psaddresstype : * ps address type
rolename : role name
bankaccountstatus : bank account status
bankaccountstatusdesc : bank account status desc
bankaccounttype : bank account type
bankaccounttypedesc : bank account type desc
beneficiaryamount : beneficiary amount
beneficiaryclass : beneficiary class
beneficiarypercent : beneficiary percent
benefitsubclass : benefit subclass
beneficiaryclass : beneficiary class
beneficiaryclassdesc : beneficiary class desc
benefitactioncode : benefit action code
benefitactioncodedesc : benefit action code desc
benefitagecontrol : benefit age control
benefitagecontroldesc : benefit age control desc
ageconrolagelimit : * ageconrol age limit
ageconrolnoticeperiod : * ageconrol notice period
See also "A Spellchecker Used to Be a Major Feat of Software Engineering".
I reduced your list to the 32 atomic terms I was concerned with and arranged them longest-first in a regex:
use strict;
use warnings;
my $qr
= qr/ \G # right after last match
( distribution
| relationship
| beneficiary
| dependent
| subclass
| account
| benefit
| address
| control
| history
| percent
| action
| amount
| conrol
| notice
| option
| period
| status
| class
| labor
| limit
| match
| bank
| code
| desc
| name
| role
| type
| age
| end
| pay
| ps
)
/x;
while ( <DATA> ) {
chomp;
print;
print ' -> ', join( ' ', m/$qr/g ), "\n";
}
__DATA__
payperiodmatchcode
labordistributioncodedesc
dependentrelationship
actionendoption
actionendoptiondesc
addresstype
addresstypedesc
historytype
psaddresstype
rolename
bankaccountstatus
bankaccountstatusdesc
bankaccounttype
bankaccounttypedesc
beneficiaryamount
beneficiaryclass
beneficiarypercent
benefitsubclass
beneficiaryclass
beneficiaryclassdesc
benefitactioncode
benefitactioncodedesc
benefitagecontrol
benefitagecontroldesc
ageconrolagelimit
ageconrolnoticeperiod
Here is a Lua program that greedily takes the longest dictionary prefix at each step. Note that it does not backtrack, so when the remainder of a name cannot be matched, the unmatched tail is silently dropped from the output:
-- Load the dictionary into a set for O(1) lookups.
local W = {}
for w in io.lines("/usr/share/dict/words") do
  W[w] = true
end

-- Greedily peel off the longest dictionary prefix (three letters or more).
function split(s)
  for n = #s, 3, -1 do
    local w = s:sub(1, n)
    if W[w] then return w, split(s:sub(n + 1)) end
  end
end

for s in io.lines() do
  print(s, "-->", split(s))
end
Given that some words can be substrings of others, especially once several words are smashed together, I think simple solutions like regexes are out. I'd go with a full-on parser; my experience is with ANTLR. If you want to stick with Perl, I've had good luck calling ANTLR parsers, generated as Java, through Inline::Java.
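An ANTLR grammar is too heavy for a short example here, but the substring problem this answer raises can be illustrated with a small backtracking splitter. The sketch below is only an illustration of that fallback idea, not the ANTLR approach, and its three-word dictionary is deliberately tiny:
#!/usr/bin/perl
# Sketch: longest-prefix splitting *with backtracking*, so a greedy
# misstep (taking "benefits" out of "benefitsubclass") is undone when
# the remainder cannot be split.
use strict;
use warnings;

my %words = map { $_ => 1 } qw( benefit benefits subclass );

sub split_name {
    my ($s) = @_;
    return [] if $s eq '';
    # Try the longest prefix first, falling back to shorter ones.
    for my $n ( reverse 1 .. length $s ) {
        next unless $words{ substr $s, 0, $n };
        my $rest = split_name( substr $s, $n );
        return [ substr( $s, 0, $n ), @$rest ] if $rest;
    }
    return;    # no split possible
}

my $parts = split_name('benefitsubclass');
print $parts ? "@$parts\n" : "(no split)\n";    # prints: benefit subclass
A memoized version of the same recursion scales to a full /usr/share/dict/words dictionary.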
Peter Norvig has a great Python script with a word-segmentation function based on unigram/bigram statistics: see the logic of the function segment2 in ngrams.py at http://norvig.com/ngrams/. Details are in the chapter "Natural Language Corpus Data" from the book Beautiful Data (Segaran and Hammerbacher, 2009).
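To give the flavor of that approach in this thread's language of choice, here is a minimal Perl sketch of the unigram half of the idea (ngrams.py's segment; segment2 adds bigram context). It assumes a frequency file in the word<TAB>count format of count_1w.txt from the page above, and the unknown-word penalty loosely follows Norvig's length-based smoothing:
#!/usr/bin/perl
use strict;
use warnings;

# Load unigram counts (assumes Norvig's count_1w.txt: "word<TAB>count" lines).
my %count;
my $total = 0;
open my $fh, '<', 'count_1w.txt' or die "count_1w.txt: $!";
while (<$fh>) {
    chomp;
    my ($word, $n) = split /\t/;
    $count{$word} = $n;
    $total += $n;
}
close $fh;

# Unigram probability; unseen words are penalized by their length.
sub pword {
    my ($w) = @_;
    return $count{$w} / $total if $count{$w};
    return 10 / ($total * 10 ** length $w);
}

# Best split of $s, maximizing the sum of log-probabilities (memoized).
my %memo;
sub segment {
    my ($s) = @_;
    return [0] if $s eq '';    # element 0 is the score; the rest are words
    $memo{$s} //= do {
        my $best;
        for my $i ( 1 .. length $s ) {
            my $rest  = segment( substr $s, $i );
            my $score = log( pword( substr $s, 0, $i ) ) + $rest->[0];
            $best = [ $score, substr( $s, 0, $i ), @{$rest}[ 1 .. $#$rest ] ]
                if !$best or $score > $best->[0];
        }
        $best;
    };
}

for my $id (qw( payperiodmatchcode enrollmenteffectivedate )) {
    my ( undef, @words ) = @{ segment($id) };
    print "$id : @words\n";
}
Unlike the dictionary approaches above, this favors common words over rare ones, so "enrollmentenddate" is far more likely to come out as "enrollment end date" than "enroll men tend ate".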