I have a dataset with a column that is currently being treated as a factor with 1000+ levels. These are values for the column. I would like to clean up this data. Some values are strings like "-18 + 5 = -13" and "5 - 18 = -13", I would like the clustering to group these differently than say "R3no4".
Is this possible in R? I looked at the natural language processing task view http://cran.r-project.org/web/views/NaturalLanguageProcessing.html but I need to be pushed in the right direction.
the dataset is from the kdd 2010 cup I would like to create meaningful new columns from this column to aid in creating a predictive model. for example it would be nice to know if the string contains a certain operation, or if it contains no operations and instead is describing the problem.
my data frame looks like this:
str(data1)
'data.frame': 809694 obs. of 19 variables:
$ Row : int 1 2 3 4 5 6 7 8 9 10 ...
$ Anon.Student.Id : Factor w/ 574 levels "02i5jCrfQK","02ZjVTxC34",..: 7 7 7 7 7 7 7 7 7 7 ...
$ Problem.Hierarchy : Factor w/ 138 levels "Unit CTA1_01, Section CTA1_01-1",..: 80 80 80 80 80 80 80 80 80 80 ...
$ Problem.Name : Factor w/ 1084 levels "1PTB02","1PTB03",..: 377 377 378 378 378 378 378 378 378 378 ...
$ Problem.View : int 1 1 1 1 2 2 3 3 4 4 ...
$ Step.Name : Factor w/ 187539 levels "-(-0.24444444-y) = -0.93333333",..: 116742 177541 104443 64186 58776 58892 153246 153078 45114 163923 ...
I'm most interested in the Step.Name feature, since it contains the greatest number of unique factor values.
and some example values for step name:
[97170] (1+7)/4 = x
[97171] (1-sqrt(1^2-4*2*-6))/4 = x
[97172] (1-sqrt(1^2-(-48)))/4 = x
[97173] (1-sqrt(1-(-48)))/4 = x
[97174] (1-sqrt(49))/4 = x
[97175] (1-7)/4 = x
[97176] x^2+15x+44 = 0
[97177] a-factor-node
[97178] b-factor-node
[97179] c-factor-node
[97180] num1-factor-node
[97181] num2-factor-node
[97182] den1-factor-node
[97183] (-15?sqrt((-15)^2-4*1*44))/2 = x
[97184] (-15+sqrt((-15)^2-4*1*44))/2 = x
[97185] (-15+sqrt((-15)^2-176))/2 = x
[97186] (-15+sqrt(225-176))/2 = x
[97187] (-15+sqrt(49))/2 = x
[97188] (-15+7)/2 = x
[97189] (-15-sqrt((-15)^2-4*1*44))/2 = x
[97190] (-15-sqrt((-15)^2-176))/2 = x
[97191] (-15-sqrt(225-176))/2 = x
[97192] (-15-sqrt(49))/2 = x
[97193] (-15-7)/2 = x
[97194] 2x^2+x = 0
[97195] a-factor-node
[97196] b-factor-node
[97197] c-factor-node
[97198] num1-factor-node
[97199] num2-factor-node
[97200] den1-factor-node
[97201] (-1?sqrt((-1)^2-4*2*0))/4 = x
[97202] (-1+sqrt((-1)^2-4*2*0))/4 = x
[97203] (-1+sqrt((-1)^2-0))/4 = x
[97204] (-1+sqrt((-1)^2))/4 = x
[97205] (-1+1)/4 = x
[97206] (-1-sqrt((-1)^2-4*2*0))/4 = x
[97207] (-1-sqrt((-1)^2-0))/4 = x
[97208] (-1-sqrt((-1)^2))/4 = x
[97209] (-1-1)/4 = x
[97210] x^2-6x = 0
[97211] a-factor-node
[97212] b-factor-node