Ways to calculate similarity

I am doing a community website that requires me to calculate the similarity between any two users. Each user is described with the following attributes:

age, skin type (oily, dry), hair type (long, short, medium), lifestyle (active outdoor lover, TV junky) and others.

Can anyone tell me how to go about this problem or point me to some resources?

标签： statistics social-networking data-mining pattern-recognition similarity

6条回答

聊天终结者

2楼-- · 2019-01-31 09:48

Three steps to achieve a simple subjective metric for difference between two datapoints that might work fine in your case:

Capture all your variables in a representative numeric variable, for example: skin type (oily=-1, dry=1), hair type (long=2, short=0, medium=1),lifestyle (active outdoor lover=1, TV junky=-1), age is a number.
Scale all numeric ranges so that they fit the relative importance you give them for indicating difference. For example: An age difference of 10 years is about as different as the difference between long and medium hair, and the difference between oily and dry skin. So 10 on the age scale is as different as 1 on the hair scale is as different as 2 on the skin scale, so scale the difference in age by 0.1, that in hair by 1 and and that in skin by 0.5
Use an appropriate distance metric to combine the differences between two people on the various scales in one overal difference. The smaller this number, the more similar they are. I'd suggest simple quadratic difference as a first attempt at your distance function.

Then the difference between two people could be calculated with (I assume Person.age, .skin, .hair, etc. have already gone through step 1 and are numeric):

double Difference(Person p1, Person p2) {

    double agescale=0.1;
    double skinscale=0.5;
    double hairscale=1;
    double lifestylescale=1;

    double agediff = (p1.age-p2.age)*agescale;
    double skindiff = (p1.skin-p2.skin)*skinscale;
    double hairdiff = (p1.hair-p2.hair)*hairscale;
    double lifestylediff = (p1.lifestyle-p2.lifestyle)*lifestylescale;

    double diff = sqrt(agediff^2 + skindiff^2 + hairdiff^2 + lifestylediff^2);
    return diff;
}

Note that diff in this example is not on a nice scale like (0..1). It's value can range from 0 (no difference) to something large (high difference). Also, this method is almost completely unscientific, it is just designed to quickly give you a working difference metric.

0人赞添加讨论(0) 举报

Juvenile、少年°

3楼-- · 2019-01-31 09:51

Another way of computing (in R) all the pairwise dissimilarities (distances) between observations in the data set. The original variables may be of mixed types. The handling of nominal, ordinal, and (a)symmetric binary data is achieved by using the general dissimilarity coefficient of Gower (Gower, J. C. (1971) A general coefficient of similarity and some of its properties, Biometrics 27, 857–874). For more check out this on page 47. If x contains any columns of these data-types, Gower's coefficient will be used as the metric.

For example

x1 <- factor(c(10, 12, 25, 14, 29))
x2 <- factor(c("oily", "dry", "dry", "dry", "oily"))
x3 <- factor(c("medium", "short", "medium", "medium", "long"))
x4 <- factor(c("active outdoor lover", "TV junky", "TV junky", "active outdoor lover", "TV junky"))
x <- cbind(x1,x2,x3,x4)

library(cluster)
daisy(x, metric = "euclidean")

you'll get :

Dissimilarities :
         1        2        3        4
2 2.000000                           
3 3.316625 2.236068                  
4 2.236068 1.732051 1.414214         
5 4.242641 3.741657 1.732051 2.645751

If you are interested on a method for dimensionality reduction for categorical data (also a way to arrange variables into homogeneous clusters) check this

0人赞添加讨论(0) 举报

Ridiculous、

4楼-- · 2019-01-31 09:51

You should read these two topics.

0人赞添加讨论(0) 举报

Melony?

5楼-- · 2019-01-31 09:59

Look at algorithms for computing srting difference. Its very similar to what you need. Store your attributes as a bit string and compute the distance between the strings

0人赞添加讨论(0) 举报

不美不萌又怎样

6楼-- · 2019-01-31 10:00

You probably should take a look for

Data Mining and Data Warehousing (Essential)
Machine Learning (Extra)
Artificial Neural Networks (Especially SOM)
Pattern Recognition (Related)

These topics will let you your program recognize similarities and clusters in your users collection and try to adapt to them...

You can then know different hidden common groups of related users... (i.e users with green hair usually do not like watching TV..)

As an advice, try to use ready implemented tools for this feature instead of implementing it yourself...
Take a look at Open Directory Data Mining Projects

0人赞添加讨论(0) 举报

再贱就再见

7楼-- · 2019-01-31 10:03

Give each attribute an appropriate weight, and add the differences between values.

enum SkinType
    Dry, Medium, Oily

enum HairLength
    Bald, Short, Medium, Long

UserDifference(user1, user2)
    total := 0
    total += abs(user1.Age - user2.Age) * 0.1
    total += abs((int)user1.Skin - (int)user2.Skin) * 0.5
    total += abs((int)user1.Hair - (int)user2.Hair) * 0.8
    # etc...
    return total

If you really need similarity instead of difference, use 1 / UserDifference(a, b)

0人赞添加讨论(0) 举报

Ways to calculate similarity

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间