Here is my word vector:
google
test
stackoverflow
yahoo
I have assigned a value to each of these words as follows:
google : 1
test : 2
stackoverflow : 3
yahoo : 4
Here are some sample users and their words:
user1 google, test, stackoverflow
user2 test, google
user3 test, yahoo
user4 stackoverflow, yahoo
user5 stackoverflow, google
user6 (no words)
To cater for users who do not have a word from the word vector in a given position, I assign 0.
Based on this, the users correspond to:
user1 1, 2, 3
user2 2, 1, 0
user3 2, 4, 0
user4 3, 4, 0
user5 3, 1, 0
user6 0, 0, 0
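In code, the encoding above amounts to roughly the following. This is only a minimal Python sketch: the names `word_ids`, `user_words` and `encode` are made up for illustration, and padding to a fixed length of 3 is an assumption taken from the longest user list.

```python
# Word-to-value mapping, as assigned in the question.
word_ids = {"google": 1, "test": 2, "stackoverflow": 3, "yahoo": 4}

# Sample users and their words, in the order listed in the question.
user_words = {
    "user1": ["google", "test", "stackoverflow"],
    "user2": ["test", "google"],
    "user3": ["test", "yahoo"],
    "user4": ["stackoverflow", "yahoo"],
    "user5": ["stackoverflow", "google"],
    "user6": [],
}


def encode(words, length=3):
    """Replace each word by its value and pad with 0 up to `length`.

    Hypothetical helper; `length=3` is assumed from the longest user list.
    """
    ids = [word_ids[w] for w in words]
    return ids + [0] * (length - len(ids))


encoded = {user: encode(words) for user, words in user_words.items()}

for user, values in encoded.items():
    print(user, values)
```

Note that in this encoding a position does not refer to a fixed word: the first value is 1 (google) for user1 but 2 (test) for user2.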
I am unsure whether these are the correct values, or even whether this is the correct approach for assigning a value to each word in the word vector, so that I can apply 'Euclidean distance' and 'correlation'. I'm basing this on a snippet from the book 'Programming Collective Intelligence':
"Collecting Preferences The first thing you need is a way to represent different people and their preferences. If you were building a shopping site, you might use a value of 1 to indicate that someone had bought an item in the past and a value of 0 to indicate that they had not. "
My dataset has no preference values, so I am just using a unique numerical value per word to represent whether or not a user has that word from the word vector.
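For comparison, this is how I read the book's 1/0 scheme if it were applied to my data: one position per word in the word vector, with 1 if the user has the word and 0 otherwise. This is a sketch of my interpretation of the quote, not code from the book; `vocabulary`, `to_binary` and `euclidean` are hypothetical names.

```python
import math

# Fixed word order: position i always refers to vocabulary[i].
vocabulary = ["google", "test", "stackoverflow", "yahoo"]

user_words = {
    "user1": ["google", "test", "stackoverflow"],
    "user2": ["test", "google"],
    "user3": ["test", "yahoo"],
    "user4": ["stackoverflow", "yahoo"],
    "user5": ["stackoverflow", "google"],
    "user6": [],
}


def to_binary(words):
    """Return a 1/0 vector: 1 if the user has the word, 0 if not."""
    present = set(words)
    return [1 if w in present else 0 for w in vocabulary]


def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


binary = {user: to_binary(words) for user, words in user_words.items()}

print(binary["user1"])  # [1, 1, 1, 0]
print(binary["user2"])  # [1, 1, 0, 0]
print(euclidean(binary["user1"], binary["user2"]))  # 1.0
```

Here every vector has the same length as the word vector, so a user with no words is simply all zeros and no padding is needed.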
Are these the correct values to set for my word vector? How should I determine what these values should be?