I have become part of a project at school that has been a lot of fun so far and it just got a little bit more interesting. I have roughly 600,000 tweets in my possession (each contains screen name, geo location, text, etc.) and my goal is to try to classify each user as either male or female. Now using Twitter4J I can get what the user's full name, number of friends, re-tweets, etc. So I was wondering if a combination of looking at a users name and also doing text analysis would be a possible answer. I was originally thinking I could make this like a rule based classifier where I could first look at the user's name then analyze their text and attempt to come to a conclusion of M or F. I'm guessing I would have trouble using something such as naive bayes since I don't have the real truth values?
Also with the names, I would be checking some kind of dictionary to interpret whether the name was male or female. I know there are cases where it's hard to tell but that's why I'd be looking at their tweet texts as well. I also forgot to mention; with these 600,000 tweets, I have at minimum two tweets per user available to me.
Any ideas or input on classifying a user's gender would be greatly appreciated! I don't have a ton of experience in this area and I'm looking to learn anything I can get my hands on.
Main issues:
Any supervised learning algorithm, such as Naive Bayes, requires preparing training set. Without the actual gender for some data you cannot build such a model. On the other hand, if you come out with some rule bases system (like the one based on the users' names) you can try a semi-supervised approach. Using your rule based system, you can create some labelling of your data, lets say that your rule based classifier is
RC
and can answer "Male", "Female", "Do not know", you can create a labelling of your dataX
usingRC
in a natural way:Once you did it, you can create a training set for the supervised learning model using all your data except the one used for creating
RC
- so in this case - users' names (I assume, thatRC
answers "Male" or "Female" iff it is entirely "sure" about it). As a result, you will train a classifier, which will try to generalize concept of gender from all additional data (like words used, location etc.). Lets call itSC
. After that, you can simply create a "complex" classifier:This way you can on one hand use the most valuable information (user name) in the rule based way, while in the same time exploit power of supervised learning for the "hard cases" while not having the "ground truth" in the first place.