Using Naive Bayes Classification to Identify a Twi

Posted 2019-08-11 06:07

I have joined a project at school that has been a lot of fun so far, and it just got a bit more interesting. I have roughly 600,000 tweets (each contains screen name, geolocation, text, etc.), and my goal is to classify each user as either male or female. Using Twitter4J I can get the user's full name, number of friends, retweets, etc., so I was wondering whether combining a look at the user's name with text analysis would be a workable approach. I was originally thinking of a rule-based classifier that first looks at the user's name, then analyzes their text, and attempts to conclude M or F. I'm guessing I would have trouble using something such as naive bayes since I don't have the real truth values?

For the names, I would check some kind of dictionary to decide whether a name is male or female. I know there are cases where it's hard to tell, which is why I'd look at the tweet texts as well. I also forgot to mention: across these 600,000 tweets, I have at least two tweets per user.

Any ideas or input on classifying a user's gender would be greatly appreciated! I don't have a ton of experience in this area and I'm looking to learn anything I can get my hands on.

2 answers
Rolldiameter
#2 · 2019-08-11 06:51
  • You need to develop a vocabulary linking names to gender.
  • Then you have to define features for each tweet.
  • Finally, you can use Weka (Java), Matlab, or Python to build the learning set.
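As a rough illustration of the "define features for each tweet" step, here is a minimal Python sketch. The feature names (`has_url`, `n_mentions`, `tokens`) and the regular expressions are assumptions chosen for the example, not something prescribed in this thread.

```python
import re

# Simplified patterns (assumptions): real tweets may need more careful handling.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"[a-z']+")

def tweet_features(text):
    """Turn one tweet's text into a small feature dictionary."""
    lowered = text.lower()
    # Remove URLs and mentions before tokenizing so they do not end up as words.
    stripped = MENTION_RE.sub(" ", URL_RE.sub(" ", lowered))
    return {
        "has_url": bool(URL_RE.search(lowered)),
        "n_mentions": len(MENTION_RE.findall(lowered)),
        "tokens": TOKEN_RE.findall(stripped),
    }
```

With at least two tweets per user, the per-tweet dictionaries could then be merged per user (for example by concatenating the token lists) before building the learning set.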

Main issues:

  1. Your language? Identifying sex from a name is easy in Italian (-a female, -o male, except for names such as Andrea and Luca); otherwise take a look at the question "Does anyone know of a good library for mapping a person's name to his or her gender?"
  2. The second issue is a bit more complicated: you need a semantic dictionary, or you can analyse only the destination of the tweet (#to) or the presence of a URL or image.
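The Italian suffix rule from the first point can be sketched in a few lines of Python. The function name `rc_italian` and the "Do not know" fallback are assumptions for illustration, and the exception set contains only the two names mentioned above; a real list would be longer.

```python
# Male names ending in "-a" that were named as exceptions in the answer.
MALE_EXCEPTIONS = {"andrea", "luca"}

def rc_italian(full_name):
    """Guess gender from the first name's final vowel; abstain otherwise."""
    parts = full_name.strip().lower().split()
    if not parts:
        return "Do not know"
    first = parts[0]
    if first in MALE_EXCEPTIONS or first.endswith("o"):
        return "Male"
    if first.endswith("a"):
        return "Female"
    # Names ending in other letters are left undecided.
    return "Do not know"
```

Abstaining on anything the rule does not cover keeps the rule's answers trustworthy, at the cost of leaving more users unlabelled.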
Root(大扎)
#3 · 2019-08-11 07:08

I'm guessing I would have trouble using something such as naive bayes since I don't have the real truth values?

Any supervised learning algorithm, such as Naive Bayes, requires a labelled training set. Without the actual gender for at least some of the data you cannot build such a model. On the other hand, if you come up with a rule-based system (like one based on the users' names), you can try a semi-supervised approach. Say your rule-based classifier is RC and can answer "Male", "Female", or "Do not know"; you can then use RC to label your data X in a natural way:

X_m = { x in X : RC(x)="Male" }
X_f = { x in X : RC(x)="Female" }
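In Python, the two sets might be built roughly as follows. The `NAME_GENDER` lookup and the `rc` function here are hypothetical stand-ins for a real name dictionary, and users are assumed to be dictionaries with a `"name"` key.

```python
# Hypothetical name lookup standing in for a real name-to-gender dictionary.
NAME_GENDER = {"maria": "Female", "john": "Male"}

def rc(user):
    """Toy rule-based classifier: look up the first name, abstain if unknown."""
    parts = user["name"].lower().split()
    return NAME_GENDER.get(parts[0], "Do not know") if parts else "Do not know"

def partition_by_rule(users, rc):
    """Split users into X_m, X_f and the undecided remainder."""
    x_m = [u for u in users if rc(u) == "Male"]
    x_f = [u for u in users if rc(u) == "Female"]
    x_unknown = [u for u in users if rc(u) == "Do not know"]
    return x_m, x_f, x_unknown
```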

Once you have done that, you can create a training set for the supervised learning model from X_m and X_f, using all of your data except what was used to build RC, so in this case everything but the users' names (I assume that RC answers "Male" or "Female" iff it is entirely "sure" about it). As a result, you will train a classifier that tries to generalize the concept of gender from all the additional data (words used, location, etc.). Let's call it SC. After that, you can simply create a "complex" classifier:
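To make the SC step concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier in plain Python with add-one smoothing. The class name `NaiveBayesText` is made up for this example, and it only covers word tokens; in practice a library such as Weka or scikit-learn would likely be used instead. The labels passed to `fit` would be the ones RC produced, not ground truth.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesText:
    """Multinomial Naive Bayes over word tokens with add-one smoothing."""

    def fit(self, docs, labels):
        # docs: list of token lists; labels: pseudo-labels assigned by RC.
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus log likelihood of each known word.
            score = math.log(n_docs / total_docs)
            for w in tokens:
                if w in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Placeholder tokens only; real input would be words from the tweets.
docs = [["alpha", "beta"], ["gamma", "delta"], ["alpha", "alpha"], ["gamma", "gamma"]]
labels = ["Female", "Male", "Female", "Male"]
sc_model = NaiveBayesText().fit(docs, labels)
```

Because the training labels come from RC, any systematic mistakes in the rules are inherited by SC, so it is worth hand-checking a small sample if possible.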

C(x) = "Male" iff RC(x)="Male" or
                  (RC(x)="Do not know" && SC(x)="Male")
       "Female" iff RC(x)="Female" or
                    (RC(x)="Do not know" && SC(x)="Female")
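The same definition, written as a small Python function. Here `rc` and `sc` are assumed to be callables that return the strings used above; the function name `combined_classifier` is made up for this sketch.

```python
def combined_classifier(x, rc, sc):
    """Return rc's answer when it is decisive, otherwise fall back to sc."""
    rule_answer = rc(x)
    if rule_answer in ("Male", "Female"):
        return rule_answer
    # rc abstained ("Do not know"), so the learned classifier decides.
    return sc(x)
```

Note that SC is only consulted when RC abstains, so a confident rule is never overridden by the learned model.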

This way you can, on the one hand, use the most valuable information (the user name) in a rule-based way, while at the same time exploiting the power of supervised learning for the "hard cases", despite not having the "ground truth" in the first place.
