I have a Twitter-like (microblog) data set with 1.6 million data points, and I am trying to predict each post's retweet count from its content. I extracted keywords and used them as bag-of-words features, which gave me a 1.2-million-dimensional feature space. The feature vectors are very sparse, usually with only about ten nonzero dimensions per data point. I am using SVR for the regression, and training has already been running for 2 days. I suspect it could take much longer. Is this a normal way to approach the task? Is there a way to optimize it, or is optimization even necessary?
BTW, in this case I am not using any kernel, and the machine has 32 GB of RAM and a 16-core i7. Roughly how long should training take? I am using the pyml library.
You need to find a dimensionality reduction approach that works for your problem.
I've worked on a similar problem to yours and I found that Information Gain worked well, but there are others.
I found this paper (Fabrizio Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys, Vol. 34, No.1, pp.1-47, 2002) to be a good theoretical treatment of text classification, including feature reduction by a variety of methods from the simple (Term Frequency) to the complex (Information-Theoretic).
These techniques help you choose terms that are most useful in separating the training documents into the given classes; the terms with the highest predictive value for your problem.
I've been successful using Information Gain for feature reduction and found this paper (Christine Largeron, Christophe Moulin, and Mathias Géry, Entropy Based Feature Selection for Text Categorization, SAC, pp. 924-928, 2011) to be a very good practical guide.
Here the authors present a simple formulation of entropy-based feature selection that's useful for implementation in code:
Using this contingency table, Information Gain can be estimated by:
This approach is easy to implement and provides very good Information-Theoretic feature reduction.
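As a rough illustration, here is a minimal sketch of Information Gain computed from per-term presence/absence counts. It uses the textbook definition IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t), which is not necessarily the exact formulation from the paper above. The function names and the toy data are made up for the example, and it assumes the target has been discretized into classes (for retweet counts, something like "low"/"high" bins).

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """IG(term) = H(C) - P(t) H(C|t) - P(not t) H(C|not t).

    docs   : list of sets of terms, one set per document (hypothetical format)
    labels : list of class labels, same length as docs
    """
    with_term = Counter()     # class counts among documents containing the term
    without_term = Counter()  # class counts among documents lacking the term
    for terms, label in zip(docs, labels):
        if term in terms:
            with_term[label] += 1
        else:
            without_term[label] += 1

    n = len(docs)
    n_with = sum(with_term.values())
    n_without = n - n_with
    classes = set(with_term) | set(without_term)

    h_class = entropy([with_term[c] + without_term[c] for c in classes])
    h_cond = (n_with / n) * entropy(list(with_term.values())) \
        + (n_without / n) * entropy(list(without_term.values()))
    return h_class - h_cond


# Toy data, invented for illustration only.
docs = [{"free", "win"}, {"free", "prize"}, {"meeting"}, {"lunch", "meeting"}]
labels = ["high", "high", "low", "low"]

# Rank the vocabulary by Information Gain and keep the top-scoring terms.
vocab = set().union(*docs)
ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t), reverse=True)
```

In practice you would keep only the top-k terms from `ranked` (k is something to tune on your data) and rebuild the feature vectors from that reduced vocabulary.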
You needn't use a single technique either; you can combine them. Term Frequency is simple, but can also be effective. I've combined the Information Gain approach with Term Frequency to do feature selection successfully. You should experiment with your data to see which technique or techniques work most effectively.
As a first step, you can simply remove all words with very high frequency and all words with very low frequency, because neither tells you much about the content of a text. Then apply word stemming.
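The frequency filter could look something like the sketch below. It counts document frequency (how many documents a term appears in) and drops terms outside a band. The function name and the `min_df` / `max_df_ratio` thresholds are placeholders I made up; the right cutoffs depend on your data. Stemming is not shown here, since it needs a language-specific stemmer.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_df=2, max_df_ratio=0.5):
    """Drop terms that are too rare or too common.

    docs         : list of token lists, one per document
    min_df       : keep terms appearing in at least this many documents
    max_df_ratio : keep terms appearing in at most this fraction of documents
    Both thresholds are illustrative defaults, not recommendations.
    """
    df = Counter()
    for terms in docs:
        df.update(set(terms))  # count each term once per document

    max_df = max_df_ratio * len(docs)
    vocab = {t for t, n in df.items() if min_df <= n <= max_df}
    filtered = [[t for t in terms if t in vocab] for terms in docs]
    return filtered, vocab


# Toy data, invented for illustration only.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
    ["the", "bird", "flew"],
]
filtered, vocab = filter_by_document_frequency(docs)
```

Here "the" is dropped for being in every document, and the words that occur in only one document are dropped for being too rare.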
After that, you can try to reduce the dimensionality of your space with feature hashing, a more advanced dimensionality reduction technique (PCA, ICA), or both.
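For feature hashing, here is a minimal pure-Python sketch of the idea: each term is hashed to one of a fixed number of buckets, so the dimensionality is bounded no matter how large the vocabulary is. The function name, the choice of MD5 as the hash, and the bucket count are assumptions for the example. scikit-learn's `FeatureHasher` and `HashingVectorizer` provide ready-made versions of this if you would rather not roll your own.

```python
import hashlib


def hash_features(terms, n_buckets=2 ** 18):
    """Map a bag of words to a sparse {bucket_index: value} dict.

    Uses signed hashing: one bit of the hash picks +1 or -1, so that
    collisions tend to cancel rather than accumulate.
    n_buckets is an illustrative default, not a recommendation.
    """
    vec = {}
    for term in terms:
        digest = hashlib.md5(term.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")     # 64-bit integer hash
        index = h % n_buckets                     # bucket for this term
        sign = 1.0 if (h >> 63) & 1 == 0 else -1.0
        vec[index] = vec.get(index, 0.0) + sign
    # Drop buckets whose contributions cancelled out exactly.
    return {i: v for i, v in vec.items() if v != 0.0}


# Toy usage: a 1.2M-term vocabulary would be squeezed into n_buckets dimensions.
example = hash_features(["retweet", "follow", "retweet"], n_buckets=1024)
```

The output stays sparse, so it can be fed to a linear model in the same way as the original bag-of-words vectors, just with far fewer columns.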