Sequence learning using Conditional Random Fields?

Posted 2019-03-03 07:08

Question:

I am new to sequential learning (& machine learning) & am trying to understand how to use conditional random fields to solve my problem.

I have a dataset that is a sequential log of when & where an end user of my application worked. For example, the following dataset has values only for User1:

User   Facility   Weekday
User1  FacilityA  Monday
User1  FacilityB  Tuesday
User1  FacilityC  Wednesday
 ...     ...         ...

I am trying to solve the following problem: Given the weekday and facility a user last worked at, which facility & weekday will they work at next?

To solve this problem, I started looking at Conditional Random Fields, but I am having a tough time getting any library to work.

I tried the following libraries:

  1. PyStruct (https://pystruct.github.io/). This did not work for me due to this issue: "Index out of bounds: Fitting SSVM using Pystruct".

  2. CRFSuite (http://www.chokkan.org/software/crfsuite/). This depends on libLBFGS. Although I installed libLBFGS on my Ubuntu box without any errors, running 'make install' for CRFSuite still fails, saying that it is unable to recognize libLBFGS.

  3. CRF++ (https://taku910.github.io/crfpp/). This is the library I turned to next.

I was able to install CRF++ and run the examples shipped in its distribution. But I need some help understanding how I can modify the template file to fit my use case.

Also, I was thinking my labels would be a concatenated string of facility + weekday from the above dataset.

I am new to sequence learning & currently trying hard to research on how to solve this problem...

Any advice will be extremely helpful as I seem to be a bit stuck here..

Thanks!

Answer 1:

  1. Yes, since you are trying to predict two labels (facility and day), you will need to concatenate them if you use a single model. Alternatively, you can learn two separate models, one per label (see point 3).

  2. I think you should look into regression models for this problem rather than CRFs.

  3. I think the data should be arranged so that a user's log history can be learned easily. Can you tell me the minimum history you have for any user (last 3 logins? 5 logins? 7 logins?)?

Assuming you have the last 3 logins of every user: in your place, I would arrange the data differently and learn two separate models, one to predict the day and another to predict the facility. An example of arrangement of data and template file for predicting day is here. Similarly, you can replace the names of the days of the week with facility names and learn a model for predicting the facility. You can also think of more features to add to the ones I have suggested. If you have more user data (say occupation or age), you should definitely try adding it as extra columns in the training data and referencing those columns as features in the template file. Remember, the test file should be arranged in the same way as the training file (except that the last column can be empty or missing, because it is the label the model predicts during testing).
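To make the arrangement concrete, here is a rough Python sketch of one way to lay out the data for the day model. It is an assumption on my part, not necessarily the exact layout in the linked example: each row holds the weekdays of the previous 3 logins as feature columns and the weekday of the current login as the label in the last column, which is where CRF++ expects the label. The helper names `build_rows` and `build_template`, and the Thursday/Friday log entries, are made up for illustration.

```python
from typing import List, Tuple


def build_rows(log: List[Tuple[str, str]], history: int = 3) -> List[str]:
    """Turn one user's chronological (facility, weekday) log into CRF++ rows.

    Each row: weekdays of the previous `history` logins, then the weekday
    of the current login as the label (last column), separated by spaces.
    """
    rows = []
    for i in range(history, len(log)):
        prev_days = [day for _, day in log[i - history:i]]
        label = log[i][1]
        rows.append(" ".join(prev_days + [label]))
    return rows


def build_template(history: int = 3) -> List[str]:
    """Build CRF++ template lines for `history` feature columns.

    %x[0,c] means: current row (offset 0), column c.
    """
    # One unigram feature per history column.
    lines = [f"U{c:02d}:%x[0,{c}]" for c in range(history)]
    # One unigram feature combining all history columns.
    combo = "/".join(f"%x[0,{c}]" for c in range(history))
    lines.append(f"U{history:02d}:{combo}")
    # Bigram feature over the previous and current output labels.
    lines.append("B")
    return lines


if __name__ == "__main__":
    # Hypothetical log for User1; only the first three rows come from the question.
    log = [
        ("FacilityA", "Monday"),
        ("FacilityB", "Tuesday"),
        ("FacilityC", "Wednesday"),
        ("FacilityA", "Thursday"),
        ("FacilityB", "Friday"),
    ]
    print("\n".join(build_rows(log)))
    print()
    print("\n".join(build_template()))
```

For the facility model, the same function could be reused by reading `log[i][0]` instead of `log[i][1]`. In the CRF++ file, rows belonging to different users would be separated by a blank line so that each user forms its own sequence.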

If you want to go ahead and predict both labels in one model, you can try concatenation (in the example that I've given you, day will now become day_facility).
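A small sketch of the concatenation idea, assuming an underscore separator and weekday names that do not themselves contain an underscore. The helper names `join_label` and `split_label` are hypothetical.

```python
from typing import Tuple


def join_label(weekday: str, facility: str, sep: str = "_") -> str:
    """Combine weekday and facility into a single label, e.g. day_facility."""
    return f"{weekday}{sep}{facility}"


def split_label(label: str, sep: str = "_") -> Tuple[str, str]:
    """Recover (weekday, facility) from a predicted combined label.

    Splits on the first separator only, so facility names may contain it.
    """
    weekday, facility = label.split(sep, 1)
    return weekday, facility
```

Keep in mind that the combined label set can grow to 7 times the number of facilities, so a single model needs enough examples of each day/facility pair to learn from.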