I was playing with Mahout and found that the FileDataModel accepts data in the format
userId,itemId,pref(long,long,Double).
I have some data which is of the format
String,long,double
What is the best/easiest method to work with this dataset on Mahout?
One way to do this is by creating an extension of FileDataModel. You'll need to override the readUserIDFromString(String value) method to use some kind of resolver do the conversion. You can use one of the implementations of IDMigrator, as Sean suggests.
For example, assuming you have an initialized MemoryIDMigrator, you could do this:
@Override
protected long readUserIDFromString(String stringID) {
long result = memoryIDMigrator.toLongID(stringID);
memoryIDMigrator.storeMapping(result, stringID);
return result;
}
This way you could use memoryIDMigrator to do the reverse mapping, too. If you don't need that, you can just hash it the way it's done in their implementation (it's in AbstractIDMigrator).
userId and itemId can be string, so this is the CustomFileDataModel which will convert your string into integer and will keep the map (String,Id) in memory; after recommendations you can get string from id.
Assuming that your input fits in memory, loop through it. Track the ID for each string in a dictionary. If it does not fit in memory, use sort and then group by to accomplish the same idea.
In python:
import sys
import sys
next_id = 0
str_to_id = {}
for line in sys.stdin:
fields = line.strip().split(',')
this_id = str_to_id.get(fields[0])
if this_id is None:
next_id += 1
this_id = next_id
str_to_id[fields[0]] = this_id
fields[0] = str(this_id)
print ','.join(fields)