I have data in the following form:
37101000ssd48800^A1420asd938987^A2011-09-10^A18:47:50.000^A99.00^A1^A0^A
37101000sd48801^A44557asd03082^A2011-09-06^A13:24:58.000^A42.01^A1^A0^A
First I took the delimiter literally and tried:
line = line.split("^A")
and also:
line = line.split("\\u001")
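For context, the mapper does roughly this (a minimal sketch; the field names are placeholders I made up for the columns, and it assumes the real delimiter in the file is the Ctrl-A control character, which Python spells "\x01" -- "^A" is just how editors display that byte):

#!/usr/bin/env python
# mapper.py -- minimal sketch; splits on the Ctrl-A control character (U+0001)
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\x01")
    if len(fields) < 5:
        continue  # skip malformed lines
    # placeholder field names -- the real columns may differ
    record_id, user_id, date, time, price = fields[:5]
    print("%s\t%s" % (date, price))  # emit a tab-separated key/value pair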
So, the issue is:
The first approach works on my local machine if I do this:
cat input.txt | python mapper.py
It runs fine locally (input.txt contains the data above), but fails on the Hadoop streaming cluster.
Someone told me that I should use "\\u001"
as the delimiter, but that does not work either, on my local machine or on the cluster.
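I'm not even sure whether the file on the cluster contains the actual Ctrl-A byte or the two literal characters ^ and A. Something like this should show the raw bytes (standard od; the Ctrl-A byte shows up as octal 001 in the dump):

head -1 input.txt | od -c | head
# or, from Python: print the repr of the first line to see any escapes
python -c "print(repr(open('input.txt').readline()))"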
For the Hadoop folks:
If I debug it locally using:
cat input.txt | python mapper.py | sort | python reducer.py
this runs just fine when I use "^A"
as the delimiter locally, but I get errors when running on the cluster, and the error code is not very helpful either...
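One idea I'm considering (just a sketch): write diagnostics to stderr, since Hadoop streaming captures stderr in the task logs, and bump a counter via the standard reporter:counter: protocol so bad records show up in the job UI. The field indices below are assumptions:

#!/usr/bin/env python
# mapper.py with diagnostics -- sketch; stderr goes to the streaming task logs
import sys

for line in sys.stdin:
    try:
        fields = line.rstrip("\n").split("\x01")
        print("%s\t%s" % (fields[0], fields[4]))  # assumed key/value columns
    except Exception as e:
        # repr() makes hidden control characters visible in the task log
        sys.stderr.write("bad line: %r (%s)\n" % (line, e))
        # standard Hadoop streaming counter protocol: group,counter,amount
        sys.stderr.write("reporter:counter:mapper,bad_lines,1\n")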
Any suggestions on how I can debug this?
Thanks