I have data in this form:
37101000ssd48800^A1420asd938987^A2011-09-10^A18:47:50.000^A99.00^A1^A0^A
37101000sd48801^A44557asd03082^A2011-09-06^A13:24:58.000^A42.01^A1^A0^A
So first I took the delimiter literally and tried:
line = line.split("^A")
and also
line = line.split("\\u001")
So, the issue is:
The first approach works on my local machine if I do this:
cat input.txt | python mapper.py
It runs fine locally (input.txt is the data above), but fails on the Hadoop streaming cluster. Someone told me that I should use "\\u001" as the delimiter, but that doesn't work either, on my local machine or on the cluster.
For Hadoop folks:
If I debug it locally using:
cat input.txt | python mapper.py | sort | python reducer.py
This runs just fine if I use "^A" as the delimiter locally, but I get errors when running on the cluster, and the error message is not very helpful either...
Any suggestions on how I can debug this?
Thanks
If the original data uses a control-A as the delimiter, and it's just being printed as ^A by whatever you're using to list the data, you have two choices:

- Pipe whatever you use to list the data into a Python script that uses split('^A').
- Just use split('\x01') to split on actual control-A values.

The latter is almost always going to be what you really want. The reason this didn't work for you is that you wrote split('\\u001'), escaping the backslash, so you're splitting on the literal five-character string \u001 rather than on control-A. (Note that even unescaped, '\u001' is wrong: a \u escape takes exactly four hex digits, so control-A is '\u0001', or more simply '\x01'.)
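To see the difference concretely, here is a small check you can run locally (the record is the first line from the question, with the real control-A bytes written as \x01):

record = "37101000ssd48800\x011420asd938987\x012011-09-10\x0118:47:50.000\x0199.00\x011\x010\x01"
print(record.split("\\u001"))  # splits on the literal text \u001: no match, one element
print(record.split("\x01"))    # splits on the actual control-A byte: one field per column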
If the original data actually has ^A (a caret followed by an A) as the delimiter, just use split('^A').
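Since the original mapper.py isn't shown, the following is only a hypothetical sketch of what it might look like once the delimiter is fixed: a Hadoop streaming mapper reads records from stdin, splits on the real control-A byte, and writes tab-separated key/value pairs to stdout. The choice of key (the first column) and value (the fifth, price-looking column) is assumed purely for illustration:

import sys

for line in sys.stdin:
    # Split on the actual control-A byte, not the two-character text "^A"
    fields = line.rstrip("\n").split("\x01")
    if len(fields) < 5:
        continue  # skip malformed records rather than crashing the task
    # Hypothetical: key on the first column, emit the fifth column as the value
    print("%s\t%s" % (fields[0], fields[4]))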