delimiting carat A in python

2019-02-25 16:46发布

问题:

I have data in form:

37101000ssd48800^A1420asd938987^A2011-09-10^A18:47:50.000^A99.00^A1^A0^A
37101000sd48801^A44557asd03082^A2011-09-06^A13:24:58.000^A42.01^A1^A0^A

So first I took it literally and tried:

line = line.split("^A")

and also

line = line.split("\\u001")

So, the issue is:

The first approach works on my local machine if I do this:

cat input.txt | python mapper.py

It runs fine locally (input.txt is the above data), but fails on hadoop streaming clusters.

Someone told me that I should use "\\u001" as the delimiter, but this is also not working, either on my local machine or on clusters.

For hadoop folks:

If I debug it on local using:

cat input.txt | python mapper.py | sort | python reducer.py

This runs just fine, if I use "^A" as delimiter on local but I am getting errors when running on clusters, and the error code is not too helpful either...

Any suggestions on how can i debug this?
Thanks

回答1:

If the original data uses a control-A as a delimiter, and it's just being printed as ^A in whatever you're using to list the data, you have two choices:

Pipe whatever you use the list the data into a Python script that uses split('^A').
Just use split('\u001') to split on actual control-A values.

The latter is almost always going to be what you really want. The reason this didn't work from you is that you wrote split('\\u001'), escaping the backslash, so you're splitting on the literal string \u001 rather than on control-A.

If the original data actually has ^A (a caret followed by an A) as the delimiter, just use split('^A').