I am having a dataframe that contains columns named id, country_name, location and total_deaths. While doing data cleaning process, I came across a value in a row that has '\r'
attached. Once I complete cleaning process, I store the resulting dataframe in destination.csv file. Since the above particular row has \r
attached, it always creates a new row.
id 29
location Uttar Pradesh\r
country_name India
total_deaths 20
I want to remove \r
. I tried df.replace({'\r': ''}, regex=True)
. It isn't working for me.
Is there any other solution. Can somebody help?
Edit:
In the above process, I am iterating over df to see if \r
is present. If present, then need to replace. Here row.replace()
or row.str.strip()
doesn't seem to be working or I could be doing it in a wrong way.
I don't want specify the column name or row number while using replace()
. Because I can't be certain that only 'location' column will be having \r
. Please find the code below.
count = 0
for row_index, row in df.iterrows():
if re.search(r"\\r", str(row)):
print type(row) #Return type is pandas.Series
row.replace({r'\\r': ''} , regex=True)
print row
count += 1
Another solution is use str.strip
:
df['29'] = df['29'].str.strip(r'\\r')
print df
id 29
0 location Uttar Pradesh
1 country_name India
2 total_deaths 20
If you want use replace
, add r
and one \
:
print df.replace({r'\\r': ''}, regex=True)
id 29
0 location Uttar Pradesh
1 country_name India
2 total_deaths 20
In replace
you can define column for replacing like:
print df
id 29
0 location Uttar Pradesh\r
1 country_name India
2 total_deaths\r 20
print df.replace({'29': {r'\\r': ''}}, regex=True)
id 29
0 location Uttar Pradesh
1 country_name India
2 total_deaths\r 20
print df.replace({r'\\r': ''}, regex=True)
id 29
0 location Uttar Pradesh
1 country_name India
2 total_deaths 20
EDIT by comment:
import pandas as pd
df = pd.read_csv('data_source_test.csv')
print df
id country_name location total_deaths
0 1 India New Delhi 354
1 2 India Tamil Nadu 48
2 3 India Karnataka 0
3 4 India Andra Pradesh 32
4 5 India Assam 679
5 6 India Kerala 128
6 7 India Punjab 0
7 8 India Mumbai, Thane 1
8 9 India Uttar Pradesh\r\n 20
9 10 India Orissa 69
print df.replace({r'\r\n': ''}, regex=True)
id country_name location total_deaths
0 1 India New Delhi 354
1 2 India Tamil Nadu 48
2 3 India Karnataka 0
3 4 India Andra Pradesh 32
4 5 India Assam 679
5 6 India Kerala 128
6 7 India Punjab 0
7 8 India Mumbai, Thane 1
8 9 India Uttar Pradesh 20
9 10 India Orissa 69
If need replace only in column location
:
df['location'] = df.location.str.replace(r'\r\n', '')
print df
id country_name location total_deaths
0 1 India New Delhi 354
1 2 India Tamil Nadu 48
2 3 India Karnataka 0
3 4 India Andra Pradesh 32
4 5 India Assam 679
5 6 India Kerala 128
6 7 India Punjab 0
7 8 India Mumbai, Thane 1
8 9 India Uttar Pradesh 20
9 10 India Orissa 69
use str.replace
, you need to escape the sequence so it treats it as a carriage return rather than the literal \r
:
In [15]:
df['29'] = df['29'].str.replace(r'\\r','')
df
Out[15]:
id 29
0 location Uttar Pradesh
1 country_name India
2 total_deaths 20
The below code removes \n tab spaces, \n new line and \r carriage return and is great for condensing datum into one row. The answer was taken from https://gist.github.com/smram/d6ded3c9028272360eb65bcab564a18a
df.replace(to_replace=[r"\\t|\\n|\\r", "\t|\n|\r"], value=["",""], regex=True, inplace=<INPLACE>)
Just make df equal to the df.replace code line and then print df.
df=df.replace({'\r': ''}, regex=True)
print(df)