I have two DataFrames and I want to perform the same list of cleaning operations on both.
I realized I can merge them into one and do everything in one pass, but I am still curious why this method is not working:
import numpy as np
import pandas as pd

test_1 = pd.DataFrame({
    "A": [1, 8, 5, 6, 0],
    "B": [15, 49, 34, 44, 63]
})
test_2 = pd.DataFrame({
    "A": [np.nan, 3, 6, 4, 9, 0],
    "B": [-100, 100, 200, 300, 400, 500]
})
Let's assume I want to keep only the rows without NaNs. I tried:
for df in [test_1, test_2]:
    df = df[pd.notnull(df["A"])]
but test_2 is left untouched. On the other hand, if I do:
test_2 = test_2[pd.notnull(test_2["A"])]
Now the first row is gone.
All these slicing/indexing operations create new objects (views or copies of the original DataFrame). Inside the loop you then reassign the name df to that new object, which only rebinds the loop variable; the originals are not touched at all.
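A minimal sketch of that rebinding, using the test_2 frame from the question. The names same_before and same_after are only there for illustration: they record whether the loop variable still refers to the original object.

```python
import numpy as np
import pandas as pd

test_2 = pd.DataFrame({
    "A": [np.nan, 3, 6, 4, 9, 0],
    "B": [-100, 100, 200, 300, 400, 500],
})

for df in [test_2]:
    same_before = df is test_2        # the loop variable starts as the original object
    df = df[pd.notnull(df["A"])]      # boolean indexing returns a new DataFrame
    same_after = df is test_2         # df now names the filtered result instead

print(same_before, same_after)  # True False
print(len(test_2), len(df))     # 6 5
```

The filtered frame exists, but only under the name df; test_2 still has all six rows.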
Option 1
dropna(..., inplace=True)
Try an in-place dropna call; this should modify the original object in place:
df_list = [test_1, test_2]
for df in df_list:
    df.dropna(subset=['A'], inplace=True)
Note, this is one of the few times that I will ever recommend an in-place modification, because of this use case in particular.
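As a quick check, here is a self-contained sketch (rebuilding the two frames from the question) showing that the in-place call changes the objects that test_1 and test_2 refer to, so no reassignment is needed.

```python
import numpy as np
import pandas as pd

test_1 = pd.DataFrame({"A": [1, 8, 5, 6, 0], "B": [15, 49, 34, 44, 63]})
test_2 = pd.DataFrame({
    "A": [np.nan, 3, 6, 4, 9, 0],
    "B": [-100, 100, 200, 300, 400, 500],
})

for df in [test_1, test_2]:
    # With inplace=True, dropna mutates the object and returns None,
    # so there is nothing to assign back to df.
    df.dropna(subset=["A"], inplace=True)

print(len(test_1), len(test_2))  # 5 5
print(test_2.index.tolist())     # [1, 2, 3, 4, 5]
```

The original index labels are kept; call reset_index(drop=True) afterwards if you want them renumbered.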
Option 2
enumerate with reassignment
Alternatively, you may reassign to the list:
for i, df in enumerate(df_list):
    df_list[i] = df.dropna(subset=['A'])  # or: df_list[i] = df[df.A.notnull()]
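One thing to keep in mind with this option: only the entries of df_list are replaced, so the names test_1 and test_2 still refer to the unfiltered frames. If you want the names updated as well, one possibility (assuming that is what you are after) is to unpack the list afterwards. The rows_before_unpack and rows_after_unpack variables are just illustrative.

```python
import numpy as np
import pandas as pd

test_1 = pd.DataFrame({"A": [1, 8, 5, 6, 0], "B": [15, 49, 34, 44, 63]})
test_2 = pd.DataFrame({
    "A": [np.nan, 3, 6, 4, 9, 0],
    "B": [-100, 100, 200, 300, 400, 500],
})

df_list = [test_1, test_2]
for i, df in enumerate(df_list):
    df_list[i] = df.dropna(subset=["A"])  # store the filtered copy in the list

rows_before_unpack = len(test_2)  # the name test_2 still points at the original
test_1, test_2 = df_list          # rebind the names to the cleaned frames
rows_after_unpack = len(test_2)

print(rows_before_unpack, rows_after_unpack)  # 6 5
```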
You are modifying copies of the dataframes rather than the original dataframes.
One way to deal with this issue is to use a dictionary. As a convenience, you can use pd.DataFrame.pipe together with dictionary comprehensions to modify your dictionaries.
def remove_nulls(df):
    return df[df['A'].notnull()]
dfs = dict(enumerate([test_1, test_2]))
dfs = {k: v.pipe(remove_nulls) for k, v in dfs.items()}
print(dfs)
# {0:    A   B
# 0  1  15
# 1  8  49
# 2  5  34
# 3  6  44
# 4  0  63,
# 1:      A    B
# 1  3.0  100
# 2  6.0  200
# 3  4.0  300
# 4  9.0  400
# 5  0.0  500}
Note: In your result, dfs[1]['A'] remains float: this is because np.nan is considered float and we have not triggered a conversion to int.
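If you do want integers back, a sketch of one way to trigger that conversion once the NaN rows are gone is shown below. The name cleaned is illustrative, and the cast is only safe because no missing values remain in the column.

```python
import numpy as np
import pandas as pd

test_2 = pd.DataFrame({
    "A": [np.nan, 3, 6, 4, 9, 0],
    "B": [-100, 100, 200, 300, 400, 500],
})

cleaned = test_2[test_2["A"].notnull()]
print(cleaned["A"].dtype)     # float64

# Cast only column A back to an integer dtype; this would raise if NaN were still present.
cleaned = cleaned.astype({"A": int})
print(cleaned["A"].tolist())  # [3, 6, 4, 9, 0]
```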
By using pd.concat:
[x.reset_index(level=0, drop=True)
 for _, x in pd.concat([test_1, test_2], keys=[0, 1]).dropna().groupby(level=0)]
Out[376]:
[     A   B
 0  1.0  15
 1  8.0  49
 2  5.0  34
 3  6.0  44
 4  0.0  63,      A    B
 1  3.0  100
 2  6.0  200
 3  4.0  300
 4  9.0  400
 5  0.0  500]
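Two side effects of the concat route are worth knowing. Column A of the first frame comes back as float (1.0, 8.0, ...), because concatenating with a frame that contains NaN upcasts the whole column, and a bare dropna() drops rows with a missing value in any column, not just A. A hedged variation that restricts the check to A and casts back to int could look like this. The names combined, cleaned and parts are illustrative.

```python
import numpy as np
import pandas as pd

test_1 = pd.DataFrame({"A": [1, 8, 5, 6, 0], "B": [15, 49, 34, 44, 63]})
test_2 = pd.DataFrame({
    "A": [np.nan, 3, 6, 4, 9, 0],
    "B": [-100, 100, 200, 300, 400, 500],
})

# Stack both frames under an outer index level (0 and 1) so they can be split again.
combined = pd.concat([test_1, test_2], keys=[0, 1])
print(combined["A"].dtype)  # float64

# Only look at column A for missing values, then restore an integer dtype.
cleaned = combined.dropna(subset=["A"]).astype({"A": int})

# Split back into one frame per key and drop the outer index level.
parts = [part.droplevel(0) for _, part in cleaned.groupby(level=0)]
print([len(p) for p in parts])  # [5, 5]
```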