Dropna isn't dropping, fillna isn't filling

Posted 2019-07-31 17:23

Question:

I have a case where I am adding data from one dataframe to another, but I can't get rid of the nan values.

Example data

import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    {
        'Journal': ['US Drug standards.', 'Acta veterinariae.', 'Bulletin of big toe science.', 'The UK journal of dermatology.'],
        'ISSN_1': ['0096-0225', '0567-8315', '0007-4977', '0007-0963'],
        'ISSN_2': ['0096-0225', 'nan', '0007-4977', '0007-0963'],
        'ISSN_3': ['nan', '1820-7448', 'nan', '0366-077X'],
        'ISSN_4': ['nan', '0567-8315', 'nan', '1365-2133']
    }
)

# put the Journal column first, followed by the ISSN columns
df1 = df1[['Journal'] + [c for c in df1.columns if c != 'Journal']]
df2 = pd.DataFrame(
    {
        'Full Journal Title': ['Drug standards.', 'Acta veterinaria.', 'Bulletin of marine science.', 'The British journal of dermatology.'],
        'Abbreviated Title': ['DStan', 'Avet', 'Marsci', 'BritSkin'],
        'Total Cites': ['223', '444', '324', '166'],
        'ISSN': ['0096-0225', '0567-8315', '0007-4977', '0007-0963']
    }
)

# build a list of each journal's ISSNs from df1, match them to df2 by ISSN,
# and join each match into a comma-separated string in a new column
xx = df1.set_index('Journal').values.tolist()
df2['New'] = df2.ISSN.apply(lambda x: [y for y in xx if x in y])
df2 = df2[df2.New.apply(len) > 0]
df2['New'] = df2.New.apply(pd.Series)[0].apply(lambda x: ','.join(x))

I have tried a replace: df2 = df2.replace(np.nan, '', regex=True)

I have tried dropna: print(df2.dropna(subset=['New']))

I have tried fillna: print(df2.fillna(''))

I have tried filtering with a list comprehension: xx = [value for value in xx if str(value) != 'nan']

No matter what I try, the "New" column is still full of nans.

0                0096-0225,0096-0225,nan,nan
1          0567-8315,nan,1820-7448,0567-8315
2                0007-4977,0007-4977,nan,nan
3    0007-0963,0007-0963,0366-077X,1365-2133

I want them skipped or dropped. I only want the valid ISSNs.

Thanks in advance for the help.

Answer 1:

There are a few things going on here. The first is that the question shows the string 'nan' in the dataframe, while the comments suggest these should actually be NaN values (a string versus a true null).
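A minimal sketch of that distinction, using a throwaway series rather than the data above:

import numpy as np
import pandas as pd

s = pd.Series(['nan', np.nan])
print(s.isna())       # False, True -- only the real NaN counts as missing
print(s.fillna('-'))  # the string 'nan' is left alone; only the real NaN gets filled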

The second is that you are storing lists, and then strings built from those lists, in a dataframe, which is generally discouraged for precisely the reason you are running into: it often leads to unexpected behavior.

I will address the question as it was posed, although you should be able to adapt this to real NaN values as well (a sketch of that variation is at the end).

The code that is causing the issue is:

xx = df1.set_index('Journal').values.tolist()
df2['New'] = df2.ISSN.apply(lambda x: [y for y in xx if x in y])
df2 = df2[df2.New.apply(len) > 0]
df2['New'] = df2.New.apply(pd.Series)[0].apply(lambda x: ','.join(x))

The second line here adds, for each ISSN in df2, the entire matching row from xx, including its 'nan' entries, to df2['New']; the subsequent lines then turn that match into a list and finally into a comma-joined string. Once those values are buried inside a list or a string, you are not going to be able to reach them with the normal pandas methods, because the cell itself is a perfectly valid non-null object.
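To illustrate, here is one of the resulting cells on its own (and, purely for illustration, the kind of string surgery you would be reduced to at that point; the cleaner fix follows below):

import pandas as pd

cell = pd.Series(['0096-0225,0096-0225,nan,nan'])
print(cell.isna())      # False -- to pandas this is just one non-null string
print(cell.fillna(''))  # so fillna (and dropna) have nothing to act on
print(cell.str.replace(',nan', '', regex=False))  # any post-hoc cleanup would have to be string-based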

My suggestion would be to remove them from xx first, so they never show up in df2 at all:

xx = df1.set_index('Journal').values.tolist()
# get rid of the 'nan' strings here, while they are still separate list items
xx = [[y for y in x if y != 'nan'] for x in xx]
df2['New'] = df2.ISSN.apply(lambda x: [y for y in xx if x in y])
df2 = df2[df2.New.apply(len) > 0]
df2['New'] = df2.New.apply(pd.Series)[0].apply(lambda x: ','.join(x))

Note that the second line here removes the 'nan' strings at a point where they are still easy to get at.
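With the sample frames above, the New column should then come out as something like:

0                        0096-0225,0096-0225
1              0567-8315,1820-7448,0567-8315
2                        0007-4977,0007-4977
3    0007-0963,0007-0963,0366-077X,1365-2133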

This should get you what you need, though once again I would caution against storing lists in dataframes if you can avoid it, and be sure to use real NaN values rather than the string 'nan'. Hope this helps!
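As an aside, if the missing values in df1 ever arrive as real NaN rather than the string 'nan', the same approach carries over; a minimal sketch of that variation, with everything else unchanged:

xx = df1.set_index('Journal').values.tolist()
# keep only the entries that are actually present (drops np.nan / None)
xx = [[y for y in x if pd.notna(y)] for x in xx]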