I am reading two columns of a CSV file using pandas read_csv() and then assigning the values to a dictionary. The columns contain strings of numbers and letters. Occasionally a cell is empty. In my opinion, the value read into that dictionary entry should be None, but nan is assigned instead. Surely None is more descriptive of an empty cell, as it has a null value, whereas nan just says that the value read is not a number.
Is my understanding correct? What IS the difference between None and nan? Why is nan assigned instead of None?
Also, my dictionary check for any empty cells has been using numpy.isnan():
    for k, v in my_dict.iteritems():
        if np.isnan(v):
But this gives me an error saying that I cannot use this check for v. I guess it is because an integer or float variable, not a string, is meant to be used. If this is true, how can I check v for an "empty cell"/nan case?
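For reference, the error described above can be reproduced with a small sketch. The dictionary contents and the helper name isnan_fails are hypothetical stand-ins for the values read from the CSV, and the loop uses .items(), the Python 3 spelling of .iteritems().

```python
import numpy as np

# Hypothetical stand-in for the dictionary built from the CSV:
# strings for filled cells, float nan for an empty cell.
my_dict = {"a": "x12", "b": float("nan")}


def isnan_fails(value):
    """Return True if np.isnan raises TypeError for this value."""
    try:
        np.isnan(value)
    except TypeError:
        return True
    return False


for k, v in my_dict.items():
    # np.isnan only accepts numeric input, so the string entry raises
    print(k, repr(v), "np.isnan raises TypeError:", isnan_fails(v))
```

The string entry triggers the TypeError, while the float nan entry does not.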
Below are the differences:
nan belongs to the class float.
None belongs to the class NoneType.
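A quick way to see this for yourself is to inspect the types directly. This is only a small sketch; the variable names are illustrative.

```python
import numpy as np

nan_type = type(np.nan)
none_type = type(None)

print(nan_type)    # <class 'float'>
print(none_type)   # <class 'NoneType'>

# NaN compares unequal to everything, including itself,
# whereas None is a singleton that is normally tested with "is".
print(np.nan == np.nan)  # False
print(None is None)      # True
```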
I found the below article very helpful: https://medium.com/analytics-vidhya/dealing-with-missing-values-nan-and-none-in-python-6fc9b8fb4f31
NaN is used consistently as a placeholder for missing data in pandas, and that consistency is good. I usually read/translate NaN as "missing". Also see the 'working with missing data' section in the docs.
Wes writes in the docs 'choice of NA-representation':
Note: the "gotcha" that integer Series containing missing data are upcast to floats.
In my opinion the main reason to use NaN (over None) is that it can be stored with numpy's float64 dtype rather than the less efficient object dtype; see NA type promotions.
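The dtype point can be illustrated with a short sketch. The Series contents and variable names here are made up for the example; the only thing being shown is which dtype pandas picks in each case.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])                        # no missing data
with_nan = pd.Series([1, 2, np.nan])               # missing value stored as NaN
with_none = pd.Series([1, 2, None])                # None is converted to NaN here
forced_object = pd.Series([1, 2, None], dtype=object)  # None kept as-is

print(ints.dtype)           # an integer dtype
print(with_nan.dtype)       # float64: the integers were upcast to floats
print(with_none.dtype)      # float64
print(forced_object.dtype)  # object
```

Only the object-dtype Series keeps None itself, at the cost of storing generic Python objects instead of a float64 array.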
Jeff comments (below) on this:
Saying that, many operations may still work just as well with None vs NaN (but perhaps are not supported i.e. they may sometimes give surprising results):
To answer the second question:
You should be using pd.isnull and pd.notnull to test for missing data (NaN).
The function isnan() checks whether a value is "Not A Number" and returns True only when it is; for example, isnan(2) would return False. The conditional myVar is not None returns whether or not the variable holds a value other than None. Your numpy array uses isnan() because it is intended to be an array of numbers, and it initializes all elements of the array to NaN; these elements are considered "empty".
NaN stands for "not a number". None might stand for anything. NaN can be used as a numerical value in mathematical operations, while None cannot (or at least shouldn't). NaN is a numeric value, as defined in the IEEE 754 floating-point standard. None is an internal Python type (NoneType) and would be more like "inexistent" or "empty" than "numerically invalid" in this context. The main "symptom" of that is that if you perform, say, an average or a sum on an array containing NaN, even a single one, you get NaN as a result.
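The checks discussed above can be compared side by side in a small sketch. The dictionary my_dict and its contents are hypothetical, standing in for values read from a CSV file, and the loop uses Python 3's .items().

```python
import numpy as np
import pandas as pd

# Hypothetical dictionary: a string for a filled cell, NaN or None for empty ones
my_dict = {"a": "x12", "b": np.nan, "c": None}

# pd.isnull accepts any scalar, so the string entry does not raise
missing_keys = [k for k, v in my_dict.items() if pd.isnull(v)]
print(missing_keys)  # ['b', 'c']

# np.isnan is meant for numeric input only
print(bool(np.isnan(2)))  # False

# A single NaN propagates through ordinary reductions
arr = np.array([1.0, np.nan, 3.0])
total = arr.sum()
average = arr.mean()
print(total, average)  # nan nan
```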
On the other hand, you cannot perform mathematical operations using None as an operand. So, depending on the case, you could use None as a way to tell your algorithm not to consider invalid or inexistent values in computations. That would mean the algorithm should test each value to see if it is None.
Numpy has some functions that keep NaN values from contaminating your results, such as nansum and nan_to_num.
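As a rough illustration of both approaches, the sketch below uses made-up sample data. np.nansum and np.nan_to_num are the NumPy functions named above; the explicit None filter is just one possible way to skip missing values by hand.

```python
import numpy as np

values = np.array([1.0, np.nan, 3.0])

# nansum treats NaN as zero when summing
nan_safe_total = np.nansum(values)
print(nan_safe_total)  # 4.0

# nan_to_num replaces NaN with 0.0 by default
cleaned = np.nan_to_num(values)
print(cleaned)  # [1. 0. 3.]

# None cannot be used as an operand in arithmetic
try:
    1 + None
    none_raises = False
except TypeError:
    none_raises = True
print(none_raises)  # True

# With None as the "missing" marker, each value has to be tested explicitly
raw = [1.0, None, 3.0]
filtered_total = sum(v for v in raw if v is not None)
print(filtered_total)  # 4.0
```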