I want to convert a table, represented as a list of lists, into a Pandas DataFrame. As an extremely simplified example:
import pandas as pd
a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a)
What is the best way to convert the columns to the appropriate types, in this case columns 2 and 3 into floats? Is there a way to specify the types while converting to a DataFrame? Or is it better to create the DataFrame first and then loop through the columns to change the type for each column? Ideally I would like to do this in a dynamic way because there can be hundreds of columns and I don't want to specify exactly which columns are of which type. All I can guarantee is that each column contains values of the same type.
How about this?
How about creating two dataframes, each with different data types for their columns, and then appending them together?
Results
After the dataframe is created, you can populate it with floating point variables in the 1st column, and strings (or any data type you desire) in the 2nd column.
Here is a function that takes as its arguments a DataFrame and a list of columns and coerces all data in the columns to numbers.
So, for your example:
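A minimal sketch of such a function, applied to the question's data. The function name and the column labels are my own placeholders, and errors='coerce' (unparseable values become NaN) is an assumption about what "coerces" means here:

```python
import pandas as pd

def coerce_columns_to_numeric(df, column_list):
    """Return a copy of df with the listed columns coerced to numbers."""
    result = df.copy()
    # errors='coerce' turns values that cannot be parsed into NaN
    result[column_list] = result[column_list].apply(pd.to_numeric, errors='coerce')
    return result

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
# Hypothetical column names; the question's table has none
df = pd.DataFrame(a, columns=['col1', 'col2', 'col3'])

converted = coerce_columns_to_numeric(df, ['col2', 'col3'])
print(converted.dtypes)
```

Working on a copy keeps the original DataFrame untouched, which makes it easier to compare before and after.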
When I've only needed to specify specific columns, and I want to be explicit, I've used (per DOCS LOCATION):
So, using the original question, but providing column names to it ...
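Something along these lines, assuming the explicit approach meant here is astype() with a dict that maps column names to dtypes. The column names 'one', 'two' and 'three' are made up for illustration:

```python
import pandas as pd

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
# Hypothetical column names added to the question's data
df = pd.DataFrame(a, columns=['one', 'two', 'three'])

# astype() accepts a dict of column name -> target dtype;
# columns not listed in the dict keep their current dtype
df = df.astype({'two': 'float64', 'three': 'float64'})
print(df.dtypes)
```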
The code below changes the datatype of a column.
In place of the data type, you can give whichever type you want, such as str, float or int.
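A short sketch of that idea, assuming it means selecting the columns and assigning the result of astype() back. The column names 'x' and 'y' are placeholders:

```python
import pandas as pd

df = pd.DataFrame({'x': ['1', '2', '3'], 'y': ['4.5', '5.5', '6.5']})

# Swap 'float' for str, int, etc. as needed
df[['x', 'y']] = df[['x', 'y']].astype('float')
print(df.dtypes)
```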
You have three main options for converting types in pandas.
1. to_numeric()
The best way to convert one or more columns of a DataFrame to numeric values is to use pandas.to_numeric(). This function will try to change non-numeric objects (such as strings) into integers or floating point numbers as appropriate.
Basic usage
The input to to_numeric() is a Series or a single column of a DataFrame.
A new Series is returned, so remember to assign this output to a variable or column name to continue using it:
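A small sketch of the basic usage described above. The sample values and the column name 'col' are my own:

```python
import pandas as pd

# A Series mixing strings and numbers, so it has the object dtype
s = pd.Series(['8', 6, '7.5', 3, '0.9'])

converted = pd.to_numeric(s)   # returns a new Series; s itself is unchanged
print(converted.dtype)

# Assign the result back to keep the converted column
df = pd.DataFrame({'col': ['1', '2', '3']})
df['col'] = pd.to_numeric(df['col'])
print(df['col'].dtype)
```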
You can also use it to convert multiple columns of a DataFrame via the apply() method. As long as your values can all be converted, that's probably all you need.
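For the DataFrame from the question, the apply() approach might look like this. The columns have the default integer labels 0, 1 and 2 because none were given:

```python
import pandas as pd

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a)

# Convert only the two columns that hold numeric strings
df[[1, 2]] = df[[1, 2]].apply(pd.to_numeric)
print(df.dtypes)
```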
Error handling
But what if some values can't be converted to a numeric type?
to_numeric() also takes an errors keyword argument that allows you to force non-numeric values to be NaN, or simply ignore columns containing these values.
Here's an example using a Series of strings s which has the object dtype. The default behaviour is to raise if it can't convert a value. In this case, it can't cope with the string 'pandas':
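A sketch of the default behaviour. The values other than 'pandas' are my own filler:

```python
import pandas as pd

s = pd.Series(['1', '2', '4.7', 'pandas', '10'])

try:
    pd.to_numeric(s)          # default is errors='raise'
    raised = False
except ValueError as exc:
    raised = True
    print(f"conversion failed: {exc}")
```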
Rather than fail, we might want 'pandas' to be considered a missing/bad numeric value. We can coerce invalid values to NaN by passing errors='coerce'.
The third option for errors is just to ignore the operation if an invalid value is encountered (errors='ignore').
This last option is particularly useful when you want to convert your entire DataFrame, but don't know which of the columns can be converted reliably to a numeric type. In that case just write df.apply(pd.to_numeric, errors='ignore').
The function will be applied to each column of the DataFrame. Columns that can be converted to a numeric type will be converted, while columns that cannot (e.g. they contain non-digit strings or dates) will be left alone.
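A sketch of both ideas. As far as I know, errors='ignore' is deprecated in recent pandas releases (2.2 onward), so for the whole-DataFrame case this uses a small try/except helper instead. The helper name and the sample data are hypothetical:

```python
import pandas as pd

s = pd.Series(['1', '2', '4.7', 'pandas', '10'])
coerced = pd.to_numeric(s, errors='coerce')   # 'pandas' becomes NaN
print(coerced)

def to_numeric_where_possible(column):
    """Hypothetical helper: convert a column, or return it unchanged on failure."""
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        return column

df = pd.DataFrame({'name': ['a', 'b', 'x'],
                   'v1': ['1.2', '70', '5'],
                   'v2': ['4.2', '0.03', '0']})

# apply() calls the helper once per column
df = df.apply(to_numeric_where_possible)
print(df.dtypes)
```

The helper mimics what the ignore option does per column: numeric-looking columns are converted and everything else passes through untouched.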
Downcasting
By default, conversion with to_numeric() will give you either an int64 or float64 dtype (or whatever integer width is native to your platform).
That's usually what you want, but what if you wanted to save some memory and use a more compact dtype, like float32, or int8?
to_numeric() gives you the option to downcast to either 'integer', 'signed', 'unsigned' or 'float'. Here's an example for a simple series s of integer type.
Downcasting to 'integer' uses the smallest possible integer that can hold the values:
Downcasting to 'float' similarly picks a smaller than normal floating type:
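A sketch of both downcasts on a small integer Series. The values are chosen for illustration:

```python
import pandas as pd

s = pd.Series([1, 2, -7])

as_int = pd.to_numeric(s, downcast='integer')   # smallest signed integer that fits
as_float = pd.to_numeric(s, downcast='float')   # smallest float type, from float32 up

print(as_int.dtype, as_float.dtype)
```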
2. astype()
The astype() method enables you to be explicit about the dtype you want your DataFrame or Series to have. It's very versatile in that you can try and go from one type to any other.
Basic usage
Just pick a type: you can use a NumPy dtype (e.g. np.int16), some Python types (e.g. bool), or pandas-specific types (like the categorical dtype).
Call the method on the object you want to convert and astype() will try and convert it for you.
Notice I said "try" - if astype() does not know how to convert a value in the Series or DataFrame, it will raise an error. For example, if you have a NaN or inf value you'll get an error trying to convert it to an integer.
As of pandas 0.20.0, this error can be suppressed by passing errors='ignore'. Your original object will be returned untouched.
Be careful
astype() is powerful, but it will sometimes convert values "incorrectly". For example, take a Series of small integers that includes -7. These are small integers, so how about converting to an unsigned 8-bit type to save memory?
The conversion worked, but the -7 was wrapped round to become 249 (i.e. 2^8 - 7)!
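The astype() behaviour described in this section can be sketched as follows. The sample values are my own, apart from the -7 mentioned above:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, -7])

# Basic usage: Python type, NumPy dtype, pandas-specific dtype
as_float = s.astype(float)
as_small = s.astype(np.int16)
as_cat = s.astype('category')

# NaN has no representation in a plain integer dtype, so this raises
try:
    pd.Series([1.0, np.nan]).astype(int)
    raised = False
except ValueError:
    raised = True

# Be careful: the negative value silently wraps around in an unsigned type
wrapped = s.astype(np.uint8)
print(wrapped.tolist())

# to_numeric() declines to downcast to unsigned when negatives are present
safe = pd.to_numeric(s, downcast='unsigned')
print(safe.dtype)
```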
Trying to downcast using pd.to_numeric(s, downcast='unsigned') instead could help prevent this error.
3. infer_objects()
Version 0.21.0 of pandas introduced the method infer_objects() for converting columns of a DataFrame that have an object datatype to a more specific type (soft conversions).
For example, take a DataFrame with two columns of object type. One holds actual integers and the other holds strings representing integers.
Using infer_objects(), you can change the type of column 'a' to int64.
Column 'b' has been left alone since its values were strings, not integers. If you wanted to try and force the conversion of both columns to an integer type, you could use df.astype(int) instead.
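A sketch of the infer_objects() example. The values in the two columns are made up:

```python
import pandas as pd

# Both columns start out with the object dtype
df = pd.DataFrame({'a': [7, 1, 5], 'b': ['3', '2', '1']}, dtype='object')

inferred = df.infer_objects()
print(inferred.dtypes)   # 'a' is inferred as an integer column; 'b' still holds strings

# Forcing both columns to integers instead
forced = df.astype(int)
print(forced.dtypes)
```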