I am importing study data into a Pandas data frame using read_csv.
My subject codes are 6-digit numbers encoding, among other things, the day of birth. For some of my subjects this results in a code with a leading zero (e.g. "010816").
When I import into Pandas, the leading zero is stripped off and the column is formatted as int64.
Is there a way to import this column unchanged, maybe as a string?
I tried using a custom converter for the column, but it does not work - it seems as if the custom conversion takes place before Pandas converts to int.
I don't think you can specify a column type the way you want (unless there have been changes recently, and assuming the 6-digit number is not a date that you can convert to datetime). You could try using np.genfromtxt() and create the DataFrame from there.
EDIT: Take a look at Wes McKinney's blog, there might be something for you. It seems that a new parser is coming with pandas 0.10 in November.
As indicated in this question/answer by Lev Landau, there could be a simple solution: use the converters option for a certain column in the read_csv function.
You can find more options of the read_csv function in the pandas.io.parsers.read_csv documentation.
Let's say I have a CSV file projects.csv like below:
For example, the code below trims the leading zeros:
Result:
Solution code example:
Required result:
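The original snippets for this answer are not shown, so here is a minimal sketch of the converters approach it describes. The file contents and the column names project_name and project_id are assumptions made for illustration, and an in-memory string stands in for projects.csv.

```python
import io

import pandas as pd

# Hypothetical stand-in for projects.csv; the real file contents are not shown.
csv_text = (
    "project_name,project_id\n"
    "Some Project,000245\n"
    "Another Project,000478\n"
)

# Default parsing infers an integer column, so the leading zeros are lost.
default_df = pd.read_csv(io.StringIO(csv_text))

# A converter receives each raw field as a string; returning it unchanged
# keeps the leading zeros for that column only.
kept_df = pd.read_csv(io.StringIO(csv_text), converters={"project_id": str})

print(default_df["project_id"].tolist())  # [245, 478]
print(kept_df["project_id"].tolist())     # ['000245', '000478']
```

Other columns are still type-inferred as usual, since the converter applies only to the column named in the dictionary.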
Here is a shorter, robust and fully working solution:
Simply define a mapping (dictionary) between variable names and desired data types:
Use that mapping with pd.read_csv():
Et voilà!
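The two steps above might look roughly like this. The column names subject_code and score and the sample rows are assumptions for illustration, and an in-memory string is used in place of a real file.

```python
import io

import pandas as pd

# Hypothetical sample data; the column names are assumptions.
csv_text = (
    "subject_code,score\n"
    "010816,3.5\n"
    "120790,4.0\n"
)

# Mapping between column names and the desired data type for each.
dtype_map = {"subject_code": str, "score": float}

# Pass the mapping through the dtype parameter of read_csv.
df = pd.read_csv(io.StringIO(csv_text), dtype=dtype_map)

print(df["subject_code"].tolist())  # ['010816', '120790']
```

Columns left out of the mapping are still inferred by pandas, so only the columns that need protection have to be listed.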
If you have a lot of columns and don't know which ones contain leading zeros that might be missed, or you just need to automate your code, you can do the following:
You could also do:
By doing this you will have all your columns as strings and you won't lose any leading zeros.
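The original snippets are missing here, so this is only a sketch of one way to read every column as a string, assuming dtype=str is passed to read_csv. The column names and rows are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data with leading zeros in more than one column.
csv_text = (
    "subject_code,site,visit\n"
    "010816,007,1\n"
    "120790,012,2\n"
)

# A single type passed as dtype is applied to every column.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(df["subject_code"].tolist())  # ['010816', '120790']
print(df["site"].tolist())          # ['007', '012']

# Columns that really are numeric can be converted back afterwards.
df["visit"] = df["visit"].astype(int)
```

The trade-off is that no column is numeric after loading, so any arithmetic needs an explicit conversion like the last line.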