I am doing a health science course where R or Stata are recommended. I'm trying to use Python / Numpy / Pandas instead as I wish to use it in future for financial time series analysis.
The data was Stata format so I copied the fields and saved them as a CSV
.
All fields imports are fine except that there are a number of columns of Yes/No some of which have blank fields.
Import command is
fhs = pd.io.parsers.read_csv('F:\\BioStatistics\\fds\\fhs_c2.csv', header=0, index_col=0)
If there is a blank field the dtype is object (makes sense)
If there are no blanks some columns convert to TRUE/FALSE
, others leave as Yes/No
but dtype is bool. Any idea why?
I want all to by one dtype and expressed one way for viewing + stat analysis.
I have achieve this by adding a row at the beginning with blank cells for the Boolean columns that had no spaces - so everything becomes an object. Then I use
fhs = fhs.drop([1002])
to drop that row and data types are still good.
I'd love to save it without this row and just be able to load the data each time with "correct" types but don't know if it possible when some of the columns will have all yes or no, and some will have blank cells. Is it possible?
Thanks, Sorry about the newbie question.
Example:
Importing
C1 C2 C3
R1 Yes Yes No
R2 No No No
R3 Yes Yes
R4 Yes Yes Yes
first column comes into df as Yes, No, Yes, Yes type bool xxxx below
2nd column comes into df as Yes, No, NaN, Yes type object
3rd column comes into df as FALSE, FALSE, TRUE, TRUE type bool
Damn. Just checked. I was wrong. If its yes and no then the column type is object.
I'd like to tell it when importing to make them all object and stick with yes and no because: 1. I think the 2nd column must be object (as its mixed otherwise i think) 2. The data set is in yes / no and other class members will be looking at yes and no
What happened when I tried the solution.
Here's my data: link
Here's the code:
from pandas import *
import numpy as np
import pandas as pd
def convert_bool(col):
if str(col).title() == "True": #check for nan
return "Yes"
elif str(col).title() == "False":
return "No"
else:
return col
fhs = pd.read_csv('F:\\BioStatistics\\fds\\StatExport.csv', converters={"death": lambda x:convert_bool(x)}, header=0, index_col=0)
and output link