Python Pandas Mixed Boolean Yes/True and NaN Colum

2019-07-14 07:17发布

问题:

I am doing a health science course where R or Stata are recommended. I'm trying to use Python / Numpy / Pandas instead as I wish to use it in future for financial time series analysis.

The data was Stata format so I copied the fields and saved them as a CSV. All fields imports are fine except that there are a number of columns of Yes/No some of which have blank fields.

Import command is

fhs = pd.io.parsers.read_csv('F:\\BioStatistics\\fds\\fhs_c2.csv', header=0, index_col=0)

If there is a blank field the dtype is object (makes sense)

If there are no blanks some columns convert to TRUE/FALSE, others leave as Yes/No but dtype is bool. Any idea why?

I want all to by one dtype and expressed one way for viewing + stat analysis.

I have achieve this by adding a row at the beginning with blank cells for the Boolean columns that had no spaces - so everything becomes an object. Then I use fhs = fhs.drop([1002]) to drop that row and data types are still good.

I'd love to save it without this row and just be able to load the data each time with "correct" types but don't know if it possible when some of the columns will have all yes or no, and some will have blank cells. Is it possible?

Thanks, Sorry about the newbie question.

Example:

Importing

      C1    C2    C3

R1   Yes   Yes    No

R2    No    No    No

R3   Yes         Yes

R4   Yes   Yes   Yes

first column comes into df as Yes, No, Yes, Yes type bool xxxx below

2nd column comes into df as Yes, No, NaN, Yes type object

3rd column comes into df as FALSE, FALSE, TRUE, TRUE type bool

Damn. Just checked. I was wrong. If its yes and no then the column type is object.

I'd like to tell it when importing to make them all object and stick with yes and no because: 1. I think the 2nd column must be object (as its mixed otherwise i think) 2. The data set is in yes / no and other class members will be looking at yes and no

What happened when I tried the solution.

Here's my data: link

Here's the code:
from pandas import *
import numpy as np
import pandas as pd

def convert_bool(col):
    if str(col).title() ==  "True": #check for nan
        return "Yes"
    elif str(col).title() == "False":
        return "No"
    else:
        return col

fhs = pd.read_csv('F:\\BioStatistics\\fds\\StatExport.csv', converters={"death": lambda x:convert_bool(x)}, header=0, index_col=0)  

and output link

回答1:

You can use the converters field from pandas.read_csv

def convert_bool(col):
    if str(col).title() ==  "True": #check for nan
        return "YES"
    elif str(col).title() == "False":
        return "NO"
    else:
        return col
pandas.read_csv(file_in, converters={"C3": lambda x:convert_bool(x)})


回答2:

Note in upcoming pandas 0.9.2 (which includes a new file parser engine) you'll be able to do:

In [1]: paste
data = """A,B,C
Yes,No,Yes
No,Yes,Yes
Yes,,Yes
No,No,No"""

result = read_csv(StringIO(data), dtype=object)
## -- End pasted text --

In [2]: result
Out[2]: 
     A    B    C
0  Yes   No  Yes
1   No  Yes  Yes
2  Yes  NaN  Yes
3   No   No   No

In [3]: result = read_csv(StringIO(data), dtype=object, na_filter=False)

In [4]: result
Out[4]: 
     A    B    C
0  Yes   No  Yes
1   No  Yes  Yes
2  Yes       Yes
3   No   No   No

Or even (to get a NumPy array):

In [5]: result = read_csv(StringIO(data), dtype='S3', na_filter=False, as_recarray=True)

In [6]: result
Out[6]: 
array([('Yes', 'No', 'Yes'), ('No', 'Yes', 'Yes'), ('Yes', '', 'Yes'),
       ('No', 'No', 'No')], 
      dtype=[('A', '|S3'), ('B', '|S3'), ('C', '|S3')])