Problem with getting rid of specific columns [clos

Posted 2020-05-04 11:51

Question:

I have a big dataset with the following columns:

cols=['plant', 'time','date','hour','NDVI','Treatment','Line','397.01', '398.32', '399.63', '400.93', '402.24', '403.55'...,'1005']

I want to create a new DataFrame that contains the first 7 columns, skips the next 10, and then keeps all the rest.

I have done something like this:

df2=df_plants.iloc[:,10:]
df2.head()

but this cuts off the first columns, and I need them as well.

A friend recommended doing something like this:

#convert the ''numeric'' columns into float

float_cols = [float(i) for i in df_plants.columns.tolist()[4:] if type(i)==str]
df_plants.columns.values[4:] = float_cols


#detector edges removal
idx1 = (np.abs(df_plants.loc[:,float_cols].columns.values - 420))
#np.argmin(idx1)
idx2 = np.argmin(np.abs(df_plants.loc[:,float_cols].columns.values - 1005.0))

but when I apply it nothing happens, and I'm also not sure I understand his idea in the detector-edge part.

My end goal is to create a new DataFrame that contains the following columns: plant, Line, Treatment, time, and then all the numeric columns whose labels are greater than 410.

Edit: the best thing for me would be to somehow tell Python to remove any numeric column that contains negative values.

Answer 1:

I think it is better here to convert the column labels to numeric and filter with a boolean mask:

import pandas as pd

cols=['plant', 'time','date','hour','NDVI','Treatment','Line',
      '397.01', '398.32', '399.63', '400.93', '402.24', '403.55','1005']

df = pd.DataFrame(columns=cols)

num = pd.to_numeric(df.columns, errors='coerce')

df = df.loc[:, (num > 410) | num.isna()]
print (df)
Empty DataFrame
Columns: [plant, time, date, hour, NDVI, Treatment, Line, 1005]
Index: []
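The mask above keeps every non-numeric column. If only the four columns named in the question's end goal are wanted, a minimal sketch could look like the following. The toy row values and the names `meta_cols`, `band_cols`, and `df_goal` are made up for illustration and are not from the original data.

```python
import pandas as pd

# Toy frame; the values are invented purely for illustration.
df_plants = pd.DataFrame(
    [["p1", "t1", "2020-01-01", 9, 0.7, "A", "L1", 0.10, 0.20, 0.30, 0.40]],
    columns=["plant", "time", "date", "hour", "NDVI", "Treatment", "Line",
             "397.01", "409.90", "410.50", "1005"],
)

# Metadata columns named in the question's end goal (hypothetical variable name).
meta_cols = ["plant", "Line", "Treatment", "time"]

# Non-numeric labels become NaN; NaN > 410 is False, so they fall out of the mask.
num = pd.to_numeric(df_plants.columns, errors="coerce")
band_cols = df_plants.columns[num > 410].tolist()

# Select the metadata columns first, then the wavelength columns above 410.
df_goal = df_plants[meta_cols + band_cols]
print(df_goal.columns.tolist())
```

Listing the metadata columns explicitly also fixes their order, which the mask alone does not.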

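If you also want to convert the column labels to numeric:

def f(x):
    # Convert a label to float when possible; leave other labels unchanged.
    try:
        return float(x)
    except (TypeError, ValueError):
        return x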

df = df.rename(columns=f)

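def comp(x):
    # Keep numeric labels above 410; string labels cannot be compared
    # with a number, so they are kept as well.
    try:
        return x > 410
    except TypeError:
        return True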


df = df.loc[:, df.columns.map(comp)]
print (df)
Empty DataFrame
Columns: [plant, time, date, hour, NDVI, Treatment, Line, 1005.0]
Index: []
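For the edit in the question (removing any numeric column that contains negative values), one possible sketch is shown below. It assumes the spectral columns are stored with a numeric dtype; the toy values and the names `has_negative`, `to_drop`, and `df_clean` are illustrative only.

```python
import pandas as pd

# Toy data, invented for illustration.
df = pd.DataFrame({
    "plant": ["p1", "p2"],
    "Treatment": ["A", "B"],
    "397.01": [-0.02, 0.01],   # contains a negative value, so it is dropped
    "410.50": [0.12, 0.15],
    "1005": [0.30, -0.01],     # contains a negative value, so it is dropped
})

# Only numeric-dtype columns are checked; text columns are left alone.
numeric = df.select_dtypes(include="number")
has_negative = (numeric < 0).any()           # boolean Series indexed by column label
to_drop = has_negative[has_negative].index   # labels of columns with any negative value

df_clean = df.drop(columns=to_drop)
print(df_clean.columns.tolist())
```

Note that in the real dataset columns such as hour or NDVI are numeric too, so they would be checked as well; restrict `numeric` to the wavelength columns if that is not wanted.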


Answer 2:

Try using slicing:

cols = df_plants.columns.tolist()
df2=df_plants[cols[:7] + cols[17:]]
# Convert only the wavelength labels to float; the first 7 labels are not numeric.
df2.columns = cols[:7] + [float(c) for c in cols[17:]]
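Instead of hard-coding the position 17, the start of the slice can be computed from the labels. This seems to be what the `np.argmin` lines in the question were aiming at: find the position of the wavelength closest to a target value. The sketch below assumes 7 leading metadata columns and uses made-up labels and values; `n_meta`, `wavelengths`, and `start` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy columns, shortened for illustration.
cols = ["plant", "time", "date", "hour", "NDVI", "Treatment", "Line",
        "397.01", "398.32", "410.20", "420.05", "1005"]
df_plants = pd.DataFrame([range(len(cols))], columns=cols)

n_meta = 7  # number of leading metadata columns, as in the question
wavelengths = np.array([float(c) for c in cols[n_meta:]])

# Position (within the wavelength block) of the label closest to 420.
start = int(np.argmin(np.abs(wavelengths - 420)))

# Keep the metadata columns plus every wavelength column from that position on.
keep = list(range(n_meta)) + list(range(n_meta + start, len(cols)))
df2 = df_plants.iloc[:, keep]
print(df2.columns.tolist())
```

`np.argmin` returns a position rather than a label, which is why it pairs naturally with `iloc`.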