I'm quite new to programming and I'm jumping into Python to get some familiarity with data analysis and machine learning.
I am following a tutorial on backward elimination for multiple linear regression. Here is the code right now:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
# Taking care of missing data
#np.set_printoptions(threshold=100)
from sklearn.preprocessing import Imputer  # deprecated in scikit-learn 0.20+ in favour of sklearn.impute.SimpleImputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
#Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelEncoder_X = LabelEncoder()
X[:, 3] = labelEncoder_X.fit_transform(X[:, 3])
onehotencoder = OneHotEncoder(categorical_features = [3])
X = onehotencoder.fit_transform(X).toarray()
# Avoid the dummy variable trap
X = X[:, 1:]
#Splitting data in train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
#Fitting multiple Linear Regression to Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Predicting the Test set results
y_pred = regressor.predict(X_test)
#Building the optimal model using Backward Elimination
import statsmodels.api as sm  # OLS lives in statsmodels.api, not statsmodels.formula.api
a, b = X.shape
X = np.append(arr = np.ones((a, 1)).astype(int), values = X, axis = 1)  # prepend a column of ones for the intercept term
print(X.shape)
X_optimal = X[:,[0,1,2,3,4,5]]  # start with every predictor (column 0 is the intercept)
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit()
regressor_OLS.summary()
X_optimal = X[:,[0,1,3,4,5]]  # drop column 2, the highest p-value in the last summary
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit()
regressor_OLS.summary()
X_optimal = X[:,[0,3,4,5]]  # drop column 1
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit()
regressor_OLS.summary()
X_optimal = X[:,[0,3,5]]  # drop column 4
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit()
regressor_OLS.summary()
X_optimal = X[:,[0,3]]  # drop column 5
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit()
regressor_OLS.summary()
Now, the way the elimination is performed seems really manual to me, and I'd like to automate it. To do so, I'd like to know whether there is a way to get the p-values of the regressor back (e.g. if there is a method in statsmodels that does that). That way I think I could loop over the features of the X_optimal array, check whether each p-value is greater than my significance level (SL), and eliminate the feature if it is, as in the sketch below.
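Roughly, this is what I have in mind (just a sketch of the idea, assuming the fitted results expose the p-values as an array with one entry per column of X_optimal):

import numpy as np

SL = 0.05                                    # my significance level
X_optimal = X[:, [0, 1, 2, 3, 4, 5]]         # start from all predictors
while True:
    regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit()
    pvalues = regressor_OLS.pvalues          # one p-value per remaining column
    worst = int(np.argmax(pvalues))          # column with the highest p-value
    if pvalues[worst] <= SL:
        break                                # everything left is significant
    X_optimal = np.delete(X_optimal, worst, axis = 1)   # eliminate that column
regressor_OLS.summary()

(Note this treats the column of ones like any other predictor, so in principle it could eliminate the intercept too.)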
Thank you!
Thank you Keith for your answer. Just some small fixes to Keith's loop to make it more efficient:
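Keith's loop isn't reproduced here, so this is just my take on the fixed version, under the assumption that his answer refits the model and removes the least significant column one round at a time; the fix is to find that column with np.argmax instead of an inner Python loop (the helper name backward_elimination is mine):

import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, sl = 0.05):
    X_opt = X.copy()
    while True:
        regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
        worst = int(np.argmax(regressor_OLS.pvalues))   # least significant column
        if regressor_OLS.pvalues[worst] <= sl:
            return X_opt, regressor_OLS                 # all p-values are within SL
        X_opt = np.delete(X_opt, worst, axis = 1)       # drop it and refit

X_modeled, regressor_OLS = backward_elimination(X, y, 0.05)
regressor_OLS.summary()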
Ran into the same problem.
You can access the p-values through regressor_OLS.pvalues.
They're stored as an array of float64s in scientific notation. I'm a bit new to Python and I'm sure there are cleaner, more elegant solutions, but this was mine:
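The snippet itself is missing above, so here is a reconstruction of that kind of approach: keep a list of the surviving column indices, refit, and whenever the highest p-value exceeds the significance level delete that column's index.

import numpy as np
import statsmodels.api as sm

SL = 0.05
cols = list(range(X.shape[1]))       # indices of the columns still in the model
while True:
    regressor_OLS = sm.OLS(endog = y, exog = X[:, cols]).fit()
    p = regressor_OLS.pvalues        # array of float64s (printed in scientific notation)
    if p.max() <= SL:
        break                        # nothing left to eliminate
    del cols[int(np.argmax(p))]      # drop the least significant column
print(cols)                          # which original columns survived
regressor_OLS.summary()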