My data:
State N Var1 Var2
Alabama 23 54 42
Alaska 4 53 53
Arizona 53 75 65
Var1 and Var2 are aggregated percentage values at the state level. N is the number of participants in each state. I would like to run a linear regression between Var1 and Var2, using N as the weight, with sklearn in Python 2.7.
The general line is:
fit(X, y[, sample_weight])
Say the data is loaded into df using Pandas and N becomes df["N"]. Do I simply pass the data as in the following line, or do I need to process N somehow before using it as sample_weight?
fit(df["Var1"], df["Var2"], sample_weight=df["N"])
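The weights enable training a model that is more accurate for certain values of the input (e.g., where the cost of error is higher). Internally, each weight w_i multiplies the corresponding squared residual in the loss function [1]:
    loss = sum_i w_i * (y_i - y_hat_i)^2
Multiplying every weight by the same positive constant only rescales the loss and leaves its minimizer unchanged. Therefore, it is the relative scale of the weights that matters.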
N can be passed as is if it already reflects the priorities. Uniform scaling would not change the outcome.
Here is an example. In the weighted version, we emphasize the region around the last two samples, and the model becomes more accurate there. Scaling the weights does not affect the outcome, as expected.
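A minimal sketch of such a comparison, assuming scikit-learn's LinearRegression and made-up toy data (the arrays X, y and w below are hypothetical, not taken from the question). It fits the same data without weights, with weights that emphasize the last two samples, and with those weights multiplied by 100:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: the last two samples break the trend of the first four.
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)
y = np.array([0.0, 1.0, 2.0, 3.0, 8.0, 10.0])

# Weights that emphasize the last two samples.
w = np.array([1.0, 1.0, 1.0, 1.0, 10.0, 10.0])

unweighted = LinearRegression().fit(X, y)
weighted = LinearRegression().fit(X, y, sample_weight=w)
scaled = LinearRegression().fit(X, y, sample_weight=100.0 * w)


def tail_error(model):
    # Mean absolute error on the last two (emphasized) samples.
    return np.abs(model.predict(X[-2:]) - y[-2:]).mean()


# Uniformly scaling the weights should leave the fitted line unchanged.
same_fit = (np.allclose(weighted.coef_, scaled.coef_)
            and np.allclose(weighted.intercept_, scaled.intercept_))

print("unweighted tail error: {:.3f}".format(tail_error(unweighted)))
print("weighted tail error:   {:.3f}".format(tail_error(weighted)))
print("same fit after scaling: {}".format(same_fit))
```

The weighted fit should show a smaller error on the last two samples than the unweighted one, while the scaled weights should reproduce the weighted coefficients.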
(this transformation also seems necessary for passing Var1 and Var2 to fit)
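For the data in the question, a sketch along these lines may help. It assumes the transformation in question is reshaping the single feature column into a 2-D array of shape (n_samples, n_features), which is what fit expects for X; the DataFrame is rebuilt by hand from the three rows shown, and N is passed unchanged as sample_weight:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# The three rows shown in the question, rebuilt by hand for illustration.
df = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona"],
    "N": [23, 4, 53],
    "Var1": [54, 53, 75],
    "Var2": [42, 53, 65],
})

# X must be 2-D: one row per sample, one column per feature.
X = df["Var1"].values.reshape(-1, 1)
# y may stay 1-D.
y = df["Var2"].values

# N is used as is; only the relative sizes of the weights matter.
model = LinearRegression().fit(X, y, sample_weight=df["N"].values)

print("slope: {:.4f}".format(model.coef_[0]))
print("intercept: {:.4f}".format(model.intercept_))
```

With only three rows the fit is purely illustrative; the same calls apply unchanged to the full table.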