Spark MlLib linear regression (Linear least square

I'm new to Spark and to machine learning in general. I have successfully followed some of the MLlib tutorials, but I can't get this one working.

I found the sample code here: https://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression

(section LinearRegressionWithSGD)

Here is the code:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

// Building the model
val numIterations = 100
val model = LinearRegressionWithSGD.train(parsedData, numIterations)

// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
println("training Mean Squared Error = " + MSE)

// Save and load model
model.save(sc, "myModelPath")
val sameModel = LinearRegressionModel.load(sc, "myModelPath")

(That is exactly what is on the website.)

The result is

training Mean Squared Error = 6.2087803138063045

and

valuesAndPreds.collect

gives

    Array[(Double, Double)] = Array(
      (-0.4307829,-1.8383286021929077), (-0.1625189,-1.4955700806407322),
      (-0.1625189,-1.118820892849544), (-0.1625189,-1.6134108278724875),
      (0.3715636,-0.45171266551058276), (0.7654678,-1.861316066986158),
      (0.8544153,-0.3588282725617985), (1.2669476,-0.5036812148225209),
      (1.2669476,-1.1534698170911792), (1.2669476,-0.3561392231695041),
      (1.3480731,-0.7347031705813306), (1.446919,-0.08564658011814863),
      (1.4701758,-0.656725375080344), (1.4929041,-0.14020483324910105),
      (1.5581446,-1.9438858658143454), (1.5993876,-0.02181165554398845),
      (1.6389967,-0.3778677315868635), (1.6956156,-1.1710092824030043),
      (1.7137979,0.27583044213064634), (1.8000583,0.7812664902440078),
      (1.8484548,0.94605507153074), (1.8946169,-0.7217282082851512),
      (1.9242487,-0.24422843221437684),...

My problem is that the predictions look totally random (and wrong). Since this is an exact copy of the website example, with the same input data (training set), I don't know where to look. Am I missing something?

Please give me some advice or a clue about where to look; I can read and experiment.

Thanks

Tags: apache-spark machine-learning apache-spark-mllib

2 answers

▲ chillily

Reply #2 · 2020-02-02 00:24

LinearRegressionWithSGD is based on stochastic gradient descent (SGD), which requires tuning the step size; see http://spark.apache.org/docs/latest/mllib-optimization.html for more details.

In your example, if you set the step size to 0.1 you get better results (MSE = 0.5).

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

// Build the model
val regression = new LinearRegressionWithSGD().setIntercept(true)
regression.optimizer.setStepSize(0.1)
val model = regression.run(parsedData)

// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
println("training Mean Squared Error = " + MSE)
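
To give a feel for why the step size matters, here is a minimal sketch in plain Scala (no Spark). It is not MLlib's implementation: it runs full-batch gradient descent with a constant step size on a hypothetical one-feature dataset (y = 3 * x), and the names StepSizeSketch, fit and mse are made up for illustration.

```scala
object StepSizeSketch {
  // Hypothetical toy data: y = 3 * x, one feature, no intercept needed.
  val xs: Seq[Double] = Seq(1.0, 2.0, 3.0)
  val ys: Seq[Double] = xs.map(_ * 3.0)

  // Mean squared error of the model y = w * x on the toy data.
  def mse(w: Double): Double =
    xs.zip(ys).map { case (x, y) => math.pow(y - w * x, 2) }.sum / xs.size

  // Full-batch gradient descent on 0.5 * MSE with a constant step size,
  // starting from w = 0. (Assumption: a simplification, not MLlib's SGD.)
  def fit(stepSize: Double, numIterations: Int): Double =
    (1 to numIterations).foldLeft(0.0) { (w, _) =>
      val gradient = xs.zip(ys).map { case (x, y) => (w * x - y) * x }.sum / xs.size
      w - stepSize * gradient
    }

  def main(args: Array[String]): Unit =
    for (step <- Seq(0.001, 0.1, 0.5)) {
      val w = fit(step, 100)
      println(s"stepSize=$step -> w=$w, MSE=${mse(w)}")
    }
}
```

On this toy problem, a step that is too small leaves the weight far from 3 after 100 iterations, a moderate step converges, and a step that is too large makes the error grow at every iteration. The right range depends on the scale of the features, so the values here should not be carried over to lpsa.data.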

For another example on a more realistic dataset, see

https://github.com/selvinsource/spark-pmml-exporter-validator/blob/master/src/main/resources/datasets/winequalityred_linearregression.md

https://github.com/selvinsource/spark-pmml-exporter-validator/blob/master/src/main/resources/spark_shell_exporter/linearregression_winequalityred.scala


够拽才男人

Reply #3 · 2020-02-02 00:35

As zero323 explained, setting the intercept to true will solve the problem. If it is not set to true, the regression line is forced to go through the origin, which is not appropriate in this case. (I'm not sure why this is not included in the sample code.)

So, to fix your problem, change the following line in your code (PySpark):

model = LinearRegressionWithSGD.train(parsedData, numIterations)

to:

model = LinearRegressionWithSGD.train(parsedData, numIterations, intercept=True)

Although it is not mentioned explicitly, this is also why the code from 'selvinsource' in the answer above works: it calls setIntercept(true). Changing the step size doesn't help much in this example.
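
To illustrate why forcing the line through the origin hurts, here is a small sketch in plain Scala (no Spark) that fits a line by closed-form least squares with and without an intercept. The data (y = 2 + 0.5 * x) and the names InterceptSketch, wOrigin, mseOrigin and mseIntercept are hypothetical and chosen only for illustration; this is not how MLlib fits the model.

```scala
object InterceptSketch {
  // Hypothetical toy data: y = 2 + 0.5 * x, so the true line misses the origin.
  val xs: Seq[Double] = Seq(1.0, 2.0, 3.0, 4.0, 5.0)
  val ys: Seq[Double] = xs.map(x => 2.0 + 0.5 * x)

  def mean(v: Seq[Double]): Double = v.sum / v.size

  // Mean squared error of an arbitrary predictor on the toy data.
  def mse(predict: Double => Double): Double =
    mean(xs.zip(ys).map { case (x, y) => math.pow(y - predict(x), 2) })

  // Least squares forced through the origin: w = sum(x * y) / sum(x * x).
  val wOrigin: Double =
    xs.zip(ys).map { case (x, y) => x * y }.sum / xs.map(x => x * x).sum

  // Least squares with an intercept: w = cov(x, y) / var(x), b = mean(y) - w * mean(x).
  val xBar: Double = mean(xs)
  val yBar: Double = mean(ys)
  val w: Double =
    xs.zip(ys).map { case (x, y) => (x - xBar) * (y - yBar) }.sum /
      xs.map(x => math.pow(x - xBar, 2)).sum
  val b: Double = yBar - w * xBar

  val mseOrigin: Double = mse(x => wOrigin * x)
  val mseIntercept: Double = mse(x => w * x + b)

  def main(args: Array[String]): Unit = {
    println(s"through origin: w=$wOrigin, MSE=$mseOrigin")
    println(s"with intercept: w=$w, b=$b, MSE=$mseIntercept")
  }
}
```

Without the intercept, even the best possible slope leaves a clearly nonzero error on this data, while the fit with an intercept recovers the line. No amount of optimizer tuning can fix that, because the model itself cannot represent the offset.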
