I have a DataFrame in Apache Spark with an array of integers, the source is a set of images. I ultimately want to do PCA on it, but I am having trouble just creating a matrix from my arrays. How do I create a matrix from a RDD?
> imagerdd = traindf.map(lambda row: map(float, row.image))
> mat = DenseMatrix(numRows=206456, numCols=10, values=imagerdd)
Traceback (most recent call last):
File "<ipython-input-21-6fdaa8cde069>", line 2, in <module>
mat = DenseMatrix(numRows=206456, numCols=10, values=imagerdd)
File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 815, in __init__
values = self._convert_to_array(values, np.float64)
File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 806, in _convert_to_array
return np.asarray(array_like, dtype=dtype)
File "/usr/local/python/conda/lib/python2.7/site- packages/numpy/core/numeric.py", line 462, in asarray
return array(a, dtype, copy=False, order=order)
TypeError: float() argument must be a string or a number
I'm getting the same error from every possible arrangement I can think of:
imagerdd = traindf.map(lambda row: Vectors.dense(row.image))
imagerdd = traindf.map(lambda row: row.image)
imagerdd = traindf.map(lambda row: np.array(row.image))
If I try
> imagedf = traindf.select("image")
> mat = DenseMatrix(numRows=206456, numCols=10, values=imagedf)
Traceback (most recent call last):
File "<ipython-input-26-a8cbdad10291>", line 2, in <module>
mat = DenseMatrix(numRows=206456, numCols=10, values=imagedf)
File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 815, in __init__
values = self._convert_to_array(values, np.float64)
File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 806, in _convert_to_array
return np.asarray(array_like, dtype=dtype)
File "/usr/local/python/conda/lib/python2.7/site-packages/numpy/core/numeric.py", line 462, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.