I have a data like this data = [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]]
I want to create a PySpark dataframe
I already use
dataframe = SQLContext.createDataFrame(data, ['features'])
but I always get
+--------+---+
|features| _2|
+--------+---+
| 1.1|1.2|
| 1.3|1.4|
| 1.5|1.6|
+--------+---+
how can I get result like below?
+----------+
|features |
+----------+
|[1.1, 1.2]|
|[1.3, 1.4]|
|[1.5, 1.6]|
+----------+
I find it's useful to think of the argument to createDataFrame()
as a list of tuples where each entry in the list corresponds to a row in the DataFrame and each element of the tuple corresponds to a column.
You can get your desired output by making each element in the list a tuple:
data = [([1.1, 1.2],), ([1.3, 1.4],), ([1.5, 1.6],)]
dataframe = sqlCtx.createDataFrame(data, ['features'])
dataframe.show()
#+----------+
#| features|
#+----------+
#|[1.1, 1.2]|
#|[1.3, 1.4]|
#|[1.5, 1.6]|
#+----------+
Or if changing the source is cumbersome, you can equivalently do:
data = [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]]
dataframe = sqlCtx.createDataFrame(map(lambda x: (x, ), data), ['features'])
dataframe.show()
#+----------+
#| features|
#+----------+
#|[1.1, 1.2]|
#|[1.3, 1.4]|
#|[1.5, 1.6]|
#+----------+
You need a map
function to convert the tuples
to array
and use it in createDataFrame
dataframe = sqlContext.createDataFrame(sc.parallelize(data).map(lambda x: [x]), ['features'])
You should get as you desire
+----------+
| features|
+----------+
|[1.1, 1.2]|
|[1.3, 1.4]|
|[1.5, 1.6]|
+----------+
You should use the Vector Assembler function, from your code I guess you are doing this to train a machine learning model, and vector assembler works the best for that case. You can also add the assembler in the pipeline.
assemble_feature=VectorAssembler(inputCol=data.columns,outputCol='features')
pipeline=Pipeline(stages=[assemble_feature])
pipeline.fit(data).transform(data)