I'm new to pyspark. I would like to perform some machine learning on a text file.
from pyspark import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
sc = SparkContext
spark = SparkSession.builder.appName("ML").getOrCreate()
train_data = spark.read.text("20ng-train-all-terms.txt")
td = train_data.rdd  # convert df to rdd
tr_data = td.map(lambda line: line.split()).map(lambda words: Row(label=words[0], words=words[1:]))
from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(td)
For my last command, I obtain the error "AttributeError: 'RDD' object has no attribute '_jdf'".
Can anyone help me please? Thank you.
You shouldn't be using the rdd with CountVectorizer. Instead, you should form the array of words in the dataframe itself and fit the vectorizer on that dataframe; then it works, and you can call the transform function on the fitted model, as sketched below.
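A minimal sketch of that dataframe approach, assuming the file is space-delimited with the label as each line's first token (the column names all_tokens and bag_of_words are only illustrative, and slice inside expr needs Spark 2.4+; on older versions a UDF can build the words array instead):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import CountVectorizer

spark = SparkSession.builder.appName("ML").getOrCreate()
train_data = spark.read.text("20ng-train-all-terms.txt")

# stay in the dataframe: split every line, first token = label, the rest = words
train_df = (
    train_data
    .withColumn("all_tokens", F.split(F.col("value"), " "))
    .withColumn("label", F.col("all_tokens").getItem(0))
    .withColumn("words", F.expr("slice(all_tokens, 2, size(all_tokens) - 1)"))
    .drop("all_tokens", "value")
)

vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(train_df)       # fit takes a DataFrame, not an RDD
train_bag_of_words = vectorizer_transformer.transform(train_df)
train_bag_of_words.show(5, truncate=False)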
Now, if you want to stick to the old style of converting to the rdd first, then you have to modify certain lines of your code so that the rdd ends up as a dataframe again before fitting.
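A sketch of what that modified, working version could look like (each Row produced by spark.read.text has a single value field, so split on line.value, and convert the rdd back to a dataframe before calling fit):

from pyspark.sql import SparkSession, Row
from pyspark.ml.feature import CountVectorizer

spark = SparkSession.builder.appName("ML").getOrCreate()
train_data = spark.read.text("20ng-train-all-terms.txt")

td = train_data.rdd  # rdd of Rows, each with a single 'value' field
tr_data = td.map(lambda line: line.value.split()) \
            .map(lambda words: Row(label=words[0], words=words[1:]))
tr_df = spark.createDataFrame(tr_data)  # back to a dataframe before fitting

vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(tr_df)
train_bag_of_words = vectorizer_transformer.transform(tr_df)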
But I would suggest you stick with the dataframe way.