I hope every one of you is OK and that Covid-19 is not affecting your life too much.
I am struggling with some PySpark code; in particular, I'd like to call a function on a Column object, which is not iterable.
from pyspark.sql.functions import col, lower, regexp_replace, split
from googletrans import Translator

def clean_text(c):
    c = lower(c)
    c = regexp_replace(c, r"^rt ", "")
    c = regexp_replace(c, r"(https?\://)\S+", "")
    c = regexp_replace(c, r"[^a-zA-Z0-9\s]", "")  # remove punctuation
    c = regexp_replace(c, r"\n", " ")
    c = regexp_replace(c, r" +", " ")  # collapse repeated spaces
    # c = translator.translate(c, dest='en', src='auto')
    return c

clean_text_df = uncleanedText.select(clean_text(col("unCleanedCol")).alias("sentence"))
clean_text_df.printSchema()
clean_text_df.show(10)
As soon as I uncomment the line c = translator.translate(c, dest='en', src='auto') and run the code, Spark raises TypeError: Column is not iterable.
What I would like to do is a word-by-word translation:
From:
+--------------------+
| sentence|
+--------------------+
|ciao team there a...|
|dear itteam i urg...|
|buongiorno segnal...|
|hi team regarding...|
|hello please add ...|
|ciao vorrei effet...|
|buongiorno ho vis...|
+--------------------+
To:
+--------------------+
| sentence|
+--------------------+
|hello team there ...|
|dear itteam i urg...|
|goodmorning segna...|
|hi team regarding...|
|hello please add ...|
|hello would effet...|
|goodmorning I see...|
+--------------------+
The schema of the DataFrame is:
root
|-- sentence: string (nullable = true)
Could anyone please help me?
Thank you very much
PySpark is just the Python API for Apache Spark. If you want to use custom Python functions, you will have to define a user-defined function (udf). Keep your clean_text() function as is (with the translate line commented out) and try the following.

The other functions in your original clean_text (lower and regexp_replace) are built-in Spark functions and operate on a pyspark.sql.Column.

Be aware that using this udf will bring a performance hit. See: Spark functions vs UDF performance?