Count number of words in a spark dataframe

How can we count the number of words in a column of a Spark DataFrame without using the SQL REPLACE() function? Below are the code and input I am working with, but the REPLACE() function does not work.

from pyspark.sql import SparkSession
my_spark = SparkSession \
    .builder \
    .appName("Python Spark SQL example") \
    .enableHiveSupport() \
    .getOrCreate()

parqFileName = 'gs://caserta-pyspark-eval/train.pqt'
tuesdayDF = my_spark.read.parquet(parqFileName)

tuesdayDF.createOrReplaceTempView("parquetFile")
tuesdaycrimes = my_spark.sql("SELECT LENGTH(Address) - LENGTH(REPLACE(Address, ' ', ''))+1 FROM parquetFile")

tuesdaycrimes.show()


+-------------------+--------------+--------------------+---------+----------+--------------+--------------------+-----------+---------+
|              Dates|      Category|            Descript|DayOfWeek|PdDistrict|    Resolution|             Address|          X|        Y|
+-------------------+--------------+--------------------+---------+----------+--------------+--------------------+-----------+---------+
|2015-05-14 03:53:00|      WARRANTS|      WARRANT ARREST|Wednesday|  NORTHERN|ARREST, BOOKED|  OAK ST / LAGUNA ST| -122.42589|37.774597|
|2015-05-14 03:53:00|OTHER OFFENSES|TRAFFIC VIOLATION...|Wednesday|  NORTHERN|ARREST, BOOKED|  OAK ST / LAGUNA ST| -122.42589|37.774597|
|2015-05-14 03:33:00|OTHER OFFENSES|TRAFFIC VIOLATION...|Wednesday|  NORTHERN|ARREST, BOOKED|VANNESS AV / GREE...| -122.42436|37.800415|

Tags: python apache-spark pyspark apache-spark-sql

4 answers

一夜七次

#2 · 2020-05-29 04:03

# Total number of space-separated words across all rows of the Address column
tuesdayDF.select("Address").rdd.flatMap(lambda row: row.Address.split(" ")).count()


看我几分像从前

#3 · 2020-05-29 04:11

You can define a UDF as follows:

def splitAndCountUdf(x):
    return len(x.split(" "))

from pyspark.sql import functions as F
countWords = F.udf(splitAndCountUdf, 'int')

and call it with withColumn():

tuesdayDF.withColumn("wordCount", countWords(tuesdayDF.Address))

If you want the count of distinct words instead, change the UDF to wrap the split result in a set:

def splitAndCountUdf(x):
    return len(set(x.split(" ")))

from pyspark.sql import functions as F
countWords = F.udf(splitAndCountUdf, 'int')
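
One caveat with both versions: x.split(" ") fails on NULL addresses and counts an empty string for every extra space. Below is a minimal sketch of a more defensive variant, assuming that "words" means whitespace-separated tokens. The names count_words and make_count_words_udf are hypothetical, and pyspark is imported lazily only so that the plain function can be tried without a Spark session.

```python
def count_words(text, distinct=False):
    """Count whitespace-separated tokens; return None for NULL input."""
    if text is None:
        return None
    # str.split() with no argument collapses runs of whitespace and
    # ignores leading/trailing spaces, unlike split(" ").
    words = text.split()
    return len(set(words)) if distinct else len(words)


def make_count_words_udf(distinct=False):
    """Wrap count_words as a Spark UDF that returns an int column."""
    # Imported here so count_words can be used without pyspark installed.
    from pyspark.sql import functions as F
    return F.udf(lambda text: count_words(text, distinct), "int")


print(count_words("OAK ST / LAGUNA ST"))                 # 5
print(count_words("OAK ST / LAGUNA ST", distinct=True))  # 4
print(count_words("OAK  ST "))                           # 2
print(len("OAK  ST ".split(" ")))                        # 4, extra spaces inflate the count
```

It would be applied the same way as before, for example tuesdayDF.withColumn("wordCount", make_count_words_udf()(tuesdayDF.Address)).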


兄弟一词,经得起流年.

#4 · 2020-05-29 04:14

You can do this using only the split and size functions from the pyspark API. For example:

from pyspark.sql import functions as F

sqlContext.createDataFrame([['this is a sample address'],['another address']])\
.select(F.size(F.split(F.col("_1"), " "))).show()

Output:
+------------------+
|size(split(_1,  ))|
+------------------+
|                 5|
|                 2|
+------------------+
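
Since the question ran its query through spark.sql, the same split/size idea can also be written as SQL, with no REPLACE() involved. A minimal sketch, assuming the temp view parquetFile and the Address column from the question; the helper name word_count_sql is hypothetical:

```python
def word_count_sql(column, view):
    """Build a Spark SQL query that counts space-separated words per row."""
    # split() breaks the string into an array on the given pattern;
    # size() returns the number of elements in that array.
    return "SELECT {col}, size(split({col}, ' ')) AS wordCount FROM {view}".format(
        col=column, view=view
    )


query = word_count_sql("Address", "parquetFile")
print(query)
```

With the session from the question, this would be run as my_spark.sql(query).show().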


Animai°情兽

#5 · 2020-05-29 04:15

There are a number of ways to count words using pyspark DataFrame functions, depending on what you are looking for.

Create Example Data

import pyspark.sql.functions as f
data = [
    ("2015-05-14 03:53:00", "WARRANT ARREST"),
    ("2015-05-14 03:53:00", "TRAFFIC VIOLATION"),
    ("2015-05-14 03:33:00", "TRAFFIC VIOLATION")
]

df = sqlCtx.createDataFrame(data, ["Dates", "Description"])
df.show()

In this example, we will count the words in the Description column.

Count in each row

If you want the number of words in the specified column for each row, you can create a new column using withColumn() as follows:

Use pyspark.sql.functions.split() to break the string into a list
Use pyspark.sql.functions.size() to count the length of the list

For example:

df = df.withColumn('wordCount', f.size(f.split(f.col('Description'), ' ')))
df.show()
#+-------------------+-----------------+---------+
#|              Dates|      Description|wordCount|
#+-------------------+-----------------+---------+
#|2015-05-14 03:53:00|   WARRANT ARREST|        2|
#|2015-05-14 03:53:00|TRAFFIC VIOLATION|        2|
#|2015-05-14 03:33:00|TRAFFIC VIOLATION|        2|
#+-------------------+-----------------+---------+
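
Note that splitting on a single ' ' counts an empty token for every extra space. If the data may contain repeated, leading, or trailing spaces, one option is to trim first and split on a run of spaces. This is only a sketch: add_word_count and expected_word_count are hypothetical helper names, the second is a plain-Python mirror intended for checking expectations without Spark, and pyspark is imported lazily for the same reason.

```python
import re


def add_word_count(df, column, out="wordCount"):
    """Return df with a word-count column that tolerates repeated spaces."""
    # Imported lazily so this snippet loads even where pyspark is absent.
    import pyspark.sql.functions as f
    # trim() drops leading/trailing spaces; " +" is a regex matching one
    # or more spaces, so "A  B" splits into two tokens, not three.
    tokens = f.split(f.trim(f.col(column)), " +")
    return df.withColumn(out, f.size(tokens))


def expected_word_count(text):
    """Plain-Python mirror of the expression above, for test expectations."""
    return len(re.split(" +", text.strip(" ")))


print(expected_word_count("WARRANT ARREST"))     # 2
print(expected_word_count(" TRAFFIC  VIOLATION "))  # 2
print(expected_word_count(""))                   # 1, an empty string still yields one token
```

With the example data, add_word_count(df, 'Description').show() should give the same counts as the table above, since those descriptions contain only single spaces.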

Sum word count over all rows

If you want the total number of words in the column across the entire DataFrame, you can use pyspark.sql.functions.sum():

df.select(f.sum('wordCount')).collect() 
#[Row(sum(wordCount)=6)]

Count occurrence of each word

If you want the count of each word across the entire DataFrame, you can use split() and pyspark.sql.functions.explode(), followed by groupBy() and count().

df.withColumn('word', f.explode(f.split(f.col('Description'), ' ')))\
    .groupBy('word')\
    .count()\
    .sort('count', ascending=False)\
    .show()
#+---------+-----+
#|     word|count|
#+---------+-----+
#|  TRAFFIC|    2|
#|VIOLATION|    2|
#|  WARRANT|    1|
#|   ARREST|    1|
#+---------+-----+
