How to read a whole file into one string

Posted 2019-01-26 09:49

Question:

I want to read a JSON or XML file in PySpark. My file is split across multiple lines when I read it with:

rdd = sc.textFile(json or xml)

Input

{
  "employees": [
    {
      "firstName": "John",
      "lastName": "Doe"
    },
    {
      "firstName": "Anna"
    }
  ]
}

Input is spread across multiple lines.

Expected output: {"employees":[{"firstName":"John",......]}

How do I get the complete file as a single line using PySpark?

Please help me, I am new to Spark.

Answer 1:

If your data is not formed on one line as textFile expects, then use wholeTextFiles. This will give you the whole file so that you can parse it down into whatever format you would like.
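For example, a minimal PySpark sketch (the path and the json.loads parsing step are my own illustration, not part of the original answer):

import json

# wholeTextFiles yields (file path, file contents) pairs, one per file
rdd = sc.wholeTextFiles('/home/folder_with_text_files/input_file.json')
whole_file_string = rdd.values().first()   # the entire file as a single string
parsed = json.loads(whole_file_string)     # parse it however you like, e.g. as JSON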



Answer 2:

There are three ways (the first two are standard built-in Spark functions; I came up with the third), with solutions in PySpark:

textFile, wholeTextFiles, and a labeled textFile (key = file path, value = one line from that file; a mix of the first two ways of parsing files).

1.) textFile

input: rdd = sc.textFile('/home/folder_with_text_files/input_file')

output: an array containing one line of the file per entry, i.e. [line1, line2, ...]
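If, as in the question, you just want the whole file back as one string, one option is to rejoin the collected lines (a sketch of my own that assumes the file is small enough to collect to the driver):

rdd = sc.textFile('/home/folder_with_text_files/input_file')
whole_file_string = '\n'.join(rdd.collect())  # rejoin the lines on the driver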

2.) wholeTextFiles

input: rdd = sc.wholeTextFiles('/home/folder_with_text_files/*')

output: an array of tuples; the first item is the "key" (the file path), the second item is that file's entire contents, i.e.

[(u'file:/home/folder_with_text_files/file1.txt', u'file1_contents'), (u'file:/home/folder_with_text_files/file2.txt', u'file2_contents'), ...]
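For instance, to turn that into a plain Python dict mapping file path to file contents, collectAsMap on the pair RDD is one option (a small sketch of mine, assuming the files fit in driver memory and reusing the folder path from above):

files_by_path = sc.wholeTextFiles('/home/folder_with_text_files/*').collectAsMap()
# files_by_path[u'file:/home/folder_with_text_files/file1.txt'] -> u'file1_contents'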

3.) "Labeled" textFile

input:

import glob
from pyspark import SparkContext
from pyspark.sql import SQLContext

# call sc.stop() first if a SparkContext is already running
sc = SparkContext("local", "example")  # if running locally
sqlContext = SQLContext(sc)

Data_File = "/home/folder_with_text_files"  # folder containing the input files (matches the examples above)
Spark_Full = sc.emptyRDD()                  # start empty and union in one RDD per file

for filename in glob.glob(Data_File + "/*"):
    # bind the current filename as a default argument so each file keeps its own key
    Spark_Full += sc.textFile(filename).keyBy(lambda line, fn=filename: fn)

output: an array in which each entry is a tuple with the filename as the key and one line of that file as the value. (Technically, you can also use a different key than the actual file path, perhaps a hashed representation, to save memory.) I.e.

[('/home/folder_with_text_files/file1.txt', 'file1_contents_line1'),
 ('/home/folder_with_text_files/file1.txt', 'file1_contents_line2'),
 ('/home/folder_with_text_files/file1.txt', 'file1_contents_line3'),
 ('/home/folder_with_text_files/file2.txt', 'file2_contents_line1'),
  ...]

You can also recombine it into a list of lines per file:

Spark_Full.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()

[('/home/folder_with_text_files/file1.txt', ['file1_contents_line1', 'file1_contents_line2','file1_contents_line3']),
 ('/home/folder_with_text_files/file2.txt', ['file2_contents_line1'])]

Or recombine entire files back into single strings (in this example the result is the same as what you get from wholeTextFiles, but with the "file:" prefix stripped from the file paths):

Spark_Full.groupByKey().map(lambda x: (x[0], ' '.join(list(x[1])))).collect()
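Note that joining with ' ' flattens the original line breaks; if you want to keep them (for example before parsing multi-line JSON), joining with '\n' instead is a reasonable variation (my own tweak, not part of the original answer):

Spark_Full.groupByKey().map(lambda x: (x[0], '\n'.join(x[1]))).collect()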



Answer 3:

This is how you would do it in Scala:

val rdd = sc.wholeTextFiles("hdfs://nameservice1/user/me/test.txt")
rdd.collect.foreach(t => println(t._2))
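For reference, a rough PySpark equivalent of the same snippet (my own sketch, not from the original answer):

rdd = sc.wholeTextFiles("hdfs://nameservice1/user/me/test.txt")
for path, contents in rdd.collect():
    print(contents)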


Answer 4:

"How to read whole [HDFS] file in one string [in Spark, to use as sql]":

e.g.

// Put file to hdfs from edge-node's shell...

hdfs dfs -put <filename>

// Within spark-shell...

// 1. Load file as one string
val f = sc.wholeTextFiles("hdfs:///user/<username>/<filename>")
val hql = f.take(1)(0)._2

// 2. Use string as sql/hql
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val results = hiveContext.sql(hql)
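
A rough PySpark equivalent of the same idea (my own sketch; it assumes a Spark 2.x SparkSession named spark and keeps the original placeholder paths):

# 1. Load file as one string
f = sc.wholeTextFiles("hdfs:///user/<username>/<filename>")
hql = f.take(1)[0][1]

# 2. Use string as sql/hql
results = spark.sql(hql)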