How to read files from the GetFile processor in NiFi

Posted 2019-08-21 10:18

Question:

Below is my flow:

GetFile > ExecuteSparkInteractive > PutFile

I want to read the files picked up by the GetFile processor in the ExecuteSparkInteractive processor, apply some transformations, and write the result to some location.

I wrote the following Spark Scala code in the code section of the Spark processor:

val sc1=sc.textFile("local_path")
sc1.foreach(println)

Nothing happens in the flow. How can I read the files from the GetFile processor in the Spark processor?

2nd Part:
I tried the flow below just for practice:

ExecuteScript > PutFile > LogMessage

and I put the following code in the ExecuteScript processor:

readFile = open("/home/cloudera/Desktop/sample/data","r")
for line in readFile:
    lines = line.strip()
    finalline = re.sub(pattern='((?<=[0-9])[0-9]|(?<=\.)[0-9])',repl='X',string=lines)
readFile = open("/home/cloudera/Desktop/sample/data","w")
readFile.write(finalline)  

The code runs fine, but it doesn't write the formatted data into the destination folder. Where am I going wrong? Also, I installed pandas on my local machine and ran pandas code from the ExecuteScript processor, but NiFi can't load the pandas module. Why is that? I tried my best, and I couldn't find any relevant links showing a basic flow for this.
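Independent of NiFi, the script above has a few problems that are visible from the code alone: `re` is never imported, `finalline` is overwritten on every iteration so only the last line survives, and the source file is reopened for writing, which truncates it. A minimal standalone sketch in plain Python, assuming the goal is to mask every line and write the result to a separate file (the names `mask_line` and `mask_file` and the output path are hypothetical, not from the question):

```python
import re

# Same pattern as in the question: a digit preceded by another digit or by a dot.
MASK = re.compile(r'(?<=[0-9])[0-9]|(?<=\.)[0-9]')


def mask_line(line):
    """Strip the line and replace each digit that follows a digit or a dot with 'X'."""
    return MASK.sub('X', line.strip())


def mask_file(src_path, dst_path):
    """Mask every line of src_path and write all masked lines to dst_path."""
    with open(src_path, 'r') as src:
        masked = [mask_line(line) for line in src]
    # Write to a different file so the input is not truncated.
    with open(dst_path, 'w') as dst:
        dst.writelines(m + '\n' for m in masked)
    return masked


# Hypothetical usage; the output path is an assumption:
# mask_file("/home/cloudera/Desktop/sample/data", "/home/cloudera/Desktop/sample/data_masked")
```

Note that this only covers the file handling. Inside ExecuteScript, the script would also have to create or update a FlowFile and transfer it through the `session` object, otherwise PutFile has nothing to write. The Python engine in ExecuteScript is Jython, which cannot load native CPython extensions, and that is the likely reason pandas cannot be imported there.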

Answer 1:

This is not really how it works. GetFile picks up files local to the NiFi node and brings them into the NiFi flow for processing. ExecuteSparkInteractive kicks off a Spark job on a remote Spark cluster; it does not transfer data to Spark. So you would likely want to put the data somewhere Spark can access it, for example GetFile -> PutHDFS -> ExecuteSparkInteractive.
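With a flow like GetFile -> PutHDFS -> ExecuteSparkInteractive, the code submitted by the Spark processor reads from the HDFS directory that PutHDFS wrote to, not from a local path. A rough PySpark sketch, assuming the session is of the PySpark kind and that `sc` is already defined in it; the HDFS paths, the function names, and the upper-casing transformation are placeholders, since the question does not say which transformation is needed:

```python
def transform_line(line):
    """Placeholder transformation (an assumption): trim and upper-case the line."""
    return line.strip().upper()


def process(sc, input_path, output_path):
    """Read text lines from input_path, transform them, and save them to output_path."""
    lines = sc.textFile(input_path)
    # saveAsTextFile writes one part file per partition into the output directory.
    lines.map(transform_line).saveAsTextFile(output_path)


# In the processor's code section, `sc` would come from the Spark session.
# Both paths are hypothetical; the input should match the PutHDFS directory:
# process(sc, "hdfs:///tmp/nifi_in", "hdfs:///tmp/nifi_out")
```

Writing the result with `saveAsTextFile` rather than printing it keeps the output on storage the cluster can reach; output printed inside `foreach`, as in the question, goes to the executors' stdout and would not show up in the NiFi flow.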