I am writing some self-contained integration tests around Apache Spark Streaming. I want to test that my code can ingest all kinds of edge cases in my simulated test data. When I was doing this with regular RDDs (not streaming), I could use my inline data and call "parallelize" on it to turn it into a Spark RDD. However, I can find no such method for creating DStreams. Ideally I would like to call some "push" function once in a while and have the tuple magically appear in my DStream. At the moment I'm doing this with Apache Kafka: I create a temp queue and write to it. But this seems like overkill. I'd much rather create the test DStream directly from my test data without having to use Kafka as a mediator.
In addition to Raphael's solution, note that you can choose to process either one batch at a time or everything available in the queue. You control this with the oneAtATime flag, an optional argument to queueStream, as shown below:
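A minimal sketch of how the flag changes behaviour; the helper name and its parameters are my own, not from the original answer:

```scala
import scala.collection.mutable

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.InputDStream

// Illustrative helper: build the test stream with the desired consumption mode.
def buildTestStream[T: scala.reflect.ClassTag](
    ssc: StreamingContext,
    rddQueue: mutable.Queue[RDD[T]],
    oneAtATime: Boolean): InputDStream[T] = {
  // oneAtATime = true  -> exactly one queued RDD is consumed per batch interval
  // oneAtATime = false -> everything queued so far is consumed in the next batch
  ssc.queueStream(rddQueue, oneAtATime)
}
```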
I found this base example: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/CustomReceiver.scala
The key here is calling the "store" method. Replace what you pass to store with whatever test data you want.
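A trimmed-down sketch of such a receiver, assuming you want to emit (String, Int) tuples; the class name, the generated data, and the sleep interval are all illustrative:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// A test receiver that generates tuples in a background thread and hands
// each one to Spark via store().
class TestTupleReceiver extends Receiver[(String, Int)](StorageLevel.MEMORY_ONLY) {

  override def onStart(): Unit = {
    new Thread("Test Tuple Source") {
      override def run(): Unit = {
        var i = 0
        while (!isStopped()) {
          // store() is the key call: it pushes one record into the DStream
          store(("test-key", i))
          i += 1
          Thread.sleep(100)
        }
      }
    }.start()
  }

  override def onStop(): Unit = {
    // The generating thread checks isStopped(), so nothing extra to do here
  }
}
```

You would then plug it in with ssc.receiverStream(new TestTupleReceiver) to get a DStream for your test.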
For testing purposes, you can create an input stream from a queue of RDDs. Pushing more RDDs into the queue simulates more events arriving in subsequent batch intervals.
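A self-contained sketch of that approach using StreamingContext.queueStream; the app name, batch interval, and sample records are placeholders:

```scala
import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueStreamTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamFromTestData")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Mutable queue the test can push RDDs into at any time
    val rddQueue = mutable.Queue[RDD[String]]()
    val lines = ssc.queueStream(rddQueue)

    // The stream under test: here just a trivial transformation
    lines.map(line => (line, line.length)).print()

    ssc.start()

    // Each RDD pushed here becomes the data for one batch interval
    rddQueue += ssc.sparkContext.parallelize(Seq("first event", "edge case: empty fields ,,"))
    Thread.sleep(1500)
    rddQueue += ssc.sparkContext.parallelize(Seq("second batch"))

    ssc.awaitTerminationOrTimeout(5000)
    ssc.stop()
  }
}
```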