How can DataFrameReader read from HTTP?

Posted 2019-02-25 11:20

Question:

My development environment:

  • IntelliJ
  • Maven
  • Scala 2.10.6
  • Windows 7 x64

Dependencies:

 <dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10 -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>2.2.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-mllib_2.10 -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_2.10</artifactId>
        <version>2.2.0</version>
        <scope>provided</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10 -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.10</artifactId>
        <version>2.2.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.scala-lang/scala-library -->
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>2.10.6</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.scala-lang/scala-reflect -->
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-reflect</artifactId>
        <version>2.10.6</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.4</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.4</version>
    </dependency>
</dependencies>

Problem:
I want to read a remote CSV file into a DataFrame.
I tried the following:

val weburl = "http://myurl.com/file.csv"
val tfile = spark.read.option("header","true").option("inferSchema","true").csv(weburl)

It returns the following error:

Exception in thread "main" java.io.IOException: No FileSystem for scheme: http

After searching the internet (including Stack Overflow), I then tried the following:

val content = scala.io.Source.fromURL(weburl).mkString
val list = content.split("\n")
// ... process the strings, cast the types, and separate each row to build the DataFrame format.

It works fine, but I think there must be a smarter way to load a CSV file from a web source.
Is there any way for DataFrameReader to read a CSV file over HTTP?

I think setting SparkContext.hadoopConfiguration is the key, so I tried many code snippets from the internet, but none of them worked, and I don't know how to configure it or what each line of code means.

The following is one of my attempts. It didn't work (same error message when accessing "http").

val sc = new SparkContext(spark_conf)
val spark = SparkSession.builder.appName("Test").getOrCreate()
val hconf = sc.hadoopConfiguration

hconf.set("fs.hdfs.impl", classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)
hconf.set("fs.file.impl", classOf[org.apache.hadoop.fs.LocalFileSystem].getName)

Is setting this the key, or not?
Or can DataFrameReader not read directly from a remote source? If so, how can I do it?
Do I need to import a special library for the http scheme?

What I want to know:

Is there any way for DataFrameReader to read an HTTP source
without parsing the data myself (as in "Best way to convert online csv to dataframe scala")?
I need to read CSV, which is a standard format, so I expect there is a more general way to read the data, like dataframereader.csv("local file").

I know this is a very basic question. I'm sorry for my limited understanding.

Answer 1:

As far as I know, it is not possible to read HTTP data directly. Probably the simplest thing you can do is to download the file using SparkFiles, but that will copy the data to each worker:

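import org.apache.spark.SparkFiles

// Download the file; Spark also makes a copy available on every node.
spark.sparkContext.addFile("http://myurl.com/file.csv")
// SparkFiles.get resolves the local path of the downloaded copy.
spark.read.csv(SparkFiles.get("file.csv"))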
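
If the file is small enough to fit in driver memory, another option is to fetch the text on the driver and hand the raw lines to Spark's own CSV parser, since DataFrameReader.csv also accepts a Dataset[String] as of Spark 2.2.0. The following is only a sketch under that assumption: the object name, the helper names fetchLines and linesToDataFrame, and the local[*] master are illustrative and not from the question. Because the text is split into lines, quoted fields that contain line breaks would probably not survive.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.io.Source

// Hypothetical helper object, for illustration only.
object HttpCsvViaDataset {

  // Fetch the whole response on the driver and split it into lines.
  def fetchLines(url: String): List[String] = {
    val source = Source.fromURL(url)
    try source.getLines().toList finally source.close()
  }

  // Let Spark's CSV parser handle the header, quoting and type inference.
  def linesToDataFrame(spark: SparkSession, lines: Seq[String]): DataFrame = {
    import spark.implicits._
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(lines.toDS())
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, only so the sketch runs standalone.
    val spark = SparkSession.builder
      .appName("HttpCsvViaDataset")
      .master("local[*]")
      .getOrCreate()

    val df = linesToDataFrame(spark, fetchLines("http://myurl.com/file.csv"))
    df.printSchema()
    spark.stop()
  }
}
```

This keeps the parsing inside DataFrameReader, but everything passes through the driver, so it does not scale to large files.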

Personally, I'd just download the file upfront and put it in distributed storage.
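
A minimal sketch of the "download upfront" step, using only the JDK (Java 7 or later). The object name DownloadCsv and the temp-file target are assumptions for illustration. Copying the result into HDFS or another shared store is left out because it depends on the cluster.

```scala
import java.net.URL
import java.nio.file.{Files, Path, StandardCopyOption}

// Hypothetical helper object, for illustration only.
object DownloadCsv {

  // Copy the bytes behind `url` to `target`, replacing any existing file.
  def download(url: String, target: Path): Path = {
    val in = new URL(url).openStream()
    try {
      Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING)
      target
    } finally in.close()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a temp file on the local disk is an acceptable target.
    val target = Files.createTempFile("file", ".csv")
    download("http://myurl.com/file.csv", target)
    println(s"Saved to ${target.toUri}")
  }
}
```

In local mode the saved file can then be read with spark.read.option("header","true").option("inferSchema","true").csv(target.toUri.toString). On a cluster it would first have to be placed somewhere every executor can reach.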