Error while mapping the data from CSV file to a Hive table

Published 2019-08-08 15:32

Question:

I am trying to load a dataframe into a Hive table by following the below steps:

  1. Read the source table and save the dataframe as a CSV file on HDFS

    val yearDF = spark.read.format("jdbc").option("url", connectionUrl).option("dbtable", s"(${execQuery}) as year2016").option("user", devUserName).option("password", devPassword).option("partitionColumn","header_id").option("lowerBound", 199199).option("upperBound", 284058).option("numPartitions",10).load()
    
  2. Order the columns as per my Hive table columns. My Hive table columns are present in a string in the following format:

    val hiveCols          = "col1:coldatatype|col2:coldatatype|col3:coldatatype|col4:coldatatype...col200:datatype"
    val schemaList        = hiveCols.split("\\|")
    val hiveColumnOrder   = schemaList.map(e => e.split("\\:")).map(e => e(0)).toSeq
    val finalDF           = yearDF.selectExpr(hiveColumnOrder:_*)
    

    The order of columns that I read in "execQuery" is the same as "hiveColumnOrder", and just to make sure of the order, I select the columns in yearDF once again using selectExpr.

  3. Save the dataframe as a CSV file on HDFS:

    finalDF.write.format("csv").save("hdfs://username/apps/hive/warehouse/database.db/lines_test_data56/")
    
  4. Once I save the dataframe, I take the same columns from "hiveCols" and prepare a DDL to create a Hive table on the same location, with values being comma separated, as given below (a rough sketch of how I assemble this DDL from "hiveCols" follows it):

create table if not exists schema.tablename(col1 coldatatype,col2 coldatatype,col3 coldatatype,col4 coldatatype...col200 datatype)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE

LOCATION 'hdfs://username/apps/hive/warehouse/database.db/lines_test_data56/';
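
For illustration, the DDL above is assembled from "hiveCols" roughly along these lines (only a sketch, with an abbreviated, placeholder column list):

    // Sketch: build the column definitions from the pipe-delimited hiveCols string
    val hiveCols   = "col1:int|col2:string|col3:double"   // abbreviated placeholder
    val columnDefs = hiveCols.split("\\|")
                             .map(_.split(":"))
                             .map { case Array(name, dtype) => s"$name $dtype" }
                             .mkString(",")

    val ddl =
      s"""create table if not exists schema.tablename($columnDefs)
         |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
         |STORED AS TEXTFILE
         |LOCATION 'hdfs://username/apps/hive/warehouse/database.db/lines_test_data56/'""".stripMargin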

After I load the dataframe into the table created, the problem I am facing is that when I query the table, I get improper output. For example, if I apply the below query on the dataframe before saving it as a file:

finalDF.createOrReplaceTempView("tmpTable")
select header_id,line_num,debit_rate,debit_rate_text,credit_rate,credit_rate_text,activity_amount,activity_amount_text,exchange_rate,exchange_rate_text,amount_cr,amount_cr_text from tmpTable where header_id=19924598 and line_num=2

I get the output properly. All the values are properly aligned to the columns:

[19924598,2,null,null,381761.40000000000000000000,381761.4,-381761.40000000000000000000,-381761.4,0.01489610000000000000,0.014896100000000,5686.76000000000000000000,5686.76]

But after saving the dataframe as a CSV file, creating a table on top of it (step 4) and applying the same query on the created table, I see the data is jumbled and improperly mapped to the columns:

select header_id,line_num,debit_rate,debit_rate_text,credit_rate,credit_rate_text,activity_amount,activity_amount_text,exchange_rate,exchange_rate_text,amount_cr,amount_cr_text from schema.tablename where header_id=19924598 and line_num=2

+------------+-----------+-------------+------------------+--------------+-------------------+------------------+-----------------------+----------------+---------------------+------------+-----------------+
| header_id  | line_num  | debit_rate  | debit_rate_text  | credit_rate  | credit_rate_text  | activity_amount  | activity_amount_text  | exchange_rate  | exchange_rate_text  | amount_cr  | amount_cr_text  |
+------------+-----------+-------------+------------------+--------------+-------------------+------------------+-----------------------+----------------+---------------------+------------+-----------------+
| 19924598   | 2         | NULL        |                  | 381761.4     |                   | 5686.76          | 5686.76               | NULL           | -5686.76            | NULL       |                 |
+------------+-----------+-------------+------------------+--------------+-------------------+------------------+-----------------------+----------------+---------------------+------------+-----------------+

So I tried a different approach, where I created the Hive table upfront and inserted data into it from the dataframe:

  • Running the DDL in step 4 above
  • finalDF.createOrReplaceTempView("tmpTable")
  • spark.sql("insert into schema.table select * from tmpTable")

Even this way fails if I run the aforementioned select query once the job is completed. I tried to refresh the table using refresh table schema.table and msck repair table schema.table just to see if there is any problem with the metadata, but nothing seems to work.
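
For completeness, the refresh/repair attempts were of this form (with schema.table standing in for the actual database and table name):

spark.sql("REFRESH TABLE schema.table")
spark.sql("MSCK REPAIR TABLE schema.table")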

Could anyone let me know what is causing this behaviour? Is there any problem with the way I am handling the data here?

Answer 1:

The code below was tested using Spark 2.3.2.


Instead of creating a Spark dataframe from the CSV file and then registering it as a Hive table, you can simply run SQL commands and create Hive tables directly on top of the CSV files. First, set up a Hive-enabled SparkSession:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .set("hive.server2.thrift.port", "10000")
  .set("spark.sql.hive.thriftServer.singleSession", "true")
  .set("spark.sql.warehouse.dir", "hdfs://PATH_FOR_HIVE_METADATA")
  .set("spark.sql.catalogImplementation", "hive")
  .setMaster("local[*]")
  .setAppName("ThriftServer")

val spark = SparkSession.builder()
  .config(conf)
  .enableHiveSupport()
  .getOrCreate()

Now, using the spark object, you can run SQL commands as a Hive user:

spark.sql("DROP DATABASE IF EXISTS my_db CASCADE")
spark.sql("create database if not exists my_db")
spark.sql("use my_db")

Using the following code, you can load all the CSV files in an HDFS directory (or you can give the path of exactly one CSV file):

spark.sql(
      "CREATE TABLE test_table(" +
        "id int," +
        "time_stamp bigint," +
        "user_name string) " +
        "ROW FORMAT DELIMITED " +
        "FIELDS TERMINATED BY ',' " +
        "STORED AS TEXTFILE " +
        "LOCATION 'hdfs://PATH_TO_CSV_Directory_OR_CSV_FILE' "
    )
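
If the LOCATION directory already contains matching CSV files, a quick sanity check (just an illustrative query) is:

spark.sql("SELECT * FROM test_table LIMIT 5").show()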

And at the end, register the Spark sqlContext object with the Hive ThriftServer:

import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

HiveThriftServer2.startWithContext(spark.sqlContext)

This will create a ThriftServer endpoint on port 10000.

INFO ThriftCLIService: Starting ThriftBinaryCLIService on port 10000 with 5...500 worker threads

Now you can run beeline and connect to the ThriftServer:

beeline> !connect jdbc:hive2://localhost:10000
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000: enter optional_username
Enter password for jdbc:hive2://localhost:10000: leave blank
Connected to: Spark SQL (version 2.3.2)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000>

Then check whether the table test_table has been created under the my_db database:

0: jdbc:hive2://localhost:10000> use my_db;
0: jdbc:hive2://localhost:10000> show tables ;
+-----------+-----------------------+--------------+--+
| database  |       tableName       | isTemporary  |
+-----------+-----------------------+--------------+--+
| my_db     | test_table            | false        |
+-----------+-----------------------+--------------+--+

Also, you can create any other Hive table (or run any HiveQL command) using the ThriftServer JDBC endpoint.
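
For example (the copied table name here is purely illustrative), from the same beeline session:

0: jdbc:hive2://localhost:10000> create table my_db.test_table_copy as select * from my_db.test_table;
0: jdbc:hive2://localhost:10000> select count(*) from my_db.test_table_copy;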

Here are the needed dependencies:

val sparkVersion = "2.3.2"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-hive" % sparkVersion,
  "org.apache.spark" %% "spark-hive-thriftserver" % sparkVersion,
  "org.apache.hadoop" % "hadoop-hdfs" % "2.8.3",
  "org.apache.hadoop" % "hadoop-common" % "2.8.3"
)


Answer 2:

I used the ROW FORMAT SERDE org.apache.hadoop.hive.serde2.OpenCSVSerde in the Hive DDL. This SerDe also uses ',' as the default separator character, so I did not have to specify any other delimiter.
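
A minimal sketch of such a DDL, reusing the column names and location from the question (note that OpenCSVSerde exposes every column as STRING, so numeric columns may need casting when queried):

create table if not exists schema.tablename(col1 string,col2 string,col3 string,col4 string...col200 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION 'hdfs://username/apps/hive/warehouse/database.db/lines_test_data56/';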