Is there any option in sqoop to import data from RDMS and store it as ORC file format in HDFS?
Alternatives tried: imported as text format and used a temp table to read input as text file and write to hdfs as orc in hive
Is there any option in sqoop to import data from RDMS and store it as ORC file format in HDFS?
Alternatives tried: imported as text format and used a temp table to read input as text file and write to hdfs as orc in hive
At least in Sqoop 1.4.5 there exists hcatalog integration that support orc file format (amongst others).
For example you have the option
--hcatalog-storage-stanza
which can be set to
stored as orc tblproperties ("orc.compress"="SNAPPY")
Example:
sqoop import
--connect jdbc:postgresql://foobar:5432/my_db
--driver org.postgresql.Driver
--connection-manager org.apache.sqoop.manager.GenericJdbcManager
--username foo
--password-file hdfs:///user/foobar/foo.txt
--table fact
--hcatalog-home /usr/hdp/current/hive-webhcat
--hcatalog-database my_hcat_db
--hcatalog-table fact
--create-hcatalog-table
--hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")'
Sqoop import supports only below formats.
--as-avrodatafile Imports data to Avro Data Files
--as-sequencefile Imports data to SequenceFiles
--as-textfile Imports data as plain text (default)
--as-parquetfile Imports data as parquet file (from sqoop 1.4.6 version)
In current version of sqoop available, it is not possible to import data from RDBS to HDFS in ORC format in a single shoot command. This is something known issue in sqoop. Reference link for this issue raised: https://issues.apache.org/jira/browse/SQOOP-2192
I think the only alternative available for now, is the same as you mentioned. I also came across the similar use case, and have used the alternative two step approach.
Currently there is no option to import the rdms table data directly as ORC file using sqoop. We can achieve the same using two steps.
Example: Step 1: Import the table data as a text file.
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
--table orders \
--target-dir /user/cloudera/text \
--as-textfile
Step 2: Use spark-shell on command prompt to get scala REPL command shell.
scala> val sqlHiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlHiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@638a9d61
scala> val textDF = sqlHiveContext.read.text("/user/cloudera/text")
textDF: org.apache.spark.sql.DataFrame = [value: string]
scala> textDF.write.orc("/user/cloudera/orc/")
Step 3: Check the output.
[root@quickstart exercises]# hadoop fs -ls /user/cloudera/orc/
Found 5 items
-rw-r--r-- 1 cloudera cloudera 0 2018-02-13 05:59 /user/cloudera/orc/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 153598 2018-02-13 05:59 /user/cloudera/orc/part-r-00000-24f75a77-4dd9-44b1-9e25-6692740360d5.orc
-rw-r--r-- 1 cloudera cloudera 153466 2018-02-13 05:59 /user/cloudera/orc/part-r-00001-24f75a77-4dd9-44b1-9e25-6692740360d5.orc
-rw-r--r-- 1 cloudera cloudera 153725 2018-02-13 05:59 /user/cloudera/orc/part-r-00002-24f75a77-4dd9-44b1-9e25-6692740360d5.orc
-rw-r--r-- 1 cloudera cloudera 160907 2018-02-13 05:59 /user/cloudera/orc/part-r-00003-24f75a77-4dd9-44b1-9e25-6692740360d5.orc