I am new to Hive/Hadoop. I have read documentation and watched videos on how Hive, HDFS and Hadoop work internally, but I still could not understand a few things right off the bat. We know that Hive data is stored as files in HDFS and the table structure (schema) is stored in the metastore.
Since Hive is schema-on-read, it is only at query execution time that the data and the schema are combined to produce the result. Please confirm whether my understanding of this statement is correct.
Statement 1 talks about this integration, but how does the integration actually happen? The files (actual data) stored in HDFS do not carry a schema, right? So how do MapReduce/Hadoop/Hive know that "this particular piece of data in the file" belongs to "this particular column of the table"? Wouldn't there be a data mismatch?
I would imagine Hive data files look something like this:
students.txt
------------
1 abc m@gmail.com
2 xyz@ymail.com
The above file does not store the schema, and for the student with s_id 2 the name is not stored. How are such cases handled when the query is executed? I don't think xyz@ymail.com would end up under the student_name field, but I would still like to know how the integration happens.
Your understanding that "Hive data is stored as files in HDFS and table structure (schema) is stored in the metastore" is correct. In addition to the schema, the metastore also holds the HDFS directory where the table data is stored, and this path information is used by queries at execution time.
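For instance, DESCRIBE FORMATTED shows both the column definitions and the warehouse location the metastore keeps for a table (the table name and path below are only illustrative):

DESCRIBE FORMATTED default.students;
-- Among the "Detailed Table Information" rows you will see something like:
-- Location:  hdfs://<namenode>/user/hive/warehouse/students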
Your statements and my answers:
- Since Hive is schema-on-read, it is only at query execution time that the data and the schema are combined to produce the result. Please confirm whether my understanding of this statement is correct.
Ans: CORRECT
- Statement 1 talks about this integration, but how does the integration actually happen? The files (actual data) stored in HDFS do not carry a schema, right? So how do MapReduce/Hadoop/Hive know that "this particular piece of data in the file" belongs to "this particular column of the table"? Wouldn't there be a data mismatch?
Ans:
Text files stored on HDFS as part of a table do not carry any structure or column names, only the data. However, when the table is created we must clearly specify the columns and how the data is laid out in the files. For example, a table with 2 columns over comma-delimited data would be created with a query like the one below:
CREATE TABLE default.column_test (
  name  STRING,
  email STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
The data file, i.e. the text file present in the table's HDFS path, should then hold data in the following format:
alpha,alpha@email.com
beta,beta@email.com
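As a sketch (the local path /tmp/column_test.txt is just an example), such a file can be loaded into the table with LOAD DATA; alternatively, copying the file straight into the table's HDFS directory with hdfs dfs -put works too, because Hive simply reads whatever files sit under that path at query time:

LOAD DATA LOCAL INPATH '/tmp/column_test.txt'
INTO TABLE default.column_test;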
When the data is queried from this table with a SELECT statement, the query is first compiled and then executed against the data in the HDFS path that is looked up from the Hive metastore. At that point each line of the file is split on the declared delimiter and the resulting fields are mapped, by position, onto the columns in the schema; this is where the schema-on-read "integration" happens.
SELECT * FROM column_test;

  column_test.name    column_test.email
1 alpha               alpha@email.com
2 beta                beta@email.com
If a line in the data file does not have a value for the name column (like the first line below),
alpha@email.com
beta,beta@email.com
then the SELECT query will treat 'alpha@email.com' as the value of the "name" column and return NULL for the "email" column of the first record. The output will look like this:
SELECT * FROM column_test;

  column_test.name    column_test.email
1 alpha@email.com     NULL
2 beta                beta@email.com
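As a small follow-up sketch, you can surface the records where schema-on-read produced NULLs (i.e. lines that were missing a delimited field) with a query like:

SELECT * FROM default.column_test
WHERE email IS NULL;
-- Returns the first record above, whose line held only one comma-delimited field.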
Hope that helps!