Checksum verification in Hadoop

Posted 2020-02-25 08:13

Do we need to verify checksums after moving files from a Linux server to Hadoop (HDFS) through WebHDFS?

I would like to make sure the files on HDFS are not corrupted after they are copied. But is checking the checksum necessary?

I read that the client computes a checksum before data is written to HDFS.

Can somebody help me understand how to make sure that the source file on the Linux system is the same as the ingested file on HDFS when using WebHDFS?
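
One way to check this end to end, sketched here under assumptions: read the file back through the WebHDFS OPEN operation and compare its MD5 digest with the digest of the local source file. The NameNode host, the port 9870, and both paths in the commented usage lines are placeholders rather than values from the question, and md5_of_stream and compare_digests are hypothetical helper names. Note that this reads the whole file back over the network.

```shell
#!/bin/sh
# Print only the hex MD5 digest of the data arriving on stdin.
md5_of_stream() {
    md5sum | cut -d ' ' -f 1
}

# Compare two digests; print MATCH or MISMATCH and set the exit status.
# An empty digest is never treated as a match.
compare_digests() {
    if [ -n "$1" ] && [ "$1" = "$2" ]; then
        echo "MATCH"
    else
        echo "MISMATCH"
        return 1
    fi
}

# Example usage (host, port, and paths are placeholders):
#   local_sum=$(md5_of_stream < /data/file.txt)
#   remote_sum=$(curl -sfL "http://namenode.example.com:9870/webhdfs/v1/project1/file.txt?op=OPEN" | md5_of_stream)
#   compare_digests "$local_sum" "$remote_sum"
```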

5 answers
我欲成王,谁敢阻挡
#2 · 2020-02-25 08:52

If your goal is to compare two files residing on HDFS, I would not use "hdfs dfs -checksum URI", because in my case it generates different checksums for files with identical content.

In the example below I compare two files with the same content in different locations.

The old-school md5sum method returns the same checksum for both:

$ hdfs dfs -cat /project1/file.txt | md5sum
b9fdea463b1ce46fabc2958fc5f7644a  -

$ hdfs dfs -cat /project2/file.txt | md5sum
b9fdea463b1ce46fabc2958fc5f7644a  -

However, the checksum generated by HDFS differs for files with the same content:

$ hdfs dfs -checksum /project1/file.txt
0000020000000000000000003e50be59553b2ddaf401c575f8df6914

$ hdfs dfs -checksum /project2/file.txt
0000020000000000000000001952d653ccba138f0c4cd4209fbf8e2e

This is a bit puzzling, as I would expect identical content to produce identical checksums.
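
A possible explanation, offered as an assumption rather than a diagnosis of this case: by default the value printed by "hdfs dfs -checksum" is not an MD5 of the file content but an MD5 over per-block MD5s of chunk CRCs, so it can vary with how the file was written (block size, bytes per checksum, CRC type). The hex string appears to consist of a 4-byte bytes-per-CRC value, an 8-byte CRCs-per-block value, and then the 16-byte MD5. The sketch below splits a checksum string on that assumed layout; parse_hdfs_checksum is a hypothetical helper name.

```shell
#!/bin/sh
# Split a 56-hex-digit HDFS checksum into its assumed parts:
#   digits 1-8   bytes per CRC      (4-byte integer)
#   digits 9-24  CRCs per block     (8-byte integer)
#   digits 25-56 MD5 over the block-level MD5s
parse_hdfs_checksum() {
    hex=$1
    if [ "${#hex}" -ne 56 ]; then
        echo "unexpected checksum length: ${#hex}" >&2
        return 1
    fi
    # printf '%d' converts the 0x-prefixed hex fields to decimal.
    bytes_per_crc=$(printf '%d' "0x$(printf '%s' "$hex" | cut -c1-8)")
    crc_per_block=$(printf '%d' "0x$(printf '%s' "$hex" | cut -c9-24)")
    md5=$(printf '%s' "$hex" | cut -c25-56)
    echo "bytesPerCRC=$bytes_per_crc crcPerBlock=$crc_per_block md5=$md5"
}

# Example usage, with the checksum value printed by hdfs dfs -checksum:
#   parse_hdfs_checksum "<56-hex-digit checksum>"
```

Both checksums above start with the same run of header digits, so under this reading the difference lies in the MD5 part, which is not expected to equal the md5sum of the content.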

▲ chillily
#3 · 2020-02-25 08:58

If you are doing this check via the API:

import org.apache.hadoop.fs._
import org.apache.hadoop.io._

Option 1: for the value b9fdea463b1ce46fabc2958fc5f7644a (the MD5 of the file content)

val md5:String = MD5Hash.digest(FileSystem.get(hadoopConfiguration).open(new Path("/project1/file.txt"))).toString

Option 2: for the value 3e50be59553b2ddaf401c575f8df6914 (the MD5 part of the HDFS file checksum)

val md5:String = FileSystem.get(hadoopConfiguration).getFileChecksum(new Path("/project1/file.txt")).toString.split(":")(1)

The checksum's toString has the form "algorithm:md5", so the hex value is the part after the colon.
【Aperson】
#4 · 2020-02-25 09:02

I wrote a library that calculates the checksum of a local file the same way Hadoop does for files on HDFS.

You can then compare the two checksums to cross-check: https://github.com/srch07/HDFSChecksumForLocalfile

Anthone
#5 · 2020-02-25 09:03

The checksum of a file can be calculated with the hadoop fs command.

Usage: hadoop fs -checksum URI

Returns the checksum information of a file.

Example:

hadoop fs -checksum hdfs://nn1.example.com/file1
hadoop fs -checksum file:///path/in/linux/file1

Refer to the Hadoop documentation for more details.

So if you want to compare file1 on both Linux and HDFS, you can use the above utility.
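
To automate that comparison, here is a small sketch that takes the last whitespace-separated field of each "hadoop fs -checksum" output line and compares the two. It assumes the checksum hex is the last field on the line and that a file system without checksum support prints NONE there, which appears to be the default for local file:// paths. In that case the two outputs are not comparable and a content hash such as md5sum is the fallback. checksum_field and same_checksum are hypothetical helper names.

```shell
#!/bin/sh
# Extract the last whitespace-separated field from one line of
# `hadoop fs -checksum` output (assumed to be the checksum hex, or NONE).
checksum_field() {
    printf '%s\n' "$1" | awk '{ print $NF }'
}

# Compare two output lines. Prints SAME, DIFFERENT, or UNAVAILABLE.
same_checksum() {
    a=$(checksum_field "$1")
    b=$(checksum_field "$2")
    if [ -z "$a" ] || [ -z "$b" ] || [ "$a" = "NONE" ] || [ "$b" = "NONE" ]; then
        echo "UNAVAILABLE"
        return 2
    elif [ "$a" = "$b" ]; then
        echo "SAME"
    else
        echo "DIFFERENT"
        return 1
    fi
}

# Example usage with the two commands from the answer above:
#   same_checksum "$(hadoop fs -checksum hdfs://nn1.example.com/file1)" \
#                 "$(hadoop fs -checksum file:///path/in/linux/file1)"
```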

不美不萌又怎样
#6 · 2020-02-25 09:07

It performs a CRC check. For each file it creates a .crc file to make sure there is no corruption.
