Checksum verification in Hadoop

Do we need to verify checksum after we move files to Hadoop (HDFS) from a Linux server through a Webhdfs ?

I would like to make sure the files on the HDFS have no corruption after they are copied. But is checking checksum necessary?

I read client does checksum before data is written to HDFS

Can somebody help me to understand how can I make sure that source file on Linux system is same as ingested file on Hdfs using webhdfs.

标签： hadoop hdfs checksum

5条回答

我欲成王，谁敢阻挡

2楼-- · 2020-02-25 08:52

If your goal is to compare two files residing on HDFS, I would not use "hdfs dfs -checksum URI" as in my case it generates different checksums for files with identical content.

In the below example I am comparing two files with the same content in different locations:

Old-school md5sum method returns the same checksum:

$ hdfs dfs -cat /project1/file.txt | md5sum
b9fdea463b1ce46fabc2958fc5f7644a  -

$ hdfs dfs -cat /project2/file.txt | md5sum
b9fdea463b1ce46fabc2958fc5f7644a  -

However, checksum generated on the HDFS is different for files with the same content:

$ hdfs dfs -checksum /project1/file.txt
0000020000000000000000003e50be59553b2ddaf401c575f8df6914

$ hdfs dfs -checksum /project2/file.txt
0000020000000000000000001952d653ccba138f0c4cd4209fbf8e2e

A bit puzzling as I would expect identical checksum to be generated against the identical content.

0人赞添加讨论(0) 举报

▲ chillily

3楼-- · 2020-02-25 08:58

If you are doing this check via API

import org.apache.hadoop.fs._
import org.apache.hadoop.io._

Option 1: for the value b9fdea463b1ce46fabc2958fc5f7644a

val md5:String = MD5Hash.digest(FileSystem.get(hadoopConfiguration).open(new Path("/project1/file.txt"))).toString

Option 2: for the value 3e50be59553b2ddaf401c575f8df6914

val md5:String = FileSystem.get(hadoopConfiguration).getFileChecksum(new Path("/project1/file.txt"))).toString.split(":")(0)

0人赞添加讨论(0) 举报

【Aperson】

4楼-- · 2020-02-25 09:02

I wrote a library with which you can calculate the checksum of local file, just the way hadoop does it on hdfs files.

So, you can compare the checksum to cross check. https://github.com/srch07/HDFSChecksumForLocalfile

0人赞添加讨论(0) 举报

Anthone

5楼-- · 2020-02-25 09:03

Checksum for a file can be calculated using hadoop fs command.

Usage: hadoop fs -checksum URI

Returns the checksum information of a file.

Example:

hadoop fs -checksum hdfs://nn1.example.com/file1 hadoop fs -checksum file:///path/in/linux/file1

Refer : Hadoop documentation for more details

So if you want to comapre file1 in both linux and hdfs you can use above utility.

0人赞添加讨论(0) 举报

不美不萌又怎样

6楼-- · 2020-02-25 09:07

It does crc check. For each and everyfile it create .crc to make sure there is no corruption.

0人赞添加讨论(0) 举报

Checksum verification in Hadoop

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间