可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

Do we need to verify checksum after we move files to Hadoop (HDFS) from a Linux server through a Webhdfs ?

I would like to make sure the files on the HDFS have no corruption after they are copied. But is checking checksum necessary?

I read client does checksum before data is written to HDFS

Can somebody help me to understand how can I make sure that source file on Linux system is same as ingested file on Hdfs using webhdfs.

回答1:

If your goal is to compare two files residing on HDFS, I would not use "hdfs dfs -checksum URI" as in my case it generates different checksums for files with identical content.

In the below example I am comparing two files with the same content in different locations:

Old-school md5sum method returns the same checksum:

$ hdfs dfs -cat /project1/file.txt | md5sum
b9fdea463b1ce46fabc2958fc5f7644a  -

$ hdfs dfs -cat /project2/file.txt | md5sum
b9fdea463b1ce46fabc2958fc5f7644a  -

However, checksum generated on the HDFS is different for files with the same content:

$ hdfs dfs -checksum /project1/file.txt
0000020000000000000000003e50be59553b2ddaf401c575f8df6914

$ hdfs dfs -checksum /project2/file.txt
0000020000000000000000001952d653ccba138f0c4cd4209fbf8e2e

A bit puzzling as I would expect identical checksum to be generated against the identical content.

回答2:

Checksum for a file can be calculated using hadoop fs command.

Usage: hadoop fs -checksum URI

Returns the checksum information of a file.

Example:

hadoop fs -checksum hdfs://nn1.example.com/file1 hadoop fs -checksum file:///path/in/linux/file1

Refer : Hadoop documentation for more details

So if you want to comapre file1 in both linux and hdfs you can use above utility.

回答3:

I wrote a library with which you can calculate the checksum of local file, just the way hadoop does it on hdfs files.

So, you can compare the checksum to cross check. https://github.com/srch07/HDFSChecksumForLocalfile

回答4:

If you are doing this check via API

import org.apache.hadoop.fs._
import org.apache.hadoop.io._

Option 1: for the value b9fdea463b1ce46fabc2958fc5f7644a

val md5:String = MD5Hash.digest(FileSystem.get(hadoopConfiguration).open(new Path("/project1/file.txt"))).toString

Option 2: for the value 3e50be59553b2ddaf401c575f8df6914

val md5:String = FileSystem.get(hadoopConfiguration).getFileChecksum(new Path("/project1/file.txt"))).toString.split(":")(0)

回答5:

It does crc check. For each and everyfile it create .crc to make sure there is no corruption.

Checksum verification in Hadoop

问题:

回答1:

回答2:

回答3:

回答4:

回答5:

收藏的人(0)

Checksum verification in Hadoop

问题:

回答1:

回答2:

回答3:

回答4:

回答5:

收藏的人(0)

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮