Question:

I have developed code that reads very large files from FTP and writes them to the local machine using Java. The code that does this is as follows. It is part of next(Text key, Text value) inside the RecordReader of the CustomInputFormat.

if (!processed) {
    System.out.println("in processed");
    in = fs.open(file);
    processed = true;
}
while (bytesRead <= fileSize) {

    byte buf[] = new byte[1024];

    try {
        in.read(buf);
        in.skip(1024);
        bytesRead += 1024;
        long diff = fileSize - bytesRead;
        if (diff < 1024) {
            break;
        }
        value.set(buf, 0, 1024); // This is where the value of the record is set and it goes to the mapper.
    } catch (Exception e) {
        e.printStackTrace();
    }

}
if (diff < 1024) {
    int difference = (int) (fileSize - bytesRead);

    byte buf[] = new byte[difference];
    in.read(buf);
    bytesRead += difference;
}

System.out.println("closing stream");
in.close();

After the write is over, I see that the transfer is done and the size of the file at the destination is the same as at the source. But I am unable to open the file, and the editor reports this error:

gedit has not been able to detect the character coding.
Please check that you are not trying to open a binary file.
Select a character coding from the menu and try again.

I believe the question "Java upload jpg using JakartaFtpWrapper - makes the file unreadable" is related to mine, but I couldn't make sense of it.

Any pointers?

Answer 1:

Your copying code is complete and utter 100% A grade nonsense. The canonical way to copy a stream in Java is as follows:

int count;
byte[] buffer = new byte[8192]; // or more if you like
while ((count = in.read(buffer)) > 0)
{
  out.write(buffer, 0, count);
}

Get rid of all the other fluff. It is just wasting time and space and clearly damaging your data in transit.
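
For anyone who wants to try the loop above in isolation, here is a minimal self-contained sketch. The class name StreamCopy and the in-memory test data are my own assumptions for illustration; in the real job the input would be the stream returned by fs.open(file) and the output whatever you are writing to.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;

// Hypothetical helper class, not part of the original code.
class StreamCopy {
    // Copies everything from in to out and returns the number of bytes copied.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int count;
        // read() may return fewer bytes than the buffer holds, and -1 at end of stream,
        // so only the first 'count' bytes of the buffer are valid on each pass.
        while ((count = in.read(buffer)) > 0) {
            out.write(buffer, 0, count);
            total += count;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for the FTP file (assumption for the demo).
        byte[] source = new byte[20_000];
        for (int i = 0; i < source.length; i++) {
            source[i] = (byte) i;
        }
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(source), sink);
        System.out.println("copied " + copied + " bytes, identical: "
                + Arrays.equals(source, sink.toByteArray()));
    }
}
```

The point of the design is that the loop trusts only the value returned by read, never a hand-maintained byte counter.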

Answer 2:

I see many problems with your code. It is a strange way to read a whole file. For example:

in.read(buf);
in.skip(1024);
bytesRead+=1024;

is wrong. in.read(buf) returns the number of bytes read and advances the stream position by that many bytes. So you don't need to skip. That's an error: read has already positioned the stream, so the skip throws away data you never read. You also ignore the return value of read, which may be less than the buffer size.
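
To make the effect of read followed by skip concrete, here is a rough sketch using an in-memory stream instead of FTP. The class and method names are made up for the demo, and it only imitates the read/skip pattern from the question, not the whole RecordReader.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical demo class, not part of the original code.
class ReadThenSkipDemo {
    // Imitates the question's pattern: read one chunk, then skip a chunk of the same size.
    static byte[] readThenSkip(InputStream in, int chunk) throws IOException {
        ByteArrayOutputStream kept = new ByteArrayOutputStream();
        byte[] buf = new byte[chunk];
        int n;
        while ((n = in.read(buf)) > 0) {
            kept.write(buf, 0, n);
            in.skip(chunk); // discards bytes that were never read
        }
        return kept.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] source = new byte[4096]; // stand-in for the remote file
        byte[] result = readThenSkip(new ByteArrayInputStream(source), 1024);
        System.out.println("source: " + source.length + " bytes, kept: " + result.length + " bytes");
    }
}
```

With this in-memory source, every second chunk is dropped, so only half of the input survives.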

Verify the checksums of the two files (using MD5 or similar) to be sure they are the same. I'm pretty sure that neither the checksums nor the file sizes match.
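
If you have no md5 tool at hand, something like the following sketch computes the checksum with the JDK's MessageDigest. The class name Md5Check and the two command-line paths are assumptions for illustration; it also assumes you can get a reference copy of the source file (or its checksum) by some other means, since the original lives on the FTP server.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class, not part of the original code.
class Md5Check {
    // Streams the input through MD5 and returns the digest as lowercase hex.
    static String md5Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[8192];
        int count;
        while ((count = in.read(buffer)) > 0) {
            md.update(buffer, 0, count);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.println("usage: Md5Check <reference-file> <downloaded-file>");
            return;
        }
        // Both paths are supplied by the caller; nothing here is specific to FTP or Hadoop.
        try (InputStream a = Files.newInputStream(Paths.get(args[0]));
             InputStream b = Files.newInputStream(Paths.get(args[1]))) {
            System.out.println(md5Hex(a).equals(md5Hex(b)) ? "checksums match" : "checksums differ");
        }
    }
}
```

MD5 is fine for spotting accidental corruption like this; it is not meant as a security check.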

You should use Apache Commons IO for file processing. Otherwise, look at the Oracle docs on file processing.