Perl Japanese to English filename replacement

I put together a Perl script that replaces Japanese file names with English file names. It works, but there are still a couple of things that I don’t quite understand.

I have the following configuration.

Client OS:

Windows XP Japan

Notepad++, installed

Server:

Red Hat Enterprise Linux Server release 6.2

Perl v5.10.1

VIM : VIM version 7.2.411

Xterm : ASTEC-X version 6.0

CSH: tcsh 6.17.00 (Astron)

The source files are Japanese .csv files generated on Windows. I saw posts about using utf8 and encoding conversion in Perl, and I would like to understand why I didn’t need anything mentioned in those other threads.

Here is my script, which worked. My questions are below.

#!/usr/bin/perl
my $work_dir = "/nas1_home4/fsomeguy/someplace";
opendir(DIR, $work_dir) or die "Cannot open directory: $!";
my @files = readdir(DIR);   # file names come back as raw bytes
closedir(DIR);
foreach (@files) 
{
    my $original_file = $_; 
    s/機/-machine_/; # replace 機 with -machine_ (byte-for-byte match, no decoding)
    my $new_file = $_;
    if ($new_file ne $original_file)
    {
        print "Rename " . $original_file . " to " . $new_file . "\n";
        rename("${work_dir}/${original_file}", "${work_dir}/${new_file}") or  print "Warning: rename failed because: $!\n";
    }
}

Questions:

1) Why isn’t use utf8 required in this sample? In what kinds of cases would I need it? use utf8 was discussed in "use utf8 gives me 'Wide character in print'", but if I add use utf8, this script stops working.

2) Why isn’t encoding manipulation required in this sample?
I actually wrote the script on Windows using Notepad++, pasting in the Japanese characters from Windows XP Japan’s Explorer. In Xterm and VIM, the characters show up as garbage. But I didn’t have to do any encoding manipulation either, which was discussed in "How can I convert japanese characters to unicode in Perl?".

Thanks.

UPDATE 1

Testing a simple localization sample in Perl for filename and file text replacement in Japanese

In Windows XP, copy the 南 character from a .csv data file to the clipboard, then use it as both the file name (i.e. 南.txt) and the file content (南). In Notepad++, reading the file with the encoding set to UTF-8 shows x93xEC; reading it as SHIFT_JIS displays 南.
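To see why Notepad++ shows x93xEC, it can help to look at the bytes directly. The following is only a minimal sketch, assuming the Encode module's "shiftjis" and "UTF-8" encodings; the helper hex_bytes is a hypothetical name introduced for illustration. It encodes 南 (U+5357) both ways and prints the resulting bytes in hex.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+5357 is the character 南, written as an escape so that this script
# contains only ASCII and its own file encoding does not matter.
my $south = "\x{5357}";

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

my $sjis_hex = hex_bytes(encode('shiftjis', $south));
my $utf8_hex = hex_bytes(encode('UTF-8', $south));

print "Shift_JIS: $sjis_hex\n";   # 93 EC
print "UTF-8:     $utf8_hex\n";   # E5 8D 97
```

The Shift_JIS form is two bytes and the UTF-8 form is three, so a file holding 93 EC is not valid UTF-8, which is consistent with what Notepad++ displayed.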

Script:

Use the following Perl script, south.pl, which will be run on a Linux server with Perl 5.10.

#!/usr/bin/perl
use feature qw(say);

use strict;
use warnings;
use utf8;
use Encode qw(decode encode);

my $user_dir="/usr/frank";
my $work_dir = "${user_dir}/test_south";

# forward declaration of fileProcess (defined further down)
sub fileProcess;

opendir(DIR, ${work_dir}) or die "Cannot open directory " . ${work_dir};

# readdir OPTION 1 - shift_jis
#my @files = map { Encode::decode("shift_jis", $_); } readdir DIR; # Note filename could not be decoded as shift_jis
#binmode(STDOUT,":encoding(shift_jis)");                    

# readdir OPTION 2 - utf8
my @files = map { Encode::decode("utf8", $_); } readdir DIR; # Note filename could be decoded as utf8
binmode(STDOUT,":encoding(utf8)");                           # setting display to output utf8

say @files;                                 

# translate the file names, then process the file content (both subs use the file-scoped @files and $work_dir)
fileNameTranslate();
fileProcess();

closedir(DIR);

exit;

sub fileNameTranslate
{

    foreach (@files) 
    {
        my $original_file = $_; 
        #print "original_file: " . "$original_file" . "\n";     
        s/南/south/;     

        my $new_file = $_;
        # print "new_file: " . "$_" . "\n";

        if ($new_file ne $original_file)
        {
            print "Rename " . $original_file . " to \n\t" . $new_file . "\n";
            rename("${work_dir}/${original_file}", "${work_dir}/${new_file}") or print "Warning: rename failed because: $!\n";
        }
    }
}

sub fileProcess
{

    #   file process OPTION 3, open file as shift_jis, the search and replace would work
    #   open (IN1,  "<:encoding(shift_jis)", "${work_dir}/south.txt") or die "Error: south.txt\n";
    #   open (OUT1, "+>:encoding(shift_jis)" , "${work_dir}/south1.txt") or die "Error: south1.txt\n";  

    #   file process OPTION 4, open file as utf8, the search and replace would not work
    open (IN1,  "<:encoding(utf8)", "${work_dir}/south.txt") or die "Error: south.txt\n";
    open (OUT1, "+>:encoding(utf8)" , "${work_dir}/south1.txt") or die "Error: south1.txt\n";   

    while (<IN1>)
    {
        print $_ . "\n";
        chomp;

        s/南/south/g;


        print OUT1 "$_\n";
    }

    close IN1;
    close OUT1; 
}

Result:

(BAD) Options 1 and 3 uncommented (Options 2 and 4 commented out). Setup: readdir encoding SHIFT_JIS; file open encoding SHIFT_JIS. Result: file name replacement failed. Error: utf8 "\x93" does not map to Unicode at .//south.pl line 68. \x93

(BAD) Options 2 and 4 uncommented (Options 1 and 3 commented out). Setup: readdir encoding utf8; file open encoding utf8. Result: file name replacement worked and south.txt was generated, but the content replacement in south1.txt failed; the file has the content \x93 (). Error: "\x{fffd}" does not map to shiftjis at .//south.pl line 25. ... -Ao?= (Bx{fffd}.txt

(GOOD) Options 2 and 3 uncommented (Options 1 and 4 commented out). Setup: readdir encoding utf8; file open encoding SHIFT_JIS. Result: file name replacement worked and south.txt was generated; the content replacement in south1.txt also worked, and the file has the content south.

Conclusion:

I had to use two different encodings for this example to work properly: utf8 for readdir, and SHIFT_JIS for file processing, since the content of the .csv file was SHIFT_JIS encoded.
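The effect of picking the right or wrong encoding can be reproduced without any files. This is a rough sketch, assuming the two bytes \x93\xEC from the test above and the Encode module's default (non-strict) error handling; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes of 南 as stored in the Shift_JIS encoded file content.
my $sjis_bytes = "\x93\xEC";

# Decoding with the matching encoding yields the single character U+5357.
my $right = decode('shiftjis', $sjis_bytes);

# The same bytes are not valid UTF-8. With the default CHECK value,
# Encode substitutes U+FFFD for the malformed input instead of dying.
my $wrong = decode('UTF-8', $sjis_bytes);

# Only the correctly decoded string can match a pattern for 南.
my $right_matches = ($right =~ /\x{5357}/) ? 1 : 0;
my $wrong_matches = ($wrong =~ /\x{5357}/) ? 1 : 0;
my $has_fffd      = ($wrong =~ /\x{FFFD}/) ? 1 : 0;

printf "as Shift_JIS: U+%04X, matches: %d\n", ord($right), $right_matches;
printf "as UTF-8: matches: %d, contains U+FFFD: %d\n", $wrong_matches, $has_fffd;
```

So each input has to be decoded with the encoding it was actually written in; here that appears to be UTF-8 for the file names on the Linux server and Shift_JIS for the file content created on Windows.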

Tags: regex perl encoding utf-8

1 Answer

Juvenile、少年°

Reply #2 · 2019-07-29 22:17

Your script is completely Unicode-unaware: it treats every string as a sequence of bytes. Fortunately, the bytes that encode the file names on disk are identical to the bytes that encode the Japanese characters in your source file, so the substitution matches. If you add use utf8, Perl decodes the Japanese characters in your script into characters, but not the names coming from the file system, so there is no match.
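The byte-versus-character mismatch can be sketched as follows. This is only an illustration, not the asker's script: it assumes the file names on the server are stored as UTF-8 bytes, and the file name and variable names are hypothetical. Escapes are used instead of literal Japanese so that the sketch does not depend on how the source file is saved.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical file name as readdir would return it: raw UTF-8 bytes.
# \xE6\xA9\x9F is the UTF-8 encoding of 機 (U+6A5F).
my $name_bytes = "\xE6\xA9\x9F01.csv";

# Without "use utf8", a 機 literal in a UTF-8 source file is these 3 bytes.
my $pattern_bytes = "\xE6\xA9\x9F";
# With "use utf8", the same literal becomes one character.
my $pattern_char = "\x{6A5F}";

# Bytes against bytes match; a character against bytes does not.
my $bytes_vs_bytes = ($name_bytes =~ /$pattern_bytes/) ? 1 : 0;
my $char_vs_bytes  = ($name_bytes =~ /$pattern_char/)  ? 1 : 0;

# Decoding the file name first puts both sides in character form.
my $name_chars = decode('UTF-8', $name_bytes);
(my $renamed = $name_chars) =~ s/$pattern_char/-machine_/;

# Encode back to bytes before handing the name to rename().
my $renamed_bytes = encode('UTF-8', $renamed);

print "bytes vs bytes: $bytes_vs_bytes\n";
print "char vs bytes:  $char_vs_bytes\n";
print "renamed:        $renamed_bytes\n";
```

In other words, either keep everything as bytes (as the original script does), or decode everything that comes in and encode everything that goes out; mixing the two is what makes the match fail.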
