RTF to text conversion using Perl

Posted 2019-03-01 17:58

Question:

Can somebody tell me how to convert an RTF file into text, keeping all the tags, tables, and formatted data, using Perl?

@Ahmad Bilal, @petersergeant: I have been using the code below for RTF to TXT conversion, and it does produce text. The problem is that it does not capture tables or images, and not all of the entities in the input file are converted.

use 5.8.0;
use strict;
use warnings;
use Getopt::Long;
use Pod::Usage;
use RTF::HTMLConverter;

#-------------------------------------------------------------------
#Variable Declarations
#-------------------------------------------------------------------
my $tempfile = "";
my $Outfile = "";
my $txtfile = "";
my $URL = "";
my $Format = "";
my $TreeBuilder = "";
my $Parsed = "";
my $line = "";


my %opts;
GetOptions(
  "help|h|?"     => \$opts{help},
  "man|m"        => \$opts{man},
  "dom=s"        => \$opts{dom},
  "noimages|n"   => \$opts{noimages},
  "imagedir|d=s" => \$opts{imagedir},
  "imageuri|u=s" => \$opts{imageuri},
  "encoding|e=s" => \$opts{encoding},
  "indented|i=i" => \$opts{indented},
);

pod2usage(-verbose => 1, -exitval => 0) if $opts{help};
pod2usage(-verbose => 2, -exitval => 0) if $opts{man};

my %params;
if($opts{dom}){
  eval "require $opts{dom}";
  die $@ if $@;
  $params{DOMImplementation} = $opts{dom};
}else{
  eval { require XML::GDOME };
  if($@){
    eval { require XML::DOM };
    die "Can't load either XML::GDOME or XML::DOM\n" if $@;
    $params{DOMImplementation} = 'XML::DOM';
  }
}

if($opts{noimages}){
  $params{discard_images} = 1;
}else{
  $params{image_dir} = $opts{imagedir} if defined $opts{imagedir};
  $params{image_uri} = $opts{imageuri} if defined $opts{imageuri};
}

$params{codepage} = $opts{encoding} if $opts{encoding};
$params{formatting} = $opts{indented} if defined $opts{indented};

#-----------------------------------------------
# Converting RTF to HTML
#-----------------------------------------------

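if(defined $ARGV[0]){
  # Three-argument open: characters in the file name are never read as a mode
  open(FR, "<", $ARGV[0]) or die "Can't open '$ARGV[0]': $!!\n";
    $params{in} = \*FR;
    $tempfile = $ARGV[0];
    # Anchor on the extension; /^(.*?)rtf/ stopped at the first "rtf" anywhere in the path
    $tempfile =~ /^(.*\.)rtf$/i or die "Expected a .rtf file name, got '$tempfile'\n";
    $Outfile = $1."html";
    $txtfile = $1."txt";

  open(FW, ">", $Outfile) or die "Can't open '$Outfile': $!!\n";
   $params{out} = \*FW;
   print "\n$Outfile - HTML Created\n";

}else{
  # The HTML and TXT names are derived from the input name, so an input file is required
  die "Usage: $0 file.rtf\n";
}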

my $parser = RTF::HTMLConverter->new(%params);
$parser->parse();


close FW;

#-----------------------------------------------
# Opening HTML and TXT files
#-----------------------------------------------

open (FILE1, ">", $txtfile) or die "Can't open '$txtfile': $!!\n";
open (FILE2, "<", $Outfile) or die "Can't open '$Outfile': $!!\n";

#-----------------------------------------------
# Converting HTML to TXT file
#-----------------------------------------------

local $/ = undef;
while ($line = <FILE2>) {
    $line =~ s/\n//g;
    $line =~ s/(<!DOCTYPE HTML.*><html><head>.*<\/style>)/<sectd>/;
    $line =~ s/<font.*?>//g;
    $line =~ s/<\/font>//g;
    $line =~ s/<table .*?>/\n<table>\n/g;
    $line =~ s/<\/table>/\n<\/table>/g;
    $line =~ s/<td .*?>/\n<td>/g;
    $line =~ s/<tr>/\n<tr>/g;
    $line =~ s/<\/tr>/\n<\/tr>/g;
    $line =~ s/<ul.*?>/\n<ul>/g;
    $line =~ s/<li.*?>/\n<li>/g;
    $line =~ s/<\/ul>/\n<\/ul>/g;
    $line =~ s/<\/body><\/html>//g;
    $line =~ s/<p.*?>/\n<p>/g;
    $line =~ s/<p>(&nbsp;|\*|\s)+<\/p>//g;
    $line =~ s/&nbsp;//g;
  $line =~ s/(<sectd>\n?.*?)<\/head><body>/$1/g;

#-------------------
#  Entity Conversion
#-------------------
  $line =~ s/&rsquo;/&#x2019;/g;   # &rsquo; is U+2019 (right single quote), not U+2018
  $line =~ s/“/&#x201C;/g;
  $line =~ s/”/&#x201D;/g;
  $line =~ s/¶/&para;/g;

    print FILE1 $line;
}

print "$txtfile - TXT file Created \n";

close FILE1;
close FILE2;

unlink ("$Outfile");

Answer 1:

I am the author of the linked module. Don't use it. If at all possible, shell out to a real RTF-to-text converter such as Pandoc.
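
Shelling out from Perl could look roughly like the sketch below. It assumes a pandoc binary on the PATH that accepts rtf as an input format (older Pandoc releases do not have an RTF reader). The helper names pandoc_args and rtf_to_text are made up for this illustration, and the output name is simply the input name with .txt in place of .rtf.

```perl
use strict;
use warnings;

# Build the pandoc argument list (hypothetical helper, kept separate
# so the command can be inspected without running it).
sub pandoc_args {
    my ($rtf_path, $txt_path) = @_;
    return ('pandoc', '--from=rtf', '--to=plain', '-o', $txt_path, $rtf_path);
}

# Convert one .rtf file and return the name of the .txt file written.
sub rtf_to_text {
    my ($rtf_path) = @_;
    (my $txt_path = $rtf_path) =~ s/\.rtf\z/.txt/i
        or die "Expected a .rtf file name, got '$rtf_path'\n";

    # List-form system() bypasses the shell, so spaces or quotes
    # in file names need no escaping.
    system(pandoc_args($rtf_path, $txt_path)) == 0
        or die "pandoc failed (status $?)\n";
    return $txt_path;
}

# Only run when file names are given on the command line.
if (@ARGV) {
    print rtf_to_text($_), " - TXT file created\n" for @ARGV;
}
```

Saved under an arbitrary name such as rtf2txt.pl, this would be run as `perl rtf2txt.pl report.rtf`. How tables and images come out in the plain-text output depends on Pandoc, so it is worth checking against a sample document first.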



Answer 2:

You need to use a module like this:

http://search.cpan.org/~sargie/RTF-Parser-1.12/lib/RTF/TEXT/Converter.pm
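
A minimal sketch of using that module, based on the constructor and parse_string call shown in its documentation. It assumes the RTF-Parser distribution is installed from CPAN; the wrapper name rtf_string_to_text is invented for this example, and the module is loaded inside the wrapper so the file still compiles when the distribution is missing.

```perl
use strict;
use warnings;

# Convert an RTF document held in a string to plain text.
# rtf_string_to_text is a hypothetical wrapper, not part of the module.
sub rtf_string_to_text {
    my ($rtf) = @_;

    # Loaded at call time; dies here if RTF-Parser is not installed.
    require RTF::TEXT::Converter;

    my $text = '';
    # Passing a scalar reference as "output" collects the result in $text.
    my $converter = RTF::TEXT::Converter->new(output => \$text);
    $converter->parse_string($rtf);
    return $text;
}

# Only run when a file name is given on the command line.
if (@ARGV) {
    open(my $in, '<', $ARGV[0]) or die "Can't open '$ARGV[0]': $!\n";
    my $rtf = do { local $/; <$in> };   # slurp the whole file
    close $in;
    print rtf_string_to_text($rtf);
}
```

This goes straight from RTF to text without the intermediate HTML file. Whether it handles tables, images, and entities any better than the script in the question is not something this sketch settles.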