HTML to RTF Converter for .NET

Posted 2019-02-17 14:39

I've already seen lots of posts on the site for RTF to HTML and some other posts talking about some HTML to RTF converters, but I'm really trying to get a full breakdown of what is considered the most widely used commercial product, open source product or if people recommend going home grown. Apologies if you consider this a duplicate question, but I'm trying to create a product matrix to see what is the most viable for our application. I also think this would be helpful for others.

The converter would be used in an ASP.NET 2.0 application (we're upgrading to 3.5 shortly but still sticking with WebForms) using SQL Server 2005 (soon 2008) as the DB.

From reading a few posts, SautinSoft appears to be popular as a commercial component. Are there other commercial components that you'd recommend for converting HTML to RTF? Price does matter, but even if it's a little on the expensive side, please list it.

For open source, I read that OpenOffice.org can be run as a service so that it can convert files. However, this appears to be only Java based. I imagine, I'd need some kind of interop to use this? What .NET open source components, if any, are out there for converting HTML to RTF?

For home grown, is an XSLT the way to go with XHTML? If so, what component do you recommend for generating XHTML? Otherwise, what other home grown avenues do you recommend?

Also, please note that I currently don't care so much about RTF to HTML. If a commercial component offers this and the price is still the same, fine, otherwise please don't mention it.

5 answers
我命由我不由天
#2 · 2019-02-17 15:30

I just came across a WYSIWYG rich text editor (RTE) for the web that also has an HTML to RTF converter: Cute Editor for .NET. Does anyone have any experience with this component? My main experience with web-based RTEs has been with CKEditor (formerly FCKeditor) and TinyMCE, but as far as I can tell neither of them has an HTML to RTF converter built in.

ゆ 、 Hurt°
#3 · 2019-02-17 15:31

For what it's worth, and in no particular order:

A while ago I wanted to export to RTF and then import from RTF, with the RTF in question being manipulated by MS Word.

The first problem is that RTF is not an open standard. It is an internal MS standard, and therefore they alter it as and when they like and do not generally worry about compatibility. Currently the versions of RTF are 1.3 to 1.9 and they are all different. Internally RTF uses twips for measurement, just for good measure.
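For anyone who has not met twips before: a twip is one twentieth of a point, so there are 1440 twips to the inch, and RTF font sizes (the `\fs` control word) are given in half-points rather than twips. A minimal sketch of the arithmetic follows; the class and method names are made up for illustration and are not part of any library.

```csharp
using System;

// Hypothetical helper: unit conversions that come up when emitting RTF by hand.
public static class RtfUnits
{
    public const int TwipsPerPoint = 20;   // 1 twip = 1/20 point
    public const int TwipsPerInch = 1440;  // 72 points per inch * 20

    // Lengths such as margins and indents are written in twips.
    public static int PointsToTwips(double points)
    {
        return (int)Math.Round(points * TwipsPerPoint);
    }

    public static int InchesToTwips(double inches)
    {
        return (int)Math.Round(inches * TwipsPerInch);
    }

    // Font sizes (\fsN) are written in half-points, so 12pt text is \fs24.
    public static int PointsToHalfPoints(double points)
    {
        return (int)Math.Round(points * 2);
    }
}
```

For example, a one-inch left margin would be written as `\margl1440`, and 12pt text as `\fs24`.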

I bought the O'Reilly pocket book on the subject which helped and read a lot of the MS documentation which is good, but there is a lot of it and lots for each version.

Because of the way RTF is coded, manipulating it with regex is incredibly hard work and needs careful handling and concentration to test and get working. I used a Mac editor that had built-in regex support, so I could steadily test each section and build it into the code.

Because of the number of versions there is a lot of incompatibility between them, but there is also a lot of commonality, and in the end it was neither especially hard nor especially easy to get where I wanted (after about a week's reading and a week's coding), producing a really simple version.

I never found a commercial solution, but I had to have a free one because of budget, which cut a lot out. If you do buy, take great care in choosing one to make sure it does what you want and has support.

I wasn't coming from where you are (HTML/XML/XHTML); I was converting CSV formats into RTF.

I am not sure whether I would advise DIY or buying. Probably DIY on balance, but your own circumstances will dictate that.

Edit: one more thing: going from content to RTF is easier than vice versa.

BTW, I'm not criticising MS for the RTF versions; hey, it's theirs and proprietary, so they can do what they like.

做自己的国王
#4 · 2019-02-17 15:31

TL;DR: I recommend using the OpenXml format and the HtmlToOpenXml NuGet package if possible.


Microsoft Word COM

I haven't really looked into this option much, as my use case is to run the conversion on a server, which makes COM components a poor choice.


XHTML2RTF

As @IAmTimCorey mentioned, you can use this CodeProject library.

Disadvantages are:

  • Limited supported HTML and CSS
  • Not really .NET
  • ...

Windows Forms Web Browser

As @Jerry mentioned you can use the Windows Forms WebBrowser control.

Disadvantages are:

  • Reference to System.Windows.Forms
  • Uses copy & paste (problematic for multithreading)
  • Only works in an STA thread

Not supported features include:

  • Fonts
  • Colors
  • Numbered lists
  • Strikethrough (del element)
  • ...

DevExpress

Code sample by "Paul V" from the DevExpress support center (03.02.2015):

public String ConvertRTFToHTML(String RTF)
{   
    MemoryStream ms = new MemoryStream();
    StreamWriter writer = new StreamWriter(ms);
    writer.Write(RTF);
    writer.Flush();
    ms.Position = 0;
    String output = "";
    HtmlEditorExtension.Import(HtmlEditorImportFormat.Rtf, ms, (s, enumerable) => output = s);

    return output;
}

public String ConvertHTMLToRTF(String html)
{
    MemoryStream ms = new MemoryStream();
    var editor = new ASPxHtmlEditor { Html = html };

    editor.Export(HtmlEditorExportFormat.Rtf, ms);

    ms.Position = 0;
    StreamReader reader = new StreamReader(ms);

    return reader.ReadToEnd();
}

Or you could use the RichEditDocumentServer type as shown in this example.

Unknown what actually is supported.

Disadvantages are:

  • Price
  • Quite a lot of references for one small thing
  • More?

Not supported features include:

  • Strikethrough (del element)

Sautinsoft

public string ConvertHTMLToRTF(string html)
{
    SautinSoft.HtmlToRtf h = new SautinSoft.HtmlToRtf();
    return h.ConvertString(html);
}

public string ConvertRTFToHTML(string rtf)
{
    SautinSoft.RtfToHtml r = new SautinSoft.RtfToHtml();
    byte[] bytes = Encoding.ASCII.GetBytes(rtf);
    r.OpenDocx(bytes);
    return r.ToHtml();
}

More examples and configuration options can be found here and here.

The following are supported:

  • HTML 3.2
  • HTML 4.01
  • HTML 5
  • CSS
  • XHTML

Disadvantages are:

  • I'm not sure how active the development is
  • Price

Usage knowledgebase:


DIY

If you only wanted to support limited functionality you could write your own converter. I would not recommend this if the supported feature set is too large.

I have a small sample project here, but in its current state it is only for educational purposes.
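To give a feel for the scale of a deliberately limited converter, here is a minimal sketch (separate from the sample project mentioned above). It assumes the input is a well-formed XHTML fragment with no named entities such as `&nbsp;`, and it only understands `p`, `br`, `b`/`strong`, `i`/`em` and `u`; everything else is passed through as plain text. The class name `MiniHtmlToRtf` and the Arial font table are illustrative choices, not taken from any of the products listed here.

```csharp
using System.Text;
using System.Xml.Linq;

// Hypothetical, deliberately tiny XHTML-to-RTF converter for illustration only.
public static class MiniHtmlToRtf
{
    public static string ToRtf(string xhtmlFragment)
    {
        // Wrap the fragment so that several top-level elements still parse as one document.
        XElement root = XElement.Parse(
            "<root>" + xhtmlFragment + "</root>", LoadOptions.PreserveWhitespace);

        var rtf = new StringBuilder(@"{\rtf1\ansi\deff0{\fonttbl{\f0 Arial;}}");
        WriteChildren(root, rtf);
        rtf.Append('}');
        return rtf.ToString();
    }

    private static void WriteChildren(XElement parent, StringBuilder rtf)
    {
        foreach (XNode node in parent.Nodes())
        {
            XText text = node as XText;
            XElement element = node as XElement;
            if (text != null) AppendEscaped(text.Value, rtf);
            else if (element != null) WriteElement(element, rtf);
            // Comments and processing instructions are ignored.
        }
    }

    private static void WriteElement(XElement element, StringBuilder rtf)
    {
        switch (element.Name.LocalName.ToLowerInvariant())
        {
            case "b":
            case "strong":
                WriteGroup(@"\b ", element, rtf);
                break;
            case "i":
            case "em":
                WriteGroup(@"\i ", element, rtf);
                break;
            case "u":
                WriteGroup(@"\ul ", element, rtf);
                break;
            case "br":
                rtf.Append(@"\line ");
                break;
            case "p":
                WriteChildren(element, rtf);
                rtf.Append(@"\par ");
                break;
            default:
                // Unknown tags: keep their text, drop their formatting.
                WriteChildren(element, rtf);
                break;
        }
    }

    // An RTF group limits the formatting to the element's own content.
    private static void WriteGroup(string controlWord, XElement element, StringBuilder rtf)
    {
        rtf.Append('{').Append(controlWord);
        WriteChildren(element, rtf);
        rtf.Append('}');
    }

    private static void AppendEscaped(string text, StringBuilder rtf)
    {
        foreach (char c in text)
        {
            if (c == '\\' || c == '{' || c == '}') rtf.Append('\\').Append(c);
            else if (c == '\r') continue;                       // RTF ignores raw line breaks
            else if (c == '\n' || c == '\t') rtf.Append(' ');   // treat them as word separators
            else if (c < 128) rtf.Append(c);
            else rtf.Append(@"\u").Append(unchecked((short)c)).Append('?'); // \uN with '?' fallback
        }
    }
}
```

Calling `MiniHtmlToRtf.ToRtf("<p>Hello <b>world</b></p>")` returns `{\rtf1\ansi\deff0{\fonttbl{\f0 Arial;}}Hello {\b world}\par }`. Fonts, colors, lists and tables each need their own handling on top of this, which is where the effort grows quickly.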


OpenXml

If the OpenXml format is also OK for your use case, you can use the HtmlToOpenXml NuGet package. It's free and supported all the features I tested the other solutions against.

The project is based on the Open XML SDK by Microsoft and seems active.

public static byte[] ConvertHtmlToOpenXml(string html)
{
    using (var generatedDocument = new MemoryStream())
    {
        using (var package = WordprocessingDocument.Create(generatedDocument, WordprocessingDocumentType.Document))
        {
            var mainPart = package.MainDocumentPart;
            if (mainPart == null)
            {
                mainPart = package.AddMainDocumentPart();
                new Document(new Body()).Save(mainPart);
            }

            var converter = new HtmlConverter(mainPart);
            converter.ParseHtml(html);

            mainPart.Document.Save();
        }

        return generatedDocument.ToArray();
    }
}

等我变得足够好
#5 · 2019-02-17 15:32

I would recommend doing it yourself, as the task is not really that complex. Firstly, the easiest way to convert one XML format into another XML format is with an XSLT. Converting XML documents in C# is super easy.

Here is a good MSDN blog post to get you started. Mike even mentions that it was easier to do this by hand than to deal with a third party.

link

Actually, I already answered this question here. Guess that makes this a duplicate.
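As a rough illustration of how little C# is needed to run an XSLT, here is a sketch using the built-in `XslCompiledTransform` class. The helper name `XsltSketch` and the toy stylesheet are assumptions made up for this example; the stylesheet only handles `b` and does not escape RTF special characters, so a real XHTML-to-RTF stylesheet would be far larger.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical helper showing the basic load/transform cycle.
public static class XsltSketch
{
    // Toy stylesheet: emits plain text, wraps the document in an RTF group
    // and maps <b> to an RTF bold group. Not a complete converter.
    public const string ToyStylesheet =
@"<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:output method=""text""/>
  <xsl:template match=""/"">{\rtf1\ansi <xsl:apply-templates/>}</xsl:template>
  <xsl:template match=""b"">{\b <xsl:apply-templates/>}</xsl:template>
</xsl:stylesheet>";

    public static string Transform(string xml, string xslt)
    {
        var transform = new XslCompiledTransform();

        // Compile the stylesheet once; a compiled transform can be reused.
        using (XmlReader stylesheet = XmlReader.Create(new StringReader(xslt)))
        {
            transform.Load(stylesheet);
        }

        // Run the transform and capture the text output in memory.
        using (XmlReader input = XmlReader.Create(new StringReader(xml)))
        using (var output = new StringWriter())
        {
            transform.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

Usage would look like `XsltSketch.Transform("<p>Hi <b>there</b></p>", XsltSketch.ToyStylesheet)`. The input has to be well-formed XML, so ordinary HTML needs to be cleaned up into XHTML first.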

Emotional °昔
#6 · 2019-02-17 15:36

Since I'm required to implement some mailmerge capabilities with rich-text formatting on a Web application, I thought it'd be nice to share my experiences.

Basically, I explored two alternatives:

  • using Google Docs API to leverage Google Docs capabilities
  • using XSLT, as shown in this essay

Google Docs API works well. Problem is, when you upload an HTML document with page breaks, like this:

<p style="page-break-before:always;display:none;"/>

and ask Google to convert the doc to RTF, you lose all the breaks, which does not fit my requirements. However, if page breaks aren't an issue for you, you might check this solution out.

The XSLT solution works... sort of.

It works if you reference the MSXML3 COM object directly, bypassing the System.Xml classes. Otherwise I couldn't make it work. Moreover, it seems to honor only basic formatting and tags, disregarding text color, size and the like. However, it honors page breaks. :-)

Here's a quick library I wrote, using tidy.net to force HTML to XHTML conversion. Hope it helps.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace ADDS.Mailmerge
{

    public class XHTML2RTF
    {

        MSXML2.FreeThreadedDOMDocument _xslDoc;
        MSXML2.FreeThreadedDOMDocument _xmlDoc;
        MSXML2.IXSLProcessor _xslProcessor;
        MSXML2.XSLTemplate _xslTemplate;
        static XHTML2RTF instance = null;
        static readonly object padlock = new object();

        XHTML2RTF()
        {
            _xslDoc = new MSXML2.FreeThreadedDOMDocument();
            //XSLData.xhtml2rtf is a resource file 
            // containing XSL for transformation
            // I got XSL from here: 
            // http://www.codeproject.com/KB/HTML/XHTML2RTF.aspx
            _xslDoc.loadXML(XSLData.xhtml2rtf);
            _xmlDoc = new MSXML2.FreeThreadedDOMDocument();
            _xslTemplate = new MSXML2.XSLTemplate();
            _xslTemplate.stylesheet = _xslDoc;
            _xslProcessor = _xslTemplate.createProcessor();
        }

        public string ConvertToRTF(string xhtmlData)
        {
            try
            {
                string sXhtml = "";
                // Tidy.NET repairs the incoming markup and emits well-formed XHTML.
                TidyNet.Tidy tidy = new TidyNet.Tidy();
                tidy.Options.XmlOut = true;
                tidy.Options.Xhtml = true;
                using (MemoryStream ms = new MemoryStream(Encoding.UTF8.GetBytes(xhtmlData)))
                {
                    using (MemoryStream sw = new MemoryStream())
                    {
                        // Parse warnings and errors are collected in "messages".
                        TidyNet.TidyMessageCollection messages = new TidyNet.TidyMessageCollection();
                        tidy.Parse(ms, sw, messages);
                        sXhtml = Encoding.UTF8.GetString(sw.ToArray());
                    }
                }

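                // The DOM document and XSL processor are shared by the singleton,
                // so serialize access to them across threads.
                lock (padlock)
                {
                    _xmlDoc.loadXML(sXhtml);
                    _xslProcessor.input = _xmlDoc;
                    _xslProcessor.transform();
                    return _xslProcessor.output.ToString();
                }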
            }
            catch (Exception exc)
            {
                throw new Exception("Error in xhtml conversion. ", exc);
            }
        }

        public static XHTML2RTF Instance
        {
            get
            {
                lock (padlock)
                {
                    if (instance == null)
                    {
                        instance = new XHTML2RTF();
                    }
                    return instance;
                }
            }
        }
    }
}