How to convert HTML with mathjax into Latex using

Posted 2019-01-07 14:07

Question:

I have some HTML documents with MathJax equations, and I want to convert them to LaTeX, and then to PDF. I'd like to use Pandoc.

However, Pandoc replaces $ with \$ and it replaces \ in formulas with \textbackslash{}.

Is it possible to get Pandoc to pass MathJax formulas literally from HTML to LaTeX?

Answer 1:

With pandoc 1.12.2 (the latest version at the time of writing), you can do this:

pandoc -f html+tex_math_dollars+tex_math_single_backslash -t latex

Much nicer! If you don't want to convert math delimited by \( and \), just do

pandoc -f html+tex_math_dollars -t latex
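
For a concrete run, the same reader extensions can be applied to a file on disk. This is only a sketch: the file names input.html and output.tex and the sample content are made up for illustration, and it assumes an installed pandoc that supports these extensions on the HTML reader (1.12.2 or later, per the above).

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)

# Hypothetical sample input: inline and display formulas.
cat > "$workdir/input.html" <<'EOF'
<p>Inline math $x^2$ and \(y_1\).</p>
<p>$$\int_0^1 f(t)\,dt$$</p>
EOF

# Convert to LaTeX, reading $...$, $$...$$, \(...\) and \[...\] as math.
if command -v pandoc >/dev/null 2>&1; then
  pandoc -f html+tex_math_dollars+tex_math_single_backslash -t latex \
    "$workdir/input.html" -o "$workdir/output.tex"
else
  echo "pandoc not found; skipping conversion" >&2
fi
```

To go straight to PDF, giving an output name ending in .pdf instead of .tex should work, provided a LaTeX engine such as pdflatex is installed.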


Answer 2:

It's not an easy task. Here's a solution that should work, provided you only use $ and $$ as math delimiters, and assuming your document doesn't contain any other uses of $. (If you can't assume that, you can try adjusting the perl regex in what follows.)

Step 1: Install the Haskell Platform, if you don't have it already, and 'cabal install pandoc' to get the pandoc library. (If you installed pandoc with the binary installer, you only have the executable, not the Haskell library.)

Step 2: Now write a small Haskell script -- we'll call it fixmath.hs:

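import Text.Pandoc

-- toJsonFilter reads the JSON-encoded document from stdin, applies
-- fixmath to it, and writes the resulting JSON to stdout.
main :: IO ()
main = toJsonFilter fixmath

-- Rewrite the marked comments at both the inline and the block level.
fixmath :: Block -> Block
fixmath = bottomUp fixmathBlock . bottomUp fixmathInline

-- A raw HTML comment starting with <!--MATH becomes raw TeX.
-- Dropping the last three characters removes the closing -->.
fixmathInline :: Inline -> Inline
fixmathInline (RawInline "html" ('<':'!':'-':'-':'M':'A':'T':'H':xs)) =
  RawInline "tex" $ take (length xs - 3) xs
fixmathInline x = x

-- Same rewrite for comments that were parsed as whole blocks.
fixmathBlock :: Block -> Block
fixmathBlock (RawBlock "html" ('<':'!':'-':'-':'M':'A':'T':'H':xs)) =
  RawBlock "tex" $ take (length xs - 3) xs
fixmathBlock x = x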

Compile this:

ghc --make fixmath.hs

This will give you an executable fixmath. Now, assuming your input file is input.html, the following command should convert it to LaTeX with the math intact, putting the result in output.tex:

cat input.html | \
perl -0pe 's/(\$\$?[^\$]+\$\$?)/\<!--MATH$1-->/gm' | \
pandoc -s --parse-raw -f html -t json | \
./fixmath | \
pandoc -f json -t latex -s > output.tex

The first part is a perl one-liner that puts your math bits in special HTML comments marked "MATH". The second part parses the HTML into a JSON representation of the Pandoc data structure corresponding to the document. Then fixmath transforms this structure, changing the special HTML comments into raw LaTeX blocks and inlines. (See Scripting with pandoc for an explanation.) Finally we convert from JSON back to LaTeX.