I have some HTML documents with MathJax equations, and I want to convert them to LaTeX, and then to PDF. I'd like to use Pandoc.
However, Pandoc replaces $ with \$, and it replaces \ in formulas with \textbackslash{}.
Is it possible to get Pandoc to pass MathJax formulas literally from HTML to LaTeX?
With the latest version of pandoc (1.12.2), you can do this:
pandoc -f html+tex_math_dollars+tex_math_single_backslash -t latex
Much nicer! If you don't want to convert math delimited by \( and \), just do
pandoc -f html+tex_math_dollars -t latex
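To try this out without touching your real documents, here is a minimal sketch that feeds a one-line HTML snippet through the first command. The sample text and the variable names are made up for illustration, and it assumes a pandoc new enough to accept these extensions on HTML input.

```shell
# Hypothetical one-line sample; the formulas are only illustrations.
sample='<p>Euler: $e^{i\pi}+1=0$ and \(x^2\)</p>'

# Read HTML with both math extensions enabled and write LaTeX.
out=$(printf '%s\n' "$sample" |
  pandoc -f html+tex_math_dollars+tex_math_single_backslash -t latex)
printf '%s\n' "$out"
```

The delimiters pandoc prints around the math may differ between versions, but the formula bodies should come through without any \textbackslash{}.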
It's not an easy task. Here's a solution that should work, provided you only use $ and $$ as math delimiters, and assuming your document doesn't contain any other uses of $. (If you can't assume that, you can try adjusting the perl regex in what follows.)
Step 1: Install the Haskell Platform, if you don't have it already, and 'cabal install pandoc' to get the pandoc library. (If you installed pandoc with the binary installer, you only have the executable, not the Haskell library.)
Step 2: Now write a small Haskell script -- we'll call it fixmath.hs:
import Text.Pandoc

main = toJsonFilter fixmath

-- Rewrite the marked comments in both block and inline positions.
fixmath :: Block -> Block
fixmath = bottomUp fixmathBlock . bottomUp fixmathInline

-- xs is everything after "<!--MATH"; dropping the last 3 characters
-- removes the closing "-->" and leaves the original math text.
fixmathInline :: Inline -> Inline
fixmathInline (RawInline "html" ('<':'!':'-':'-':'M':'A':'T':'H':xs)) =
  RawInline "tex" $ take (length xs - 3) xs
fixmathInline x = x

fixmathBlock :: Block -> Block
fixmathBlock (RawBlock "html" ('<':'!':'-':'-':'M':'A':'T':'H':xs)) =
  RawBlock "tex" $ take (length xs - 3) xs
fixmathBlock x = x
Compile this:
ghc --make fixmath.hs
This will give you an executable fixmath. Now, assuming your input file is input.html, the following command should convert it to LaTeX with the math intact, putting the result in output.tex:
cat input.html | \
perl -0pe 's/(\$\$?[^\$]+\$\$?)/\<!--MATH$1-->/gm' | \
pandoc -s --parse-raw -f html -t json | \
./fixmath | \
pandoc -f json -t latex -s > output.tex
The first part is a perl one-liner that puts your math bits in special HTML comments marked "MATH". The second part parses the HTML into a JSON representation of the Pandoc data structure corresponding to the document. Then fixmath transforms this structure, changing the special HTML comments into raw LaTeX blocks and inlines. (See Scripting with pandoc for an explanation.) Finally we convert from JSON back to LaTeX.