LaTeX to Word conversion with pandoc

Converting arbitrary LaTeX documents to Word documents without errors is probably impossible. However, pandoc can get you surprisingly far.

The basic idea of pandoc is that you specify the input (in our case, a .tex file) and the output format (in our case, docx). If you have a simple LaTeX document, this is enough:

pandoc main.tex -o main.docx

If you are not happy with the output, you can tinker with pandoc's internal representation of the text (called an abstract syntax tree). In the following, I will explain how to fix problems with

  1. the citations,
  2. the figure captions,
  3. the equation references, and
  4. even supplementary figures.

Citations

In LaTeX, citations are inserted using \cite{Dumbledore1981}, and a line \bibliography{references.bib} specifies the BibTeX file that contains all the details about the references. Unfortunately, pandoc does not automatically extract the BibTeX file location, so you need to pass it to pandoc explicitly:

pandoc main.tex \
  --citeproc --bibliography=references.bib \
  -o main.docx
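
If you do not want to hard-code the file name, the \bibliography line can be read from the tex source and turned into pandoc arguments. The following is only a minimal sketch: the function names are made up for illustration, it looks at plain \bibliography{...} commands only, and it does not skip commented-out lines.

```python
import re


def find_bibliography_files(tex_source: str) -> list[str]:
    """Return the BibTeX files named in \\bibliography{...} commands."""
    files = []
    # The "{" directly after "bibliography" keeps \bibliographystyle out.
    for match in re.finditer(r"\\bibliography\{([^}]*)\}", tex_source):
        # One command may list several comma-separated files.
        for name in match.group(1).split(","):
            name = name.strip()
            if not name:
                continue
            # LaTeX allows omitting the extension; pandoc needs a file name.
            if not name.endswith(".bib"):
                name += ".bib"
            files.append(name)
    return files


def pandoc_citation_args(tex_source: str) -> list[str]:
    """Build the citation-related pandoc arguments (empty if no .bib found)."""
    files = find_bibliography_files(tex_source)
    if not files:
        return []
    return ["--citeproc"] + [f"--bibliography={bib}" for bib in files]
```

The returned list can be appended to the pandoc command line, for example when calling pandoc through subprocess.run.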

Figures

By default, pandoc does a decent job detecting figures and inserting them in Word. However, for me, there were two problems: for some figures the caption was missing, and the references to those figures in the text were not resolved.

Both problems were easy to solve by replacing all \begin{figure*} ... \end{figure*} with \begin{figure} ... \end{figure}.
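
The replacement can be automated on a copy of the source. A minimal sketch, assuming the tex source is available as a string (the function name unstar_figures is hypothetical):

```python
import re


def unstar_figures(tex_source: str) -> str:
    """Rewrite \\begin{figure*} / \\end{figure*} as their unstarred forms."""
    # \1 re-inserts "begin" or "end"; only the environment name changes.
    return re.sub(r"\\(begin|end)\{figure\*\}", r"\\\1{figure}", tex_source)
```

This version only touches the environment delimiters, so a stray "figure*" elsewhere in the text is left alone.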

Equations

I was really impressed by the conversion of the LaTeX equations to Word. Even for complex expressions, pandoc nailed the conversion.

I did get one error message that pandoc did not know how to handle the \Bigl( command. You can fix such issues by

  • manually removing the offending commands from the tex source or
  • redefining them in the preamble as no-ops: \newcommand{\Bigl}{}.
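
The first option can also be scripted. The sketch below strips the delimiter-sizing commands (\big, \Big, \bigg, \Bigg and their l/r/m variants) and keeps the delimiters themselves. It assumes these are the only offending commands; the function name is made up for illustration.

```python
import re

# \big, \Big, \bigg, \Bigg, optionally followed by l, r, or m.
# The lookahead keeps longer commands such as \bigcup intact.
_SIZING = re.compile(r"\\(?:big|Big|bigg|Bigg)[lrm]?(?![A-Za-z])")


def strip_delimiter_sizing(tex_source: str) -> str:
    """Remove delimiter-sizing commands, leaving the delimiters in place."""
    return _SIZING.sub("", tex_source)
```

The equations lose their manual sizing, which affects the appearance in Word but not the content.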

Pandoc does not natively support equation numbers, but the pandoc-crossref filter helps out. It creates a table with the equation on the left and the equation number on the right. To use pandoc-crossref, you need to install it; then you can call

pandoc main.tex \
  -F pandoc-crossref -M autoEqnLabels \
  -o main.docx

Although pandoc-crossref does not officially support LaTeX as an input format, it worked fine, except for resolving the equation references. I solved the problem by writing my own Lua filter based on the resolve-references.lua filter from Open Journals.

My Lua filter is available as a gist. Save it in the same directory as the tex file and call

pandoc main.tex \
  --lua-filter resolve_equation_labels.lua \
  -F pandoc-crossref -M autoEqnLabels \
  -o main.docx
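
To illustrate what "resolving" means here, the following sketch does the same bookkeeping directly on the tex source: it numbers the equation environments in order of appearance and substitutes the numbers for \eqref and \ref. This is not the Lua filter from the gist, which works on pandoc's abstract syntax tree. It is a simplified stand-in that only knows the plain equation environment, and its function names are hypothetical.

```python
import re

_EQUATION = re.compile(r"\\begin\{equation\}(.*?)\\end\{equation\}", re.DOTALL)


def number_equation_labels(tex_source: str) -> dict[str, int]:
    """Map each \\label inside an equation environment to its running number."""
    numbers = {}
    # Unlabelled equations still consume a number, as they do in LaTeX.
    for number, match in enumerate(_EQUATION.finditer(tex_source), start=1):
        label = re.search(r"\\label\{([^}]*)\}", match.group(1))
        if label:
            numbers[label.group(1)] = number
    return numbers


def resolve_equation_references(tex_source: str) -> str:
    """Replace \\eqref{x} with (n) and \\ref{x} with n for equation labels."""
    numbers = number_equation_labels(tex_source)

    def substitute(match):
        command, label = match.group(1), match.group(2)
        if label not in numbers:
            return match.group(0)  # leave figure/section references alone
        number = str(numbers[label])
        return f"({number})" if command == "eqref" else number

    return re.sub(r"\\(eqref|ref)\{([^}]*)\}", substitute, tex_source)
```

References to labels that do not belong to an equation are returned unchanged, so pandoc can still handle them.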

Supplementary Figures

My tex document contained additional figures at the end, which I included as follows:

% Defining a new counter
\newcounter{supfig}
\renewcommand\thesupfig{S\arabic{supfig}}

% ...

\begin{minipage}{\textwidth}
\centerline{\includegraphics[width=\textwidth]{figures/suppl_important_figure.pdf}}
\refstepcounter{supfig}
\label{fig:important_figure}
\vspace{.4cm}
{Suppl.\ Figure S\arabic{supfig}: Some important scientific results}
\end{minipage}\clearpage

Pandoc does not recognize the custom supfig counter, nor can it handle the \arabic{supfig} and the \label{fig:important_figure} commands. I resolved both problems by

  • adding the following commands in the preamble of the main.tex file

      \newcommand{\arabic}[1]{__||supfiganchor:display||__}
      \newcommand{\refstepcounter}[1]{__||supfiganchor:increment||__}
  • writing another Lua filter.

This last bit is fragile, as I use LaTeX commands to insert magic strings into the text, which I then detect in the Lua filter to work around the limitations of pandoc. So be warned and use the filter at your own risk!
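
The counter logic behind the magic strings is simple enough to show on plain text. The sketch below is not the Lua filter itself; it only mimics, under the assumption that the markers appear in document order, what such a filter has to do: bump a counter at every increment marker and print the current value at every display marker. The function name is hypothetical.

```python
import re

INCREMENT = "__||supfiganchor:increment||__"
DISPLAY = "__||supfiganchor:display||__"

# re.escape is needed because "|" is a regex metacharacter.
_MARKER = re.compile("|".join(re.escape(m) for m in (INCREMENT, DISPLAY)))


def resolve_supfig_anchors(text: str) -> str:
    """Replace the magic strings with a running supplementary-figure number."""
    counter = 0

    def substitute(match):
        nonlocal counter
        if match.group(0) == INCREMENT:
            counter += 1
            return ""  # the increment marker leaves no visible text
        return str(counter)

    return _MARKER.sub(substitute, text)
```

Because the counter depends on marker order, any reordering of the text between the LaTeX source and the filter would produce wrong numbers, which is one reason the approach is fragile.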

Summary

To keep my LaTeX document clean, I wrote a script that inserts the custom commands into the tex file right before the \begin{document} line and also replaces all instances of figure* with figure before calling pandoc:

MOCK_LATEX_COMMAND_FILE=latex_custom_commands.tex
TEX_FILE=main.tex

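python_prog=$(cat << 'EOF'
# Quoted heredoc: the shell leaves backslashes alone, so the regexes
# reach Python intact. File names are passed as arguments instead.
import re
import sys

tex_file, mock_file = sys.argv[1], sys.argv[2]
with open(tex_file) as src:
    lines = src.readlines()
with open(mock_file) as mocks:
    mock_commands = mocks.read()

# Matches \begin{document}, tolerating optional whitespace
begin_doc_regex = r"\s*\\begin\s*\{\s*document\s*\}"
replace_strings = [
    (r"figure\*", "figure"),
]
for line in lines:
    if re.match(begin_doc_regex, line):
        # Inject the mock commands right before \begin{document}
        print(mock_commands)
    for pattern, replacement in replace_strings:
        line = re.sub(pattern, replacement, line)
    print(line, end="")
EOF
)

python3 -c "$python_prog" "$TEX_FILE" "$MOCK_LATEX_COMMAND_FILE" | pandoc \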
    --from latex \
    --to docx \
    --lua-filter resolve_equation_labels.lua \
    --lua-filter resolve_suppl_fig_labels.lua \
    -F pandoc-crossref \
      -M autoEqnLabels \
    --citeproc --bibliography=references.bib \
    -o main.docx