\documentclass[11pt,letterpaper]{article}
\usepackage[letterpaper, top=1in, bottom=1in, left=1in, right=1.50in]{geometry}
\usepackage{xspace}
\usepackage{pifont}
\usepackage{url}
\usepackage{rotating}
\usepackage{tabularx}
% Enter the 21st century! Let's use utf8 and type accented characters directly.
\usepackage[utf8]{inputenc}
% T1 is a saner font encoding: keeps <, > and | looking right, and has code
% points for accented characters (replaces the default OT1)
\usepackage[T1]{fontenc}
% Modern font that allows ligatures to be encoded in such a way as to preserve
% searchability and cut-and-paste ability, e.g., "different" and "finally" have
% "ff" and "fi" each as two letters rather than a special font character.
\usepackage{lmodern}
% Need straight quotes in verbatim text.
\usepackage{upquote}
\usepackage{textcomp}
\newcommand\upquote[1]{\textquotesingle#1\textquotesingle}
\usepackage{alltt}
% This is needed to get bold in tt.
\renewcommand{\ttdefault}{txtt}
\newcommand{\bs}{\textbackslash{}}
% My own todo highlighting command
\usepackage{color}
\newcommand{\TODO}[1]{\emph{\textbf{\textcolor{red}{<TODO> #1 </TODO>}}}}
\newcommand{\New}{\textcolor{red}{[New]}\xspace}
\newcommand{\Changed}{\textcolor{red}{[Changed]}\xspace}
% Official typesetting of PORTAGEshared, now called Portage II
\newcommand{\PS}{PortageII\xspace}
% Try generating a PDF with coloured hyperlinks.
\newif\ifcolourlinks
%\colourlinksfalse
\colourlinkstrue
\ifcolourlinks
\usepackage{color}
\usepackage[colorlinks=true,
pdftex,
linktocpage,
backref=page,
pdftitle={PortageII 3.0 Tutorial},
pdfauthor={Joanis, Stewart, Larkin and Foster},
urlcolor=black % we use \url for code, not actual URLs
]{hyperref}
\else
\usepackage{hyperref}
\fi
% \code formats an inline code snippet without line breaking; it treats the
% underscore as a normal character. Use \url to format a code snippet with
% automatic line breaking, but then underscores are not rendered as characters
% on which copy/paste works.
% \code breaks for text containing underscores when used inside a footnote;
% for \code calls inside footnotes, use \us{} to specify an underscore.
% Use \upquote{quote-text} for straight single quotes around text within a \code
% call.
\def\code{\begingroup\catcode`\_=12 \codex}
\newcommand{\codex}[1]{\texttt{#1}\endgroup}
\chardef\us=`\_
\newcommand{\phs}{\tild{s}} % source phrase
\newcommand{\pht}{\tild{t}} % target phrase
\newcommand{\tip}{\textbf{Useful Tip \large{\ding{43}} }}
\newcommand{\margintip}{\marginpar[{\textbf{Tip \large{\ding{43}}}}]{\textbf{\reflectbox{\large{\ding{43}}} Tip}}}
\newcommand{\tipsummary}{\noindent\textbf{Tip summary \large{\ding{43}} }}
\newcommand{\tipend}{\textbf{ \reflectbox{\large{\ding{43}}}}}
\usepackage{ifpdf}
\ifpdf
\setlength{\pdfpagewidth}{8.5in}
\setlength{\pdfpageheight}{11in}
\else
\fi
\title{\PS Tutorial: \\
A detailed walk-through of the \\
experimental framework}
\date{Last updated June 2018}
\author{Eric Joanis, Darlene Stewart, Samuel Larkin, George Foster}
\begin{document}
\vfill
\maketitle
\vfill
\begin{center}
An adaptation of George Foster's \emph{Running Portage: A Toy Example} \\
to Samuel Larkin's experimental framework,\\
%updated to reflect recommended usage of \PS.
updated to reflect recommended usage of PortageII 4.0
\end{center}
\vfill
\vfill
\begin{center}
{~} \\ \footnotesize
Traitement multilingue de textes / Multilingual Text Processing \\
Centre de recherche en technologies numériques / Digital Technologies Research Centre \\
Conseil national de recherches Canada / National Research Council Canada \\
Copyright \copyright\ 2008, 2009, 2010, 2011, 2012, 2013, 2016, 2018, Sa Majesté la Reine du Chef du Canada
\\ Copyright \copyright\ 2008, 2009, 2010, 2011, 2012, 2013, 2016, 2018, Her Majesty in Right of Canada
\end{center}
\vfill
\newpage
%\vfill
\tableofcontents
%\vfill
\newpage
\section{Introduction}
This document describes how to run an experiment from end to end using the \PS
experimental framework. It is intended as a tutorial on using \PS, as well as a
starting point for further experiments. Although the framework automates most
of the steps described below, we go through them one by one to
better explain how to use the \PS software suite.
\PS can be viewed as a set of programs for turning a bilingual corpus into a
translation system. Here this process is illustrated with a small ``toy''
example of French to English translation, using text from the Hansard corpus.
The training corpus is too small for good translation, but is used the same way
a more realistic setup would be. Total running time is one to several hours.
\subsection{Making Sure \PS is Installed}
To begin, you must build or obtain the \PS executables and ensure that they are
in your path, by sourcing the \code{SETUP.bash} file as
customized for your environment during installation of \PS.\footnote{There is
also a \code{SETUP.tcsh} for users of that shell, but we strongly recommend
using bash. The examples in this document assume the use of bash.}
\code{SETUP.bash} also sets environment variable \code{\$PORTAGE} to the
directory where \PS is installed. Follow the
instructions in \code{INSTALL} before you proceed with this document.
To make sure \PS is installed properly, run:\footnote{At NRC, replace this
instance of \code{\$PORTAGE} by your sandbox.}
\begin{small}
\begin{alltt}
> \textbf{make -C \$PORTAGE/test-suite/unit-testing/check-installation}
\end{alltt}
\end{small}
You should see the message ``Installation successful'' near the end.
You can also try \code{canoe -h}, \code{tune.py -h}, \code{utokenize.pl -h},
\code{ce.pl -h}, and \code{filter-nc1.py -h}.
You should see usage information for each of these programs.
If you get error messages, then some part of your installation is incomplete.
See the section \emph{Verifying your installation} in \code{INSTALL} for
troubleshooting suggestions. Otherwise, you should be ready to proceed.
\subsection{Running the Toy Experiment}
Once \PS is installed, you should make a complete copy of the framework
directory hierarchy, because it is designed to work in place, creating the
models within the hierarchy itself. The philosophy of the framework is that
each experiment is done in a separate copy, where you might do various
customizations depending on what each experiment is intended to test.
For example:
\begin{small}
\begin{alltt}
> \textbf{mkdir experiments}
> \textbf{cd experiments}
> \textbf{cp -pr \$PORTAGE/framework toy.experiment}
> \textbf{cd toy.experiment}
\end{alltt}
\end{small}
Alternatively, you can clone the framework repository directly from GitHub:
\begin{small}
\begin{alltt}
> \textbf{mkdir experiments}
> \textbf{cd experiments}
> \textbf{git clone https://github.com/nrc-cnrc/PortageTrainingFramework.git toy.experiment}
> \textbf{cd toy.experiment}
\end{alltt}
\end{small}
All commands provided in the rest of this document assume they are being run in
the \code{toy.experiment} directory or in a subdirectory thereof. Whenever we
quote a \code{cd} command, we repeat the \code{toy.experiment} directory to
show explicitly where you should end up.
As you work through the example, the commands that you should type\footnote{This
PDF document was generated in such a way that you can copy and paste commands
from here onto the command line of your interactive shell if you wish.}
are shown in bold and preceded by a prompt, \code{>}, and the system's response
is not. System output is not usually fully reproduced here, for brevity's
sake. When it is, results (especially numbers) may vary from the ones
shown, due to platform and random-number generation differences. Note that many results are
truncated in precision for presentation purposes.
Many of the commands are expressed as \code{make} targets. This has the
advantage of requiring less typing, while still allowing you to see the actual
commands executed by the system because they are always echoed by \code{make}.
(You could also type them directly.) \code{make} also lets you skip
sections of this document (except for the first one, since it is done
manually). For example, if you are not interested in any steps before decoder
weight optimization, you can go directly to \S\ref{COW} and type
\code{make tune} to begin at that point. \code{make} will automatically run all
the commands required from previous sections before doing the step you
specifically requested. Here are some other useful \code{make} commands:
\begin{itemize}
\item \code{make all}: run all remaining steps at any point.
\item \code{make clean}: clean up and return the directory to its initial state
\item \code{make -j} \emph{target} or \code{make -j} \emph{N target}: build
\emph{target} by running commands in parallel whenever possible (up to
\emph{N} ways parallel if \emph{N} is specified). This lets you take
advantage of a computing cluster if you have one. If you use a single
multi-core computer, \code{-j} has no effect since most commands in the
framework are internally parallelized instead, as discussed in
\S\ref{FrameworkParams}.
\item \code{make help}: display some help and the main targets available in
the makefile.
\item \code{make summary}: display the resources used by the framework: time
and memory used, as well as disk space for the runtime models (most
informative once training has been completed; discussed further in
\S\ref{FrameworkParams} and \S\ref{ResourceSummary}).
\end{itemize}
\tip\margintip When you run the whole process using \code{make all}, you should
also 1) save the output in a log file, 2) use \code{nohup}, and 3) background
the job. This way, if your terminal is disconnected, your job will continue and
you will not lose your logs:
\begin{small}
\begin{alltt}
> \textbf{nohup make all >& log.make-all &}
\end{alltt}
\end{small}
To follow a job run this way as it is running, you can use \code{tail -f
log.make-all}.\tipend
\tip\margintip You can use \code{disown} to retroactively ``nohup'' a process:
Ctrl-Z and \code{bg} put it in the background; \code{jobs} tells you its job
number, typically \code{1}; \code{disown \%1} disconnects it from the shell,
protecting it from hangup signals.\tipend
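For example, if you had started \code{make all} in the foreground without
\code{nohup}, the recovery sequence might look like this (the job number shown
is only an example; check the output of \code{jobs} in your own shell):
\begin{small}
\begin{alltt}
\emph{[press Ctrl-Z to suspend the foreground job]}
> \textbf{bg}          \emph{# resume it in the background}
> \textbf{jobs}        \emph{# note its job number, e.g., [1]}
> \textbf{disown %1}   \emph{# detach job 1 from the shell}
\end{alltt}
\end{small}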
\subsection{Overview of the Process}
%\TODO{Hide or remove rescoring stuff from here.}
Here is an overview of the \PS process, as described in the following
sections:
\begin{enumerate}
\item Corpus preparation (\S\ref{CorpusPreparation}) includes
alignment,
corpus splitting (\S\ref{Splitting}),
tokenization (\S\ref{Tokenizing}),
and
lowercasing (\S\ref{Lowercasing}).
\item Model training (\S\ref{Training}) includes
language model (\S\ref{LM}),
coarse language model (\S\ref{coarseLM}),
truecasing model (\S\ref{TC}),
translation model (\S\ref{TM}),
hierarchical lexicalized distortion model (\S\ref{LDM}),
sparse model (\S\ref{sparse}),
NNJMs (\S\ref{NNJM}),
mixture models (\S\ref{MIX}),
decoder weight optimization (\S\ref{COW}),
confidence estimation model training (\S\ref{CE}),
and
rescoring model training (\S\ref{RAT}).
\item Translating and testing (\S\ref{TranslatingTesting}) includes
decoding (\S\ref{Decoding}),
confidence estimation (\S\ref{CETrans}),
rescoring (\S\ref{RATTrans}),
truecasing (\S\ref{Truecasing}),
and
testing (\S\ref{Testing}).
\end{enumerate}
These steps are fairly standard, but there are many variants on the
sample process illustrated here, which can be tuned for particular situations.
The \PS user manual (\code{\$PORTAGE/doc/user-manual.html})
has additional information, including a Background section with general
information on statistical machine translation and an annotated bibliography.
However, the technical details in the user manual are outdated in many ways; we
keep this tutorial up to date with each release and try to make it thorough, so
it is the best source of information. For detailed information about any
individual program in \PS, run
the program with the \code{-h} flag (or see \code{\$PORTAGE/doc/usage.html}).
\section{Corpus Preparation} \label{CorpusPreparation}
To begin, let's go over some definitions of text formats:
\begin{description}
\item[Plain text] is just normal text in a flat file without formatting.
\item[Tokenized] text has spaces separating tokens, as in \texttt{I 'm green
, you 're yellow !} Words and punctuation marks are tokens.
\item[One-paragraph-per-line] (OPPL) text is fairly standard since line-wrapping is
typically automatic nowadays.
\item[One-sentence-per-line] (OSPL) text has been segmented into sentences, one per line.
\item[Sentence-aligned] corpora are pairs of files in two languages where
each line in one file is the translation of that same line in the other file.
\item[Truecase] text has normal capitalization of proper nouns, the first word of
the sentence, etc.
\item[Lowercase] text drops all casing information.
\end{description}
Corpus preparation involves converting raw document pairs into tokenized,
sentence-aligned files in the format expected by \PS. We provide a
tokenizer, a sentence aligner and sample corpus pre-processing scripts with
\PS, which can help you with these steps. For details, see section \emph{Text
Processing} in the user manual.
If your data exists in the form of a translation memory or an aligned bitext,
our \code{tmx-prepro} module (see \code{\$PORTAGE/tmx-prepro} or
\url{https://github.com/nrc-cnrc/PortageTMXPrepro}) can help you extract
your data and prepare it for training \PS.
This tutorial starts from sentence-aligned plain text, as you would get from
\code{tmx-prepro} extracting data from a TMX file.
We don't perform data clean up and sentence alignment here,
because they are highly dependent on your actual data and what other tools you
already have. You should plan to invest some time in preprocessing your data
well if you want to obtain good results with \PS.
\subsection{Encoding: UTF-8}
The \PS framework only supports the UTF-8 encoding. If you use a different encoding,
please use \code{iconv} or \code{uconv} to convert your data to UTF-8, and to
convert the \PS output back to the encoding you need.
We previously supported latin1, cp-1252 and GB-2312 (simplified Chinese), but
UTF-8 can be used to represent all text, and its systematic use allows
us to both simplify the framework and make it more robust at the same time.
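For example, if your files were exported in Latin-1, converting them with
\code{iconv} might look like this (the file names here are hypothetical):
\begin{small}
\begin{alltt}
> \textbf{iconv -f ISO-8859-1 -t UTF-8 < corpus_fr.latin1.txt > corpus_fr.txt}
> \textbf{iconv -f UTF-8 -t ISO-8859-1 < output_en.txt > output_en.latin1.txt}
\end{alltt}
\end{small}
The second command shows the reverse conversion you would apply to \PS output if
your downstream tools still expect the original encoding.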
\subsection{Splitting the Corpus} \label{Splitting}
The corpus, which we assume you have sentence aligned, as
discussed above, must be split into separate portions to run
experiments. Distinct, non-overlapping sub-corpora are required for model
training (see \S\ref{Training}), for tuning decoder weights (\S\ref{COW}) and
rescoring weights (\S\ref{RAT}), for confidence estimation (if you use it;
\S\ref{CE}), and for testing (\S\ref{Testing}).\footnote{In this example, we
use separate dev sets for tuning decoder and rescoring weights, but this is not
necessary. However, confidence estimation must absolutely have its own
separate tuning set, which can be reused as a test set, but not as a decoder or
rescoring tuning set.} Typically, the tuning (or ``dev'', for development) and
testing sets contain around 2000 segments each. If the corpus is chronological,
then it is a good idea to choose these sets from the most recent material,
which is likely to resemble future text more closely.\footnote{Our
\code{tmx-prepro} module can automate splitting your corpus if you're
starting from a TMX file. It uses random sampling by default for your dev and
test sets.}
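As a rough illustration only, a chronological split of a hypothetical
sentence-aligned pair \code{full_fr.txt}/\code{full_en.txt} (most recent
material last) can be done with standard GNU tools, reserving the last 2000
segments for testing and the previous 2000 for tuning:
\begin{small}
\begin{alltt}
> \textbf{tail -n 2000 full_fr.txt > test1_fr.raw}
> \textbf{tail -n 2000 full_en.txt > test1_en.raw}
> \textbf{head -n -2000 full_fr.txt | tail -n 2000 > dev1_fr.raw}
> \textbf{head -n -2000 full_en.txt | tail -n 2000 > dev1_en.raw}
> \textbf{head -n -4000 full_fr.txt | gzip > tm-train_fr.raw.gz}
> \textbf{head -n -4000 full_en.txt | gzip > tm-train_en.raw.gz}
\end{alltt}
\end{small}
Use the same line ranges on both sides so the files stay sentence-aligned, and
adapt the naming to the sets you configure in \code{Makefile.params}.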
Ideally, the splitting of a corpus should take into account its structure and
nature, so these steps are not handled by the experimental framework. For the
toy experiment, we provide small data sets that can be found here:
\code{\$PORTAGE/test-suite/tutorial-data}. These sets are ridiculously small,
to minimize running time, so the resulting translations are of poor quality.
To drop these sets into the framework, copy them (or make symbolic
links) into your copy's corpora directory:
\begin{small}
\begin{alltt}
> \textbf{cd toy.experiment}
> \textbf{cp \$PORTAGE/test-suite/tutorial-data/*.raw* corpora/}
> \textbf{wc_stats corpora/*.raw* | expand-auto.pl}
#Lines #Words #Char [...] filename
100 1912 11284 [...] corpora/dev1_en.raw
100 2026 13276 [...] corpora/dev1_fr.raw
100 1912 11176 [...] corpora/dev2_en.raw
100 2116 13474 [...] corpora/dev2_fr.raw
100 2156 12733 [...] corpora/dev3_en.raw
100 2461 15728 [...] corpora/dev3_fr.raw
8896 163417 954012 [...] corpora/lm-train_en.raw.gz
8893 178680 1139422 [...] corpora/lm-train_fr.raw.gz
100 2174 12981 [...] corpora/test1_en.raw
100 2267 15171 [...] corpora/test1_fr.raw
100 2156 12733 [...] corpora/test2_en.raw
100 2461 15728 [...] corpora/test2_fr.raw
8892 163338 953554 [...] corpora/tm-train_en.raw.gz
8892 178677 1139392 [...] corpora/tm-train_fr.raw.gz
36573 705753 4320664 [...] TOTAL
\end{alltt}
\end{small}
In your own experiments, the files you need to copy into \code{corpora}
should be plain text, truecase, sentence-split and sentence-aligned, just like
the ones we provide here. If your corpora are already tokenized, call your
files \code{*.al} and \code{*.al.gz} instead of \code{*.raw} and
\code{*.raw.gz}: the framework will automatically skip tokenization with those
file extensions.
Although the framework does not support compressed dev and test files, the
training files can and should be compressed, as shown here. Most programs and
modules in \PS transparently handle compressed files, compressing and
decompressing them on the fly as needed.
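If your own training files are not compressed yet, you can compress them in
place with \code{gzip} before training (the file names below are hypothetical):
\begin{small}
\begin{alltt}
> \textbf{gzip corpora/my-train_fr.al corpora/my-train_en.al}
\end{alltt}
\end{small}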
If you inspect them, you might notice that the \code{lm-train*} and
\code{tm-train*} files are almost identical, but that the \code{lm-train*} files have
more text. When training the language model, it is a good idea to add
available monolingual material to the target side of the parallel corpus, as we
simulated doing here. In practice, your LM training data might be a lot larger
than your parallel data: everything you have can help improve quality!
\subsection{Setting Framework Parameters} \label{FrameworkParams}
Now you need to edit \code{Makefile.params} to set some global parameters:
\begin{itemize}
\item swap the values of \code{SRC_LANG} and
\code{TGT_LANG}, to select translation from French to English, rather than
the other way around, which is the default;
\item \Changed\footnote{
Starting with \PS 3.0, set your LM training corpus using \url{PRIMARY_LM} to
get the recommended default use of the generic LM whenever possible, or
\url{TRAIN_LM} or \code{MIXLM} if you want to override the default
recommendations.
} \code{PRIMARY_LM} and \code{TRAIN_TM} already have the right values, so
they don't need to be changed;
\item \New \code{TRAIN_SPARSE} already defaults to the first corpus listed in
\code{TRAIN_TM}, in this case \code{tm-train}, so the new sparse models will be
trained on the same corpus as the phrase tables;
\item set \code{TUNE_RESCORE} to \code{dev2} and \code{TUNE_CE} to \code{dev3}
by uncommenting the lines defining them, i.e., by removing the \code{\#} at the
beginning of these lines;
\item \code{TEST_SET} already points to our two test sets, so no change is
needed;
\item select a language modeling option (see below for more info about this
choice): set the \code{LM_TOOLKIT} variable to \code{MIT} or \code{SRI} to
use MITLM or SRILM, respectively.
%\item set \code{DO_RESCORING} to \code{1} by uncommenting the relevant line.
\end{itemize}
For now, keep the default value for all other parameters.
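Putting these settings together, the relevant assignments in
\code{Makefile.params} should end up looking roughly like this (a sketch only;
the exact layout and surrounding comments in the file will differ):
\begin{small}
\begin{alltt}
SRC_LANG = fr
TGT_LANG = en
PRIMARY_LM = lm-train
TRAIN_TM = tm-train
TUNE_RESCORE = dev2
TUNE_CE = dev3
LM_TOOLKIT = MIT
\end{alltt}
\end{small}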
While in \code{Makefile.params}, you should read through the variables in
the \emph{User definable variables} section of the file. This is where most of
the configurable behaviours are set, such as whether to do rescoring and/or
truecasing, which optional models to use, the levels of parallelism, etc.
We recommend you use MITLM (\url{https://github.com/mit-nlp/mitlm}) or SRILM
(\url{http://www.speech.sri.com/projects/srilm/}) as your language modeling
toolkit. Although not recommended, IRSTLM is another supported
option.\footnote{These recommendations are based on our
empirical results: we get similar BLEU scores when using MITLM and SRILM, but
lower ones when using IRSTLM.} See \code{Makefile.params} and type \code{make
help} for more details.
Another set of parameters to look at is the various
\code{PARALLELISM_LEVEL_*} variables. \PS takes
advantage of multi-processor computers and/or multi-node computing clusters,
running tasks in parallel where possible. On a non-clustered computer,
the number of CPUs is the default for all these variables;
explicitly set \code{NCPUS} to restrict the number of CPUs used globally.
On a cluster, the framework uses
\code{qsub} to submit jobs, via the \code{run-parallel.sh} and \code{psub}
scripts, and you can set these variables according to resources available to
you.
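For example, on a single non-clustered machine you could cap the framework at
four CPUs either by setting \code{NCPUS} in \code{Makefile.params} or by
overriding it on the \code{make} command line (a sketch; the value 4 is
arbitrary):
\begin{small}
\begin{alltt}
> \textbf{make all NCPUS=4}
\end{alltt}
\end{small}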
When running this framework, many commands are preceded
by control variables \code{RP_PSUB_OPTS=...} or \code{_LOCAL=1}. These strings
only have an impact on a cluster, and are ignored otherwise.
On a cluster, commands preceded by \code{_LOCAL=1} are inexpensive ones
that get run directly instead of being submitted to the queueing system, while
\code{RP_PSUB_OPTS=...} specifies additional options to \code{psub},
which encapsulates the invocation of \code{qsub}. If your
cluster has specific usage rules or if you require additional parameters to
\code{qsub}, customize \code{psub} itself or add options as
required in this framework.
\tip\margintip Many commands run in this framework are also preceded by
\code{time-mem}, a utility script that measures the time and memory usage of a
command. At any time, type \code{make summary} to get a summary of the
resources used by all components of the framework so far, and \code{make
time-mem} to see which steps are taking the most resources. These reports can
help you determine whether you have enough computing resources to process your
corpora, and what the various choices you can make in \PS cost. \code{make
summary} reports the \code{time-mem} information as well as the space on
disk of the models needed at runtime, such as would be deployed on a
translation server.\tipend
Most commands in the framework produce logs called
\code{log.\emph{output-file-name}}. If you encounter errors, look for
explanations in these log files; that is usually where the error messages
end up.
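For example, after a failed run you can scan every log file in the experiment
directory for error messages with standard tools (a sketch; adjust the pattern
as needed):
\begin{small}
\begin{alltt}
> \textbf{find . -name 'log.*' | xargs grep -li error}
\end{alltt}
\end{small}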
For the sake of brevity, commands quoted in this document
leave out \code{time-mem}, the control
variables mentioned above, and the log files.
\subsection{Tokenization} \label{Tokenizing}
This step is skipped if you're using tokenized data (with \code{.al} and
\code{.al.gz} extensions), but the
framework automatically recognizes that it needs to tokenize the raw text
corpora we provided for this tutorial.
By default, the \PS tokenizer is used (\code{utokenize.pl})\footnote{Look for
and set \code{TOKENIZER\us{}\emph{ll}} in \code{Makefile.params} to choose a
different tokenizer for language \emph{ll}.}, along with a
preprocessing script that separates slash-separated words
(\code{fix-slashes.pl}) because those are rarely meant to stay together.
\begin{small}
\begin{alltt}
> \textbf{cd toy.experiment/corpora}
> \textbf{make tok}
\{ fix-slashes.pl | utokenize.pl -noss -lang=fr; \} < dev1_fr.raw > dev1_fr.al
[...]
parallelize.pl -nolocal -psub -1 -w 100000 -n 16 \bs
"\{ fix-slashes.pl | utokenize.pl -noss -lang=en; \} < tm-train_en.raw.gz > tm-train_en.al.gz"
\end{alltt}
\end{small}
\subsection{Lowercasing and Escaping Special Characters} \label{Lowercasing}
To reduce data sparseness, we convert all files to lowercase.
We keep the lowercase and truecase
versions separate, because we'll use the lowercase version to train language
and translation models, while the truecase version will be used to train a
truecasing model.
\begin{small}
\begin{alltt}
> \textbf{cd toy.experiment/corpora}
> \textbf{make lc}
cat dev1_fr.al | utf8_casemap -c l > dev1_fr.lc
[...]
zcat lm-train_fr.al.gz | utf8_casemap -c l | gzip > lm-train_fr.lc.gz
\end{alltt}
\end{small}
The decoder, \code{canoe}, treats \code{<}, \code{>} and \verb*X\X % *
as special characters to support markup for special translation rules.
We won't use markup in this tutorial, but we must still escape
the special characters in the input files
to \code{canoe}: the source side of the dev and test files.
\begin{small}
\begin{alltt}
> \textbf{make rule}
canoe-escapes.pl -add < dev1_fr.lc > dev1_fr.rule
canoe-escapes.pl -add < dev2_fr.lc > dev2_fr.rule
canoe-escapes.pl -add < dev3_fr.lc > dev3_fr.rule
canoe-escapes.pl -add < test1_fr.lc > test1_fr.rule
canoe-escapes.pl -add < test2_fr.lc > test2_fr.rule
\end{alltt}
\end{small}
\section{Training} \label{Training}
This step creates various models and parameter files that are required for
translation. There are many steps in training. Three are mandatory: creating a
language model (\S\ref{LM}), creating a translation model (\S\ref{TM}), and
optimizing decoder weights (\S\ref{COW}). Several are optional: creating coarse
language models (\S\ref{coarseLM}), creating a truecasing model (\S\ref{TC}),
creating a (possibly hierarchical) lexicalized distortion model (\S\ref{LDM}),
creating a sparse model (\S\ref{sparse}),
creating or fine-tuning an NNJM (\S\ref{NNJM}),
using mixture language models and mixture
translation models for domain adaptation (\S\ref{MIX}), training a confidence
estimation (CE) model (\S\ref{CE}), and training a rescoring model
(\S\ref{RAT}).
\subsection{Creating a Language Model} \label{LM}
\PS does not come with language model training software. However, it
accepts models in the widely-used ``ARPA'' format, which is supported by most
language modelling toolkits. In this tutorial, we assume
you are using MITLM. If you use SRILM, the procedure will be the same, but the
commands that get executed will be different.
By default, we train language models of order five, a
good compromise between translation quality and size of the models. Higher
order language models might sometimes be useful, but only for very large
corpora, and at a cost in space and decoding speed.
In this toy example, we manually added a few sentences to the target language
part of the parallel training corpus to illustrate using more text to train the
language model than the translation model. If you have access to relatively
small amounts of additional monolingual text, adding it to your main LM
training corpus is the simplest option. If you have access to large amounts of
monolingual text, you can use it to train additional
language models or mixture language models. To train separate language
models, drop the corpora into \code{corpora} and list all the corpus stems in
\code{TRAIN\us{}LM}; if an LM is trained externally, add its name to
\code{LM\us{}PRETRAINED\us{}TGT\us{}LMS} instead. But for best results, use
mixture language models (see \S\ref{MIXLM}).
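For instance, a hypothetical \code{Makefile.params} fragment adding a second,
purely monolingual corpus named \code{news-extra} (a made-up name for
illustration) would list both stems:
\begin{small}
\begin{alltt}
TRAIN_LM = lm-train news-extra
\end{alltt}
\end{small}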
\subsubsection{Using a regular LM} \label{regularLM}
Since \PS 3.0, the recommendation and default is to use a MixLM of your
corpus and our generic language model (see \S\ref{LM+generic-default}).
But you can still use a single, regular LM, by setting
\code{TRAIN_LM=lm-train} instead of \code{PRIMARY_LM=lm-train} in
\code{Makefile.params}. We show what happens in that case here for
completeness.
Here is the command used to produce
\code{lm-train_en-kn-5g.tplm},
the main language model:
\begin{small}
\begin{alltt}
> \textbf{cd toy.experiment}
> \textbf{make lm}
make -C models lm
make -C lm all LM_LANG=en
Creating ARPA text format lm-train_en-kn-5g.lm.gz
zcat ../../corpora/lm-train_en.lc.gz \bs
| perl -ple 's/^{\bs}s+//; s/{\bs}s+\$//; s/{\bs}s+/ /g;' | fold --bytes --spaces --width=4095 \bs
| estimate-ngram -order 5 -smoothing ModKN -text /dev/stdin \bs
-opt-perp ../../corpora/dev1_en.lc \bs
-write-lm lm-train_en-kn-5g.lm.gz
arpalm2binlm lm-train_en-kn-5g.lm.gz lm-train_en-kn-5g.binlm.gz
arpalm2tplm.sh lm-train_en-kn-5g.lm.gz lm-train_en-kn-5g
\end{alltt}
\end{small}
The main command executed, MITLM's \code{estimate-ngram}, produces the language
model in ``ARPA'' format. The \code{perl} filter that precedes it
removes extraneous whitespace, because \code{estimate-ngram} is picky about its
input. The \code{fold} command wraps lines that are longer than 4KB, to avoid
breaking a word (or worse, a UTF-8 character) apart at MITLM's internal buffer
boundary. The \code{arpalm2binlm} and \code{arpalm2tplm.sh} commands
convert the model into the \PS binary language model and
tightly-packed language model formats, respectively, for fast loading and use.
Since version 3.0 of \PS, we take advantage of MITLM's ability to tune the LM
metaparameters on our dev set, yielding slightly better language models. The
tuning set is specified to \code{estimate-ngram} using \code{-opt-perp}, and is
configured in the framework via the \code{TUNE_LM} parameter in
\code{Makefile.params}.
Before continuing the tutorial, don't forget to revert your
\code{Makefile.params} to have \code{PRIMARY_LM=lm-train} and \code{TRAIN_LM}
undefined.
\subsubsection{In-domain LM plus the generic LM} \label{LM+generic-default}
Starting with \PS 3.0, we strongly recommend that you use the generic LM from
Portage Generic Model 2.0 (or more recent) to improve the quality of your translations.
Without introducing out-of-domain vocabulary into your translations, this model
helps the decoder choose the best way to use your in-domain phrase
table.
With \code{PRIMARY_LM=lm-train} defined, as is now the default, you can issue
this command to train a mixture language model (MixLM) combining your
in-domain LM and the generic one:
\begin{small}
\begin{alltt}
> \textbf{cd toy.experiment}
> \textbf{make mixlm}
make[2]: Entering directory `.../toy.experiment/models/mixlm'
[commands to make lm-train_fr-kn-5g.lm.gz and .tplm]
echo "`basename models/mixlm/lm-train_fr*.tplm`" "generic-2.0_fr.tplm" \bs
| tr " " "{\bs}n" > components_fr
mx-calc-distances.sh -v em components_fr ../../corpora/dev1_fr.lc > dev1.distances
mx-dist2weights -v normalize dev1.distances > dev1.weights
[commands to make lm-train_en-kn-5g.lm.gz and .tplm]
echo "`basename models/mixlm/lm-train_en*.tplm`" "generic-2.0_en.tplm" \bs
| tr " " "{\bs}n" > components_en
mx-mix-models.sh mixlm dev1.weights components_en ../../corpora/dev1_fr.lc \bs
> dev1.mixlm
\end{alltt}
\end{small}
The above commands are explained in more detail in \S\ref{MIXLM}, so we won't
describe what they do here. The important point here is that the mixture LM is
created by combining the primary LM created from your in-domain training data
with NRC's generic model created on 43 million sentence pairs. Because of your
in-domain data, this model is good at recognizing text that sounds right for
your domain. Because of the very large data set used to train the generic model,
this model is good at recognizing text that sounds right in English (or French)
in general. The combination helps the decoder choose better translations than
either component model could on its own.
\subsection{Creating a Coarse Language Model} \label{coarseLM}
In \PS 3.0, we added the capacity to train and use coarse language models:
language models that are trained and queried not on sequences of words, but
rather on sequences of classes of words. These models have a useful capacity for
abstraction: similar words can be expected to occur in
similar contexts.\footnote{See Stewart et al.\ (AMTA 2014) and the annotated
bibliography included with the \PS user manual for more details on coarse LMs.}
Empirically, we get the best results when we combine a coarse LM trained
on 200 word classes with one trained on 800 classes. These models capture
abstractions at different levels of granularity. With only 200 classes, the
first one is modelling something that probably resembles part-of-speech tag
sequences, while 800 classes gives the second model a chance to capture some
semantic categories as well. As for the order of these models, coarse models do
not suffer from the data sparsity that affects regular LMs, so we train 8-gram
coarse LMs, whereas we only go up to 5-grams for the regular LMs.
Our recommendation, and the framework default, is to train two 8-gram coarse
LMs, one with 200 classes and one with 800 classes.
Before training the coarse LMs themselves, we need to learn word classes on the
target language vocabulary. We use Google's free, open source \code{word2vec}
tool to learn these classes from the corpora.
\begin{small}
\begin{alltt}
> \textbf{cd toy.experiment}
> \textbf{make wcl}
make[2]: Entering directory `.../toy.experiment/models/wcl'
zcat -f ../../corpora/lm-train_fr.lc.gz ../../corpora/tm-train_fr.lc.gz > all-200_fr.lc
word2vec -cbow 0 -size 100 -window 1 -negative 0 -hs 1 -sample 0 -threads 1 \bs
-min-count 1 -classes 200 -output fr.200.classes -train all-200_fr.lc
sed -i -e 's/ /{\bs}t/' fr.200.classes
rm -r all-200_fr.lc
[commands to train en.200.classes, fr.800.classes and en.800.classes]
\end{alltt}
\end{small}
For each class file to create, we first concatenate all the input files into a temporary
uncompressed file, as required by \code{word2vec}.\footnote{For coarse
LM classes, we should only need \code{lm-train\us{}en.lc.gz}, but we instead
use all target-language corpora because we want these classes to cover the
target-side vocabulary of the training files used for any model. Although not
needed for coarse LMs, classes are also trained on the source side because other
models need them.} Then we call \code{word2vec} with carefully tuned parameters.
Finally, the resulting \code{.classes} files are
reformatted to respect the standard word-tab-class format on each line, and the
temporary input file is deleted.
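For reference, each line of a \code{.classes} file pairs one word with its
class number, separated by a tab; a few hypothetical lines (made-up words and
class numbers) would look like this:
\begin{small}
\begin{alltt}
house    17
blue     42
car      17
london   103
\end{alltt}
\end{small}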
Now that we have our word-class files, we can train the LMs themselves.
Breaking with the convention in the rest of this document, we'll
interleave the commands executed by \code{make} with our comments about them, to
improve readability.
\begin{small}
\begin{alltt}
> \textbf{cd toy.experiment}
> \textbf{make coarselm}
make[2]: Entering directory `.../toy.experiment/models/coarselm'
word2class -no-error ../../corpora/lm-train_en.lc.gz ../wcl/en.200.classes | \bs
gzip > lm-train_en-200.lc.gz
182281 of 182281 words mapped to word classes.
word2class -no-error ../../corpora/dev1_en.lc ../wcl/en.200.classes > dev1-200_en.lc
57 word mapping errors.
2058 of 2115 words mapped to word classes.
Warning: Output contains 57 word class mapping errors (<unk>).
[...]
\end{alltt}
\end{small}
Using \code{word2class}, we create a copy of the input training corpus with
all the words replaced by their word classes. A message tells us that all
182281 words were mapped to their classes. Since we're using the dev set for
metaparameter tuning with MITLM, we also need to apply the same mapping to the dev
set, but now we see 57 mapping errors. These are due to words in the dev set that
do not appear in the training corpus, which is perfectly normal: we want to train \PS and
all its models to deal with unseen words.
Then the command we launched above proceeds to use our chosen LM toolkit to
train a language model on this sequence of classes, instead of the usual
sequence of tokens:
\begin{small}
\begin{alltt}
Creating ARPA text format lm-train_en-200-ukn-8g.lm.gz
zcat lm-train_en-200.lc.gz \bs
| perl -ple 's/^{\bs}s+//; s/{\bs}s+\$//; s/{\bs}s+/ /g;' | fold --bytes --spaces --width=4095 \bs
| estimate-ngram -order 8 -smoothing KN -text /dev/stdin \bs
-opt-perp dev1-200_en.lc -write-lm lm-train_en-200-ukn-8g.lm.gz
arpalm2binlm lm-train_en-200-ukn-8g.lm.gz lm-train_en-200-ukn-8g.binlm.gz
arpalm2tplm.sh lm-train_en-200-ukn-8g.lm.gz lm-train_en-200-ukn-8g
\end{alltt}
\end{small}
The whole sequence above, starting at \code{word2class}, is repeated with 800
classes, since we train both a 200-class and an 800-class coarse LM.
When the decoder uses a coarse model, it applies the same mapping from tokens
to classes before querying the LM, giving us a probability for class
sequences. This helps with sequences that were not seen in training, but where
words are somehow similar to those in sequences that were seen. For example,
maybe ``the blue car'' was never seen in training, but ``the \emph{colour}
car'' was seen for several colours which ended up in the same class: the coarse
LM will consider all these variants equally likely if they map to the same
class sequence.
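If you are curious about what the decoder actually sees, you can map a sample
sentence through the same class file used above, from the
\code{models/coarselm} directory, using the same \code{word2class} invocation
pattern shown earlier (the class numbers in the output will vary with your
trained classes):
\begin{small}
\begin{alltt}
> \textbf{echo "the blue car" > sample_en.lc}
> \textbf{word2class -no-error sample_en.lc ../wcl/en.200.classes}
\end{alltt}
\end{small}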
\subsection{Creating a Truecasing Model} \label{TC}
Decoding is done in lowercase to reduce data sparseness, but at the end of the
translation process, you will need to restore proper mixed case to
your output. We call this step truecasing. To train a truecasing model, you
need both a lowercased version and the original ``truecase'' version of the
training corpus in the target language.
The basic truecasing model consists of two different models: a ``casemap'',
which maps each lower case word to its possible truecase variants, as observed
in the training corpus, and a standard language model trained on the truecase
corpus.
The improved version of truecasing carries casing information from the source
sentence to the target sentence, including the casing for the first word in the
sentence, the casing of out-of-vocabulary words (OOVs), and unusual casing for
sequences of several words (such as all caps).
Truecasing with source information is the default.\footnote{If you want to
use basic truecasing, comment out the line \code{TC\_USE\_SRC\_MODELS=1} in the
advanced configuration section of \code{Makefile.params}.} It requires three
language models: source-side and target-side
case-normalizing\footnote{``Normalized case'' means that the first character of
the sentence is in lowercase unless the word inherently requires the upper
case. E.g., ``his flat is in London.'' and ``London is on the Thames.'' are in
normalized case. The acronym ``nc1'' stands for
``normalized-cased first word'', i.e., normalized case.} language models, as
well as the main truecasing LM. The target-side case-normalizing LM is used
only for generating a normalized-case target-side training corpus; it is not
used in translation. The main truecasing LM and the ``casemap'' are trained on
the normalized-case target-side training corpus. Together, they are used to
determine the case for the output of the decoder. Then case patterns observed
in the source sentence, including the first word, are transferred to the
truecased output of the decoder.
%\TODO{Think about whether this whole section should be simplified with only the
%final model files listed, but not the commands used.}
Running \code{make tc} at the root of the framework will build all necessary
files and models. As with coarse LMs, we'll
interleave the commands executed by \code{make} with our comments about them, to
improve readability.
\begin{small}
\begin{alltt}
> \textbf{cd toy.experiment}
> \textbf{make tc}
make[2]: Entering directory `.../toy.experiment/models/tc'
zcat -f ../../corpora/lm-train_en.tc.gz \bs
| perl -ne '\emph{set UTF-8 encoding} s/^[^[:lower:]]+\$/{\bs}n/; print \$_ unless /^\$/;' \bs
| utokenize.pl -pretok -paraline -ss -p -lang=en \bs
| gzip > lm-train_en.tokss.gz
\end{alltt}
\end{small}
We first generate \code{lm-train_en.tokss.gz}, a variant of the target-side
training corpus with all-caps sentences (which are misleading for training
truecasing models) removed and sentence splitting re-done. Sentence splitting
is re-applied because corpus text from some sources such as TMX files may
contain lines with multiple sentences, and beginning-of-sentence detection is
important for case normalization.
Second, we generate \code{lm-train_en.revtokss.gz}, an inversion of the
target-side training corpus needed to train the target-side case-normalizing
LM:
\begin{small}
\begin{alltt}
zcat -f lm-train_en.tokss.gz | filter-nc1.py -enc UTF-8 \bs
| reverse.pl | gzip > lm-train_en.revtokss.gz
\end{alltt}
\end{small}
Third, we train a language model on this reversed corpus, producing
\code{lm-train_en.nc1.binlm.gz}, the target-side case-normalizing language model.
The actual commands used depend on your LM toolkit:
\begin{small}
\begin{alltt}
Creating ARPA text format lm-train_en.nc1.lm.gz
[commands to train case-normalizing target-language LM lm-train_en.nc1.lm.gz]
arpalm2binlm lm-train_en.nc1.lm.gz lm-train_en.nc1.binlm.gz
\end{alltt}
\end{small}
Next, the \code{normc1} program uses the case-normalizing LM to produce
\code{lm-train_en.nc1.gz}, the nor\-malized-case target-side training corpus:
\begin{small}
\begin{alltt}
normc1 -ignore 1 -extended -notitle -loc en_CA.UTF-8 lm-train_en.nc1.binlm.gz \bs
lm-train_en.tokss.gz \bs
| perl -pe 's/(.)$/$1 /; s/(.){\bs}n/\$1/' | gzip > lm-train_en.nc1.gz
\end{alltt}
\end{small}
Here we generate the casemap for the main target-side truecasing model,
\code{lm-train_en.map}, using \code{compile_truecase_map}, which compiles the
casemap by processing the normalized-case and lowercase versions of the corpus
simultaneously, and recording, for each lower case word, all the cased variants
found in the normalized-case file, along with their distribution:
\begin{small}
\begin{alltt}
zcat -f lm-train_en.nc1.gz | utf8_casemap -c l \bs
| compile_truecase_map lm-train_en.nc1.gz - > lm-train_en.map
\end{alltt}
\end{small}
Then we produce \code{lm-train_en-kn-3g.binlm.gz}, the main target-side
truecasing LM, trained on the normalized-case corpus:
\begin{small}
\begin{alltt}
Creating ARPA text format lm-train_en-kn-3g.lm.gz
[commands to train truecasing LM lm-train_en-kn-3g.lm.gz]
arpalm2binlm lm-train_en-kn-3g.lm.gz lm-train_en-kn-3g.binlm.gz
\end{alltt}
\end{small}
The final block of commands does the same thing as the first block, this time to
produce the source-side case-normalizing language model needed by the new
truecasing workflow, \code{lm-train_fr.nc1.binlm.gz}:
\begin{small}
\begin{alltt}
zcat -f ../../corpora/lm-train_fr.al.gz \bs
| utokenize.pl -pretok -paraline -ss -lang=fr \bs
| perl -pe '\emph{set UTF-8 encoding} s/^[^[:lower:]]+(\$|( : ))//;' \bs
| gzip > lm-train_fr.tokss.gz
zcat -f lm-train_fr.tokss.gz | filter-nc1.py -enc UTF-8 \bs
| reverse.pl | gzip > lm-train_fr.revtokss.gz
Creating ARPA text format lm-train_fr.nc1.lm.gz
[commands to train case-normalizing source-language LM lm-train_fr.nc1.lm.gz]
arpalm2binlm lm-train_fr.nc1.lm.gz lm-train_fr.nc1.binlm.gz
\end{alltt}
\end{small}
\subsection{Creating a Translation Model} \label{TM}
Creating a translation model involves two main steps: 1) training IBM2, HMM and
IBM4 word alignment models in both directions, then 2) using them to
extract phrase pairs from the corpus.
There are many ways to combine the counts obtained from different alignments.
Through ongoing experimentation, our recommendations have changed in the past
and will likely change again; the best setup depends on your data and
resources, but we try to maintain a good general setup as the default in the
framework.
Here we illustrate the method we currently recommend: merge the counts from all
the alignment methods to estimate the main probability feature, using one
alignment method to estimate a lexical smoothing feature.
You can do all this by typing \code{make tm} in your \code{toy.experiment}
directory, but we will break it down into several steps.
By default, \code{make tm} will train IBM2 (\S\ref{IBM2}), HMM (\S\ref{HMM}),
and IBM4 (\S\ref{IBM4}) word-alignment models, and tally their counts
together into the final phrase table (\S\ref{CPT}).
In \S\ref{PI}, we'll show how you can also produce alignment indicator features
telling the system which aligner produced which phrase pairs. The alignment
indicator features can be helpful because the different alignments make different
kinds of errors in the phrase pairs they produce; the diversity of phrase
pairs obtained from two or three separate alignment methods helps the system,
while the indicator features allow the system to learn to give more weight to
alignments suggested by the more reliable alignment method. We don't
explicitly estimate how reliable each method is; instead, we let decoder tuning
learn the indicator feature weights (see \S\ref{COW}).
Later, we'll look at interpolating probability estimates
coming from various corpora, in a way that is adapted to your in-domain
material (\S\ref{MIXTM}).
\subsubsection{Creating a Translation Model Using IBM2 Alignments} \label{IBM2}
%\TODO{Think about whether I should reorganize this code around the directory
%structure instead of the alighment model: a IBM section, a WAL section, a JPT
%section, and then the TM section. This would probably be easier to follow.}
\subsubsection*{Training IBM2 Models}
First we train IBM2 word alignment models, which requires training IBM1 models
as a prerequisite. We do this for both directions in \code{models/ibm/}.
\begin{small}
\begin{alltt}
> \textbf{cd toy.experiment/models/ibm}
> \textbf{make ibm2_model}
cat.sh -n 4 -pn 4 -v -n1 5 -n2 0 -bin ibm1.tm-train.en_given_fr.gz \bs
../../corpora/tm-train_fr.lc.gz ../../corpora/tm-train_en.lc.gz
cat.sh -n 4 -pn 4 -v -n1 0 -n2 5 -slen 20 -tlen 20 -bksize 20 \bs
-bin -i ibm1.tm-train.en_given_fr.gz ibm2.tm-train.en_given_fr.gz \bs
../../corpora/tm-train_fr.lc.gz ../../corpora/tm-train_en.lc.gz
cat.sh -n 4 -pn 4 -v -r -n1 5 -n2 0 -bin ibm1.tm-train.fr_given_en.gz \bs
../../corpora/tm-train_fr.lc.gz ../../corpora/tm-train_en.lc.gz
cat.sh -n 4 -pn 4 -v -r -n1 0 -n2 5 -slen 20 -tlen 20 -bksize 20 \bs
-bin -i ibm1.tm-train.fr_given_en.gz ibm2.tm-train.fr_given_en.gz \bs
../../corpora/tm-train_fr.lc.gz ../../corpora/tm-train_en.lc.gz
\end{alltt}