blog — Notes on language, machine learning, etc.
http://blog.jacobandreas.net/
Thu, 09 Jul 2020 01:10:29 +0000

<h1>Meanings and belief states</h1>
<p>Should AI research be concerned with explicit representations of the meanings of
utterances? By “explicit representations of meaning” I mean structured variables
with a pre-defined interpretation—the kind of thing that semanticists are
concerned with. For a long time, such meaning representations were central to
successful efforts aimed at linking language to other tasks involving reasoning,
perception, and action (a line of work that runs from SHRDLU to modern semantic
parsers).
<span class="aside"> Structured meaning representations were also
central to <em>unsuccessful</em> work on machine translation, syntax, etc.</span>
This work uses lots of different representation formalisms—proper
neo-Davidsonian logical forms
<a href="http://aclweb.org/anthology/Q/Q13/Q13-1005.pdf">[AZ13]</a>,
combinator logics
<a href="http://www.aclweb.org/anthology/P11-1060">[LJK11]</a>,
other non-logical structures
<a href="https://people.csail.mit.edu/stefie10/publications/tellex11a.pdf">[TK+11]</a>—but
if you squint they’re all basically predicate-argument structures implementing a
model-theoretic semantics, with perhaps a few free parameters in the bodies of
the predicates.</p>
<p>These kinds of approaches seem to be disappearing. Now that everything is
end-to-end all the time, it’s rare to see models with explicit latent variables
taking values in hand-designed logical languages. Utterances come into our
models, behaviors come out, and we don’t worry too much about the structure of
the computation that gets performed in the middle.<br />
<span class="aside">There are still certain kinds of generalization and inductive bias we used to
get for free with the old models that we haven’t totally figured out how to
recreate. The success of hybrid approaches like structured regularizers
<a href="http://arxiv.org/abs/1706.05064">[OS+17]</a> and our NMN work
<a href="http://arxiv.org/abs/1601.01705">[AR+16]</a> suggests we’ll get there eventually.</span>
This is, with some qualifications, a good thing: in more formal approaches,
tight coupling between the machine learning and the representation means that
there’s always a risk that some new semantic phenomenon shows up in the data and
suddenly your model is useless. Sufficiently generic machinery for
learning (non-logical) representations makes this a little less scary.</p>
<p>But the attitude of the end-to-end world seems to be that since we’re no longer
doing logical inference, there’s no need to think about meaning at all. Suddenly
everyone loves to cite Wittgenstein to argue that we should be
evaluating “language understanding” in terms of success on downstream tasks
rather than predicting the right logical forms
<a href="https://arxiv.org/abs/1606.02447">[WLM16]</a>
<a href="https://arxiv.org/abs/1610.03585">[GM16]</a>
<a href="https://arxiv.org/abs/1612.07182">[LPB16]</a>—which is great!—but underlying
this seems to be a philosophy that “meaning is use, so if we can predict use with
high accuracy we’ve understood everything we need to about meaning”.
<span class="aside">
I’ve never understood this to be the claim in <em>Philosophical
Investigations</em>—even if use (rather than reference) is the primary thing we
should be trying to explain, <em>PI</em> is very interested in the kinds of <del>mental
representations</del> representations of the processes (?) in virtue of which
language use is possible.
</span>
Particularly given that we have not actually solved “use”, I think machine
learning has both lots to learn and lots to say about the meaning side of the
equation as well.</p>
<p>In this post I want to motivate the use of explicit representations of belief
states of the form <script type="math/tex">p(\textsf{world state} \mid \textsf{utterance})</script> <em>as</em>
representations of meaning suitable for “unstructured” machine learning models.
These kinds of representations arise naturally in the sorts of decision-making
tasks the community is excited about these days, but they also look a lot
like classical representational theories in linguistics. The synthesis suggests
ways of both training and interpreting language processing models.</p>
<h2 id="belief-states-and-intensions">Belief states and intensions</h2>
<p>Consider the problem of trying to act in a partially observed world where people
talk to you in order to help reduce your uncertainty. How should you choose the
best possible action to take? Given a single utterance <script type="math/tex">w</script>, and possible true
states of the world <script type="math/tex">x</script>, the min Bayes risk action is</p>
<script type="math/tex; mode=display">\arg\min_a \int_x p(x \mid w) \; R(a;\,x)</script>
<p>for some risk function <script type="math/tex">R</script>. Any listener who hopes to succeed in the world
needs to do a good job of at least approximately solving this optimization
problem, and in practice the listener will probably need to represent the
distribution <script type="math/tex">p(x \mid w)</script>, at least implicitly. In POMDP-land we call <script type="math/tex">p(x
\mid w)</script> a <em>belief state</em>; for a given <script type="math/tex">w</script>, it’s a function that maps from
possible worlds <script type="math/tex">x</script> to scalar plausibility judgments—how likely is it that
<script type="math/tex">x</script> is the true world given that we observed someone saying <script type="math/tex">w</script> about it?</p>
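<p>With a finite set of possible worlds, the belief state is just a vector and the min Bayes risk computation is a small matrix calculation. A toy sketch (all the numbers and the risk table are invented for illustration):</p>

```python
import numpy as np

# Toy sketch: min Bayes risk action selection given a belief state
# p(x | w) over three possible worlds. All numbers are made up.
belief = np.array([0.6, 0.3, 0.1])   # p(x | w) for worlds x = 0, 1, 2
risk = np.array([                    # risk[a, x] = R(a; x)
    [0.0, 1.0, 1.0],                 # action 0: safe only in world 0
    [1.0, 0.0, 1.0],                 # action 1: safe only in world 1
    [0.5, 0.5, 0.5],                 # action 2: hedge everywhere
])

expected_risk = risk @ belief        # the integral becomes a sum over worlds
best_action = int(expected_risk.argmin())   # here, action 0
```

<p>Note that the hedging action wins only when the belief state is diffuse; a confident <script type="math/tex">p(x \mid w)</script> licenses the riskier world-specific action.</p>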
<p>Compare this to the notion of an <em>intension</em> in Montague semantics: “a function
from possible worlds and moments of time to truth values”
<a href="https://plato.stanford.edu/entries/montague-semantics/#IntTau">[J11]</a>. Most
(model-theoretic) semantics programs represent intensions using logical
expressions (rather than e.g. tabularly). But a logical form is just one way of
expressing a function of the right type; at the end of the day, an “explicit
representation of meaning” to the Montagovian tradition is precisely an
intension—that is, something that looks like a discretized version of our
<script type="math/tex">p(x \mid w)</script>.</p>
<p><em>A belief state is an intension with probabilities</em>. Intensional representations
of meaning are useful not just because they help us solve linguistic
problems, but also because they approximate a quantity that we know helps
language users <em>do</em> useful things with the information they’ve acquired from
language. From the other side, what the POMDP tells us we have to compute
upon hearing an utterance is approximately the thing linguists have been telling
us to compute all along.
<span class="aside">
Or almost the thing linguists have been telling us about—what would be even
better than a black box for answering <script type="math/tex">p(x \mid w)</script> queries would be something
with a little structure, maybe some kind of factorized representation that let
us find the MBR action efficiently by making it possible to inspect the set of
properties that all plausible worlds have in common. Perhaps a product of
assertions about individuals, their properties, and their relations…. If
logical semantics didn’t exist we would have had to invent it.
</span></p>
<p><script type="math/tex">p(x \mid w)</script> qua “meaning” should be precisely understood as a <em>listener
meaning</em>: an accurate belief state already accounts for Gricean
speaker-meaning-type effects (e.g. implicatures) and also further inferences the
speaker may <em>not</em> have wanted the listener to draw (e.g. the possibility that
<script type="math/tex">w</script> is a lie). Our story here doesn’t care where <script type="math/tex">p(x \mid w)</script> comes from,
so it might be computed via something like RSA
<a href="http://science.sciencemag.org/content/336/6084/998">[FG12]</a> with a distinct
notion of sentence meaning embedded inside.</p>
<p>One last adjustment: real-world listeners don’t start with a <em>tabula rasa</em>:
every utterance is interpreted in the context of an existing belief state
<script type="math/tex">p(x)</script>, and we really want to think of the meaning of a sentence as an <em>update
function</em>; i.e. <script type="math/tex">p(x) \mapsto p(x \mid w)</script> rather than just <script type="math/tex">p(x \mid w)</script>.
For sentences of the “Pat loves Lou” variety I think this update is basically
always conjunctive; i.e. <script type="math/tex">p(x) \mapsto (1/Z) \cdot p(x) \cdot p(x \mid w)</script>.
But the general version is necessary for dealing with indexicals and
<a href="translation-meaning.html">Quine’s problems</a> with the denotation of <em>bachelor</em>.</p>
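<p>In the finite case the conjunctive update is a one-liner; a minimal sketch (the prior and the utterance's belief state below are made up):</p>

```python
import numpy as np

def conjunctive_update(prior, meaning):
    """Apply the conjunctive update p(x) -> (1/Z) p(x) * p(x | w)
    to a discrete belief state over a finite set of worlds."""
    posterior = prior * meaning
    return posterior / posterior.sum()

# Made-up example: a flat prior over four worlds, and an utterance
# whose belief state rules out the last two.
prior = np.array([0.25, 0.25, 0.25, 0.25])
meaning = np.array([0.5, 0.5, 0.0, 0.0])
updated = conjunctive_update(prior, meaning)   # -> [0.5, 0.5, 0.0, 0.0]
```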
<h2 id="practical-implications">Practical implications</h2>
<p>All very nice, but we led by noting that explicit denotational meaning
representations (logical, probabilistic, or otherwise) don’t actually show up in
the kinds of models that work well in practice. So why does any of this matter?</p>
<p>For language understanding systems to work well, they must be choosing something
close to the min Bayes risk action. Hand-wavingly: a suffix of a deep network is
a function from input representations to output actions via a fixed circuit; if
this suffix can pick a good action for every input representation it’s
implementing something like an MBR decoding algorithm (though perhaps
approximate and specialized to the empirical distribution over representations);
whatever representation of language-in-context is presented to this part of the
network must then be sufficient to solve the optimization problem, and so must be
something like a representation of <script type="math/tex">p(x \mid w)</script>.</p>
<p>This is not a great argument: there may in fact be no
clear distinction between the “sentence representation” and “optimization” parts
of the model. But in practice we do see meaning-like sentence representations
emerge (especially in models where the sentence representation is computed
<em>independent</em> of whatever initial information the listener has about the state
of the world
<a href="https://distill.pub/2018/feature-wise-transformations/">[DP+18]</a>).
When using specialized optimization modules within larger networks
<a href="https://arxiv.org/abs/1602.02867">[TW+17]</a>
<a href="https://arxiv.org/abs/1805.02777">[LFK18]</a>
we can be sure of the distinction.</p>
<p>In any case, the knowledge that some intermediate representation in our model is
(or should be) decodable into a distribution over world states gives us two
tools:</p>
<p><strong>Interpretability:</strong> test whether representations are capturing the right
semantics (or identify what weird irregularities they’re latching onto) by
estimating <script type="math/tex">p(x \mid \textrm{rep}(w))</script>, where <script type="math/tex">\textrm{rep}(w)</script> is the
model’s learned representation of the utterance <script type="math/tex">w</script>. Determine whether this
corresponds to the real (i.e. human listener’s) denotation of <script type="math/tex">w</script>. We got a
bunch of mileage out of this technique in our neuralese papers
<a href="https://arxiv.org/abs/1704.06960">[ADK17]</a>
<a href="https://arxiv.org/abs/1707.08139">[AK17]</a> and other students in the group have
been using it recently to analyze pretraining schemes for instruction-following
models. But in some ways it’s even more natural to apply it to the learning of
representations of natural language itself rather than a learned space of
messages / abstract actions.</p>
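<p>As a hedged sketch of what such a probe might look like, with synthetic vectors standing in for a real model's <script type="math/tex">\textrm{rep}(w)</script> (in practice the probe would be fit to a trained model's representations and human-annotated denotations):</p>

```python
import numpy as np

# Hypothetical probing sketch: test whether utterance representations
# rep(w) are decodable into world states by fitting a simple probe and
# checking its held-out accuracy. The "representations" below are
# synthetic stand-ins (class centroid plus noise), not a real model's.
rng = np.random.default_rng(0)
n, d, n_worlds = 200, 16, 3
worlds = rng.integers(0, n_worlds, size=n)       # true world per utterance
centroids = rng.normal(size=(n_worlds, d))
reps = centroids[worlds] + 0.1 * rng.normal(size=(n, d))

# Nearest-centroid probe: fit on the first half, evaluate on the second.
train, held_out = slice(0, 100), slice(100, None)
fit = np.stack([reps[train][worlds[train] == k].mean(0)
                for k in range(n_worlds)])
dists = ((reps[held_out][:, None, :] - fit) ** 2).sum(-1)
probe_acc = (dists.argmin(1) == worlds[held_out]).mean()
# High held-out accuracy suggests the representation encodes the denotation.
```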
<p><strong>Auxiliary objectives:</strong> the normal objective for an instruction following / QA
problem is <script type="math/tex">p(\textsf{action} \mid \textsf{utterance}, \textsf{listener
obs})</script>. But if overfitting is an issue, it’s easy enough to tack on an extra
term of the form <script type="math/tex">p(\textsf{speaker obs, listener obs} \mid
\textsf{utterance})</script> if it’s available. For some problems (e.g. GeoQuery-style
semantic parsing) there isn’t a meaningful distinction between “speaker
observation” and “action”; for others it looks like a totally different learning
problem. For referring expression games the denotational auxiliary problem is
“generate / retrieve pairs of images for which this would be a discriminative
caption”; for instruction following models it’s “generate the goal state (but
not necessarily the actions that get me there)”.</p>
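<p>In training code this just adds a weighted term to the objective; a minimal sketch, where both log-probabilities are placeholders produced by other parts of the model and the weight is a hypothetical hyperparameter:</p>

```python
def total_loss(action_logp, denotation_logp, aux_weight=0.5):
    """Hedged sketch of the combined objective:
    -log p(action | utterance, listener obs)
      - aux_weight * log p(speaker obs, listener obs | utterance)."""
    return -action_logp - aux_weight * denotation_logp

loss = total_loss(action_logp=-0.5, denotation_logp=-2.0)   # -> 1.5
```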
<h2 id="conclusions">Conclusions</h2>
<p>Thinking about POMDP-style solutions to language games results in an account of
meaning that looks suspiciously like model-theoretic semantics. This analogy
provides tools for interpreting learned models and suggests auxiliary objectives
for improving their accuracy.</p>
Fri, 27 Jul 2018 00:00:00 +0000
http://blog.jacobandreas.net/meaning-belief.html

<h1>Fake language: two perspectives</h1>
<p>Is there a role for fake language datasets in the AI ecosystem? (By “fake
language” I mean things like
<a href="https://research.fb.com/downloads/babi/">bAbI</a>,
<a href="http://cs.stanford.edu/people/jcjohns/clevr/">CLEVR</a>,
Karthik’s <a href="http://people.csail.mit.edu//karthikn/assets/pdf/mud-play15.pdf">“Home” world</a>
[pdf link], and
<a href="https://arxiv.org/abs/1710.09867">this DeepMind paper</a>.)
Claims that various learning architectures can “do language processing” based on
results from these datasets have caused a lot of hand-wringing in the NLP
community. While a lot of this is due to good old-fashioned overclaiming, I’ve
become persuaded some of it is miscommunication between two groups that mean
totally different things by “language data”.</p>
<p>I’m going to focus here on instruction following, but I think there’s a similar
story to tell about lots of other grounded tasks like question answering,
generation, etc.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> For a long time, an AI researcher’s view of an instruction
following problem was something like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Language -> Abstraction -> Behavior
</code></pre></div></div>
<p>That is, we start with whatever utterances people are generating, map them to
some kind of clean, structured representation, and then make decisions about how
to act on the basis of that structure. Because this whole pipeline was too hard
to tackle all at once, the community mostly started working on it from different
ends.<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></p>
<hr />
<p>“Language people” worked on this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Language -> Abstraction
</code></pre></div></div>
<p>In this picture, language comes to you from the outside world—you don’t
control the distribution. You get to design the language of abstractions, but it
had better be able to handle (or at least fail gracefully on) whatever
utterances the world throws at you. Linguists give us a nice abstraction
formalism in the form of logic, and tell us that the way to get from abstraction to
behavior is just logical interpretation. So it’s very easy to say “abstraction =
formal semantics” and treat the <code class="language-plaintext highlighter-rouge">Abstraction -> Behavior</code> edge as someone else’s
problem.</p>
<p>Data is <em>collected</em> from human speakers who don’t necessarily know anything
about logical forms. Indeed, decisions about details of the logical language are
typically worked out after collecting initial annotations. What distinguishes
“language data” from other kinds of data is precisely the fact that it was
generated by human users. (If we’re generating data from a fake grammar and
mapping it onto logical forms, we generally haven’t learned anything about
language that we didn’t write down in the first place.)</p>
<hr />
<p>“Policy people” (broadly understood to include everything from RL to planning
to classical control) worked on this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Abstraction -> Behavior
</code></pre></div></div>
<p>In this picture, the scope of acceptable abstractions is up to the system
designer—it’s in behavior where details of the real world (physics, etc.)
intervene. Abstraction languages range from “do one of these 10 specific things”
to “satisfy this STRIPS goal”. In particular, an abstraction language
that doesn’t support all possible goals is no more problematic than a remote
control that doesn’t operate all appliances at once.<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup> The thing that I
originally found difficult to appreciate is just how hard some of these problems
are <em>even when we have complete control over the input distribution</em>.
Reinforcement learning is <em>hard</em>. Planning is <em>hard</em>. There’s still lots of
room for interesting compositionality in these abstraction languages—if I have
some kind of structured representation of the goal, and I train on a subset of
structures, do I generalize to the rest? There’s lots we still can’t do.</p>
<p>To come up with problems that are within reach of current methods, data is
<em>generated</em> rather than collected. The distribution over abstractions and their
induced behaviors is hand-designed. There’s no language data here; what
distinguishes “language data” from what does get used is that language has no
precise execution semantics, but e.g. STRIPS does.</p>
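<p>A toy illustration of the distinction: generated data pairs each string with a precise goal test by construction. Everything below (templates, vocabulary) is invented for illustration:</p>

```python
import random

# Toy "generated rather than collected" data: a template grammar that
# emits English-looking instruction strings together with an exact goal
# test. No humans were involved, so the strings have a precise execution
# semantics but tell us nothing about natural language.
COLORS = ["red", "blue"]
SHAPES = ["box", "ball"]

def sample_example(rng):
    c, s = rng.choice(COLORS), rng.choice(SHAPES)
    instruction = f"go to the {c} {s}"     # string with no human source...
    goal = lambda obj: obj == (c, s)       # ...but an exact goal test
    return instruction, goal

rng = random.Random(0)
instruction, goal = sample_example(rng)
```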
<hr />
<p>In the last few years these two communities have run together, because the world
looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Language -> Behavior
</code></pre></div></div>
<p>Everything is end-to-end all the time. Abstraction isn’t gone, but now it lives
in some uninterpretable representation space rather than a formalism we’ve
designed by hand. This is a great thing! Language people no longer have to limit
themselves to worlds where they’re clever enough to construct a good enough
logical language.</p>
<p>And policy people (here’s where the trouble starts) no longer have to describe
their task inventory in terms of any particular formalism: they just need some
way of generating reward functions / goal tests synchronously with some kind of
(compositional?) identifier that describes them. So they generate interpretable
strings made of sequences of words. No execution semantics, uses English words:
natural language. Whence the confusion.</p>
<p>I think there’s a lesson in this for people in both communities:</p>
<ul>
<li>
<p>For “policy people” as researchers: <em>please please please</em> signpost
explicitly when your input strings were generated synthetically. The word
<em>language</em> is hopelessly overloaded at this point, but the bigram <em>natural
language</em> is not: avoid using the word <em>natural</em> unless people were involved.
(A couple of the papers linked in the first paragraph are guilty of this.)</p>
</li>
<li>
<p>For “language people” as reviewers: respond to appropriately qualified fake
language datasets by asking “Is an interesting <code class="language-plaintext highlighter-rouge">Abstraction -> Behavior</code>
problem being solved? Do strings index the target class of behaviors in an
interesting way?” There are lots of problems out there for which this is an
appropriate standard.</p>
</li>
</ul>
<p>I think we’re still at a stage where there’s something to learn from fake
language, even for those of us who ultimately care about the distribution of
sentences produced by humans.</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>Explicitly excluded from this discussion is work on doing linguistic
analysis of fake language data. Such work is not about language at all,
and is either about analyzing the formal expressive capacity of certain
model classes, or is garbage. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Not everyone! Both <a href="https://cs.brown.edu/~stefie10/">Stefanie Tellex</a> and
<a href="http://people.csail.mit.edu/branavan/">Branavan</a> had work that was trying to
tackle the complete pipeline well before the current end-to-end craze. I
think this research is a model for what the field should be trying to
accomplish, but for lots of problems our techniques just aren’t there
yet. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>There’s a sense in which Jonathan Berant and Percy Liang’s work on
<a href="http://www.aclweb.org/anthology/P14-1133">paraphrasing for semantic parsing</a>
[pdf link] actually belongs in this category, rather than <code class="language-plaintext highlighter-rouge">Language ->
Abstraction</code>. But they still hold themselves to a “real language”
evaluation standard. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Sat, 04 Nov 2017 00:00:00 +0000
http://blog.jacobandreas.net/fake-language.html

<h1>Learning to reason with neural module networks</h1>
Wed, 17 May 2017 00:00:00 +0000
http://blog.jacobandreas.net/nmns.html

<h1>A neural network is a monference, not a model</h1>
<p>The distinction between <em>models</em> and <em>inference procedures</em> is central to most
introductory presentations of artificial intelligence. For example: HMMs are a
class of model; the Viterbi algorithm is one associated inference procedure,
the forward–backward algorithm is another, and particle filtering is a third.</p>
<p>Many people describe (and presumably think of) neural networks as a class of
models. I want to argue that this view is misleading, and that it is more useful
to think of neural networks as hopelessly entangled model–inference pairs.
“Model–inference pair” is a mouthful, and there doesn’t seem to be good
existing shorthand, so I will henceforth refer to such objects as “monferences”.
My claim is that we should think of a neural network as an example of a
monference. (An implementation of the Viterbi algorithm, equipped with the
parameters of some <em>fixed</em> HMM, is also a monference.)</p>
<p>I’m about to cite a bunch of existing papers that follow naturally from the
neural-nets-as-monferences perspective—it seems like this idea is already
obvious to a lot of people. But I don’t think it’s been given a name or a
systematic treatment, and I hope others will find what follows as useful
(or at least as deconfounding) as I did.</p>
<hr />
<p>What are the consequences of regarding a neural net as a <em>model</em>?
A personal example is illustrative:</p>
<p>The first time I saw a recurrent neural network, I thought “this is an
interesting model with a broken inference procedure”. A recurrent net looks like
an HMM. An HMM has a discrete hidden state, and a recurrent net has a
vector-valued hidden state. When we do inference in an HMM, we maintain a
distribution over hidden states consistent with the output, but when we do
inference in a recurrent net, we maintain only a single vector—a single
hypothesis, and a greedy inference procedure. Surely things would be better if
there were some way of modeling uncertainty? Why not treat RNN inference like
Kalman filtering?</p>
<p>This complaint is <em>wrong</em>. Our goal in the remainder of this post is to explore
why it’s wrong.</p>
<hr />
<p>Put simply, there is no reason to regard the hidden state of a
recurrent network as a single hypothesis. After all, a sufficiently large hidden
vector can easily represent the whole table of probabilities we use in the
forward algorithm—or even represent the state of a particle filter. The
analogy “HMM hidden state = RNN hidden state” is bad; a better analogy is “HMM
<em>decoder</em> state = RNN hidden state”.</p>
<!--
This _temptation to form false analogies_ is especially appealing to those of us
who grew up in the graphical models culture, and are accustomed to inference
design problems that look algorithmic. But it's only one of a variety of failure
modes associated with the neural-nets-as-models perspective. Another
failure mode seems to preferentially afflict people from the neural nets
culture, who have never needed to think about inference at all: this is a
_failure to reason about computation_.
I don't want to pick on anyone individually. But there seems to be a recent
trend of papers that start with a basic RNN, observe that it can't solve some
simple algorithmic or reasoning problem, and conclude that some crazy new
architecture is necessary—when often it would have been enough to let the
RNN run for more steps, or make a minor change to kind of recurrent unit used.
I think people get in the habit of saying "everything is a function
approximator, and all function approximators are basically comparable". Whereas
if we say "everything is a program", these fair comparison issues become more
complicated: we have to start worrying about equal runtimes, availability of the
right floating point operations, etc. But when building inference procedures,
these are exactly the things we should worry most about!
***
-->
<p>Let’s look at this experimentally. (Code for
this section can be found in the accompanying <a href="https://github.com/jacobandreas/blog/blob/gh-pages/notebooks/monference.ipynb">Jupyter
notebook</a>.)</p>
<p>If we think about the classical inference procedure with the same structure as a
(uni-directional) recurrent neural network, it’s something like this: for <script type="math/tex">t =
0..n</script>, receive an emission <script type="math/tex">x_t</script> from the HMM, and <em>immediately</em> predict a
hidden state <script type="math/tex">y_t</script>. You should be able to convince yourself that if we’re
evaluated on tagging accuracy, the min-risk monference (if HMM parameters are
known) is to run the forward algorithm, and predict the tag with maximum
marginal probability at each time <script type="math/tex">t</script>.</p>
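<p>A minimal sketch of this classical online tagger, for a tiny two-state HMM with made-up parameters (the post's actual experiment lives in the linked notebook; this is just the filtering recursion):</p>

```python
import numpy as np

# Min-risk online tagging sketch: run the forward (filtering) recursion
# over a known HMM and predict the max-marginal state at each step.
T = np.array([[0.7, 0.3],   # T[i, j] = p(y_t = j | y_{t-1} = i)
              [0.4, 0.6]])
E = np.array([[0.9, 0.1],   # E[i, x] = p(x | y = i)
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])   # initial state distribution

def online_tags(xs):
    alpha = pi * E[:, xs[0]]
    preds = [int(alpha.argmax())]
    for x in xs[1:]:
        alpha = (T.T @ alpha) * E[:, x]   # predict, then condition on x
        alpha /= alpha.sum()              # normalize for stability
        preds.append(int(alpha.argmax()))
    return preds
```

<p>With these (sticky, fairly informative) parameters, <code>online_tags([0, 0, 1])</code> tracks the observations and returns <code>[0, 0, 1]</code>.</p>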
<p>I generated a random HMM, drew a bunch of sequences from it, and
applied this min-risk classical procedure. I obtained the following
“online tagging” accuracy:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">62.8</code></pre></figure>
<p>Another totally acceptable (though somewhat more labor-intensive) way of
producing a monference for this online tagging problem is to take the HMM, draw
many more samples from it, and use the (observed, hidden) sequences as training
data (x, y) for a vanilla RNN of the following form:</p>
<p><img src="figures/monference_rnn.png" style="width: 300px; max-width: 100%" /></p>
<p>(where each arrow is an inner product followed by a ReLU or log-loss). In this
case I obtained the following accuracy:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">62.8</code></pre></figure>
<p>Is it just a coincidence that these scores are the same? Let’s look at some
predictions:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">true hidden sequence: 1 0 0 1 0
classical monference: 1 0 1 1 0
neural monference: 1 0 1 1 0
true hidden sequence: 1 0 1 2 0
classical monference: 1 0 1 0 0
neural monference: 1 0 1 0 0</code></pre></figure>
<p>So even when our two monferences are wrong, they’re wrong in the same way.</p>
<p>Of course, we know that we can get slightly better results for this problem by
running the full forward-backward algorithm, and again making max-marginal
predictions. This improved classical procedure gave an accuracy of:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">63.3</code></pre></figure>
<p>better than either of the online models, as expected. Training a bidirectional
recurrent net</p>
<p><img src="figures/monference_bdrnn.png" style="width: 300px; max-width: 100%" /></p>
<p>on samples from the HMM gave a tagging accuracy of:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">63.3</code></pre></figure>
<p>A sample prediction:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">true hidden sequence: 0 1 0 0 1
classical monference: 0 1 0 0 1
neural monference: 0 1 0 0 1</code></pre></figure>
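<p>The smoothed tagger makes its max-marginal predictions using both past and future observations; a self-contained sketch with the same kind of made-up two-state HMM:</p>

```python
import numpy as np

# Forward–backward (smoothing) sketch: compute per-step max-marginal
# predictions from the whole observation sequence. Parameters made up.
T = np.array([[0.7, 0.3],   # T[i, j] = p(y_t = j | y_{t-1} = i)
              [0.4, 0.6]])
E = np.array([[0.9, 0.1],   # E[i, x] = p(x | y = i)
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])

def smoothed_tags(xs):
    n = len(xs)
    alpha = np.zeros((n, 2))            # forward messages
    alpha[0] = pi * E[:, xs[0]]
    for t in range(1, n):
        alpha[t] = (T.T @ alpha[t - 1]) * E[:, xs[t]]
    beta = np.ones((n, 2))              # backward messages
    for t in range(n - 2, -1, -1):
        beta[t] = T @ (E[:, xs[t + 1]] * beta[t + 1])
    marginals = alpha * beta            # unnormalized p(y_t | x_{1:n})
    return [int(m.argmax()) for m in marginals]
```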
<p>Notice: the neural nets we’ve used here don’t encode anything
about classical message-passing rules, and they definitely don’t encode anything
about the generative process underlying an HMM. Yet in both cases, the neural
net managed to achieve accuracy as good as (but no better than) the classical
message passing procedure with the same structure. Indeed, this neural training
procedure results in a piece of code that makes identical predictions to the
forward–backward algorithm, but it doesn’t know anything about the
forward–backward algorithm!</p>
<hr />
<p>Neural networks are not magic—when our data is actually generated from an HMM,
we can’t hope to beat an (information-theoretically optimal) classical
monference with a neural one. But we can empirically do just as well.
As we augment neural architectures to match the <em>algorithmic structure</em> of more
powerful classical inference procedures, their performance improves.
Bidirectional recurrent nets are better than forward-only ones; bidirectional
networks with <a href="http://arxiv.org/abs/1602.08210">multiple layers between each “real” hidden
vector</a> might be even better for some tasks.</p>
<p>Better yet, we can perhaps worry less about harder cases, when we previously
would have needed to hand-tune some approximate inference scheme. (One example:
suppose our transition matrix is a huge permutation. It might be very expensive
to do repeated multiplications for classical inference, and trying to take a
low-rank approximation to the transition matrix will lose information. But a
neural monference can potentially represent our model dynamics quite
compactly.)</p>
<p>So far we’ve been looking at sequences, but analogues for more structured data
exist as well. For tree-shaped problems, we can run something that looks like
the <a href="http://www.socher.org/uploads/Main/SocherBauerManningNg_ACL2013.pdf">inside algorithm over a fixed
tree</a> or
the <a href="https://aclweb.org/anthology/D/D15/D15-1137.pdf">inside–outside algorithm over a whole sparsified parse
chart</a>. For arbitrary graphs,
we can apply repeated <a href="http://arxiv.org/pdf/1509.09292.pdf">“graph
convolutions”</a> that start to look a lot
like belief propagation.</p>
<p>There’s a general principle here: anywhere you have an inference algorithm that
maintains a distribution over discrete states, instead:</p>
<ol>
<li>replace {chart cells, discrete distributions} with vectors</li>
<li>replace messages between cells with recurrent networks</li>
<li>unroll the “inference” procedure for a suitable number of iterations</li>
<li>train via backpropagation</li>
</ol>
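<p>A toy instance of this recipe, applied to the forward algorithm: the chart cell (a distribution over states) becomes a hidden vector, and the message update becomes a recurrent cell. The weights below are random stand-ins; in practice they would be trained by backpropagation on (observation, state) pairs:</p>

```python
import numpy as np

# Steps 1-3 of the recipe, sketched: vectors for chart cells, a recurrent
# cell for messages, unrolled once per observation. Step 4 (training by
# backpropagation) is omitted; weights here are random placeholders.
rng = np.random.default_rng(0)
d, n_obs = 8, 3
W_h = rng.normal(scale=0.5, size=(d, d))      # stands in for the transition
W_x = rng.normal(scale=0.5, size=(n_obs, d))  # stands in for the emission

def unrolled_inference(xs):
    h = np.zeros(d)                   # replaces the forward "chart cell"
    hidden_states = []
    for x in xs:                      # one message per time step
        h = np.maximum(0.0, W_h @ h + W_x[x])   # ReLU recurrent cell
        hidden_states.append(h)
    return hidden_states

hs = unrolled_inference([0, 2, 1])    # one hidden vector per observation
```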
<p>The resulting monference has at least as much capacity as the corresponding
classical procedure. To the extent that approximation is necessary, we can (at
least empirically) <em>learn</em> the right approximation end-to-end from the training
data.</p>
<p>(The version of this that backpropagates through an approximate inference
procedure, but doesn’t attempt to learn the inference function itself, has <a href="http://cs.jhu.edu/~jason/papers/#stoyanov-ropson-eisner-2011">been
around</a>
for <a href="http://www.cs.cmu.edu/~mgormley/papers/gormley+dredze+eisner.tacl.2015.pdf">a while</a>.)</p>
<p>I think there’s at least one more constituency parsing paper to be written
using all the pieces of this framework, and lots more for working with
graph-structured data.</p>
<hr />
<p>I’ve argued that the monference perspective is useful, but is it true? That is,
is there a precise sense in which a neural net is <em>really</em> a monference, and not
a model?</p>
<p>No. There’s a fundamental identifiability problem—we can’t really distinguish
between “fancy model with trivial inference” and “mystery model with complicated
inference”. Thus it also makes no sense to ask, given a trained neural network,
which model it performs inference for. (On the other hand, networks trained via
<a href="http://arxiv.org/abs/1503.02531">distillation</a> seem like good candidates for
“same model, different monference”.) And the networks-as-<em>models</em> perspective
shouldn’t be completely ignored: it’s resulted in a fruitful line of work that
replaces log-linear potentials with neural networks
<a href="https://arxiv.org/abs/1507.03641">inside CRFs</a>.
(Though one of the usual selling points of these methods is that “you get to
keep your dynamic program”, which we’ve argued here is true of suitably
organized recurrent networks as well.)</p>
<p>In spite of all this, as research focus in this corner of the machine learning
community shifts towards <a href="http://nips2015.sched.org/event/4G4h/reasoning-attention-memory-ram-workshop">planning, reasoning, and harder algorithmic
problems</a>,
I think the neural-nets-as-monferences perspective should dominate.</p>
<p>More than that—when we look back on the “deep learning revolution” ten years
from now, I think the real lesson will be the importance of end-to-end training
of decoders and reasoning procedures, even in systems that <a href="http://arxiv.org/abs/1601.01705">barely
look
like neural networks at all</a>. So when building
learning systems, don’t ask: “what is the probabilistic relationship among my
variables?”. Instead ask: “how do I approximate the inference function for my
problem?”, and attempt to learn this approximation directly. To do this
effectively, we can use everything we know about classical inference procedures.
But we should also start thinking of inference as a first-class part of the
learning problem.</p>
<hr />
<p>Thanks to Matt Gormley (whose EMNLP talk got me thinking about these issues), and
Robert Nishihara and Greg Durrett for feedback.</p>
<p>Also Jason Eisner for this gem: “An awful portmanteau, since monference should be a count
noun like model, but you took the suffix from the mass noun. Not that I can claim that infedel
is much better…”</p>
Thu, 18 Feb 2016 00:00:00 +0000
http://blog.jacobandreas.net/monference.html
http://blog.jacobandreas.net/monference.htmlPlanning in representation space<p>Agents parameterized by neural nets (Atari players etc.) seem universally to
suffer from an inability to plan. This is obvious in the case of Markov reflex
agents like vanilla deep Q learners, and seems to be true even of agents with
some amount of hidden state (like the MemN2N paper at NIPS). Nevertheless,
planning-like behaviors have been successfully applied to other deep models,
most notably text generation—beam decoding, and even beam-aware training, seem
to be essential for both MT and captioning. And of course, real planning is
ubiquitous among people working on non-toy control problems.</p>
<p>Task and motion planning is a good example. At the end of the day, we need to
solve a continuous control problem, but attempting to solve this directly
(either with a generic control policy or a TrajOpt-like procedure) is too hard.
Instead we come up with some highly simplified, hand-specified encoding of
the problem—perhaps a STRIPS representation that discards geometry. We solve
the (comparatively easy) STRIPS planning problem, and then project back down
into motion planning space. This projection might not correspond to a feasible
policy! (But we want things feasible in task space to be feasible in motion
space as much as possible.) We keep searching in planning space until we find a
solution that also works in motion space.</p>
<p>This is really just a coarse-to-fine pruning scheme—we want a cheap way to
discard plans that are obviously infeasible, so we can devote all of our
computational resources to cases that really require simulation.</p>
<p>We can represent this schematically:</p>
<p><img src="figures/planning_representations_diagram.png" style="width: 30%" /></p>
<p>Here we have a representation function <script type="math/tex">r</script>, a true cost function <script type="math/tex">c</script> (which
we may want to think of as a 0–1 feasibility judgment), and a “representation
cost” <script type="math/tex">k</script>. We want to ensure that <script type="math/tex">r</script> is “close to an isomorphism” from
motion costs to task costs, in the sense that <script type="math/tex">c(s_1, s_2) \approx k(r(s_1),
r(s_2))</script>.</p>
<p>For the STRIPS version, we assume that <script type="math/tex">r</script> and <script type="math/tex">k</script> are given to us by hand.
But can we <em>learn</em> a better representation than STRIPS for solving task and
motion planning problems?</p>
<h2 id="learning-from-example-plans">Learning from example plans</h2>
<p>First suppose that we have training data in the form of successful sequences
with motion-space waypoints <script type="math/tex">(s_1, s_2, \ldots, s^*)</script>. Then we can directly
minimize an objective of the form</p>
<script type="math/tex; mode=display">L(\theta) = \sum_i \left[c(s_i, s_{i+1}) - k_\theta(r_\theta(s_i),
r_\theta(s_{i+1}))\right]^2</script>
<p>for <script type="math/tex">r</script> and <script type="math/tex">k</script> parameterized by <script type="math/tex">\theta</script>. This is easiest if representation space (the codomain
of <script type="math/tex">r</script>) is <script type="math/tex">\mathbb{R}^d</script>; then we can manipulate <script type="math/tex">d</script> to control the tradeoff
between representation quality and the cost of searching in representation space.</p>
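A minimal sketch of this objective in Python, assuming a hypothetical squared-Euclidean representation cost for <script type="math/tex">k</script> (all names here are illustrative):

```python
def representation_cost(r1, r2):
    # hypothetical k: squared Euclidean distance in representation space
    return sum((a - b) ** 2 for a, b in zip(r1, r2))

def representation_loss(transitions, r, k=representation_cost):
    """transitions: list of (s_i, s_{i+1}, observed_cost) triples.
    L = sum_i [c(s_i, s_{i+1}) - k(r(s_i), r(s_{i+1}))]^2"""
    return sum((c - k(r(s1), r(s2))) ** 2 for s1, s2, c in transitions)
```

With the identity representation, the loss vanishes exactly when observed costs equal squared distances — the degenerate case the next paragraph warns about (constant observed costs give no pressure to learn a nontrivial <script type="math/tex">k</script>).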
<p>Problem: if we only ever observe constant <script type="math/tex">c</script> (which might be the
case if we only see good solutions), there’s no pressure to learn a nontrivial
<script type="math/tex">k</script>. So we also want examples of unsuccessful attempts.</p>
<h2 id="decoding">Decoding</h2>
<p>Given a trained model, we can solve new instances as follows:</p>
<ol>
<li>Sample a cost-weighted path through representation space <script type="math/tex">(r_1, r_2, \ldots, r_n)</script>
such that <script type="math/tex">r(s^*) \approx r_n</script>.</li>
<li>Map each representation space transition <script type="math/tex">r_1 \to r_2</script> onto a motion space
transition <script type="math/tex">s_1 \to s_2</script> such that <script type="math/tex">r(s_2) \approx r_2</script>. (Easily
expressed as an optimization problem if <script type="math/tex">r</script> is differentiable, but harder as a
policy.)</li>
<li>Repeat until one of the motion-space solutions is feasible.</li>
</ol>
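The loop above can be sketched as follows, with <code>candidates</code>, <code>project</code>, and <code>feasible</code> standing in for steps 1–3; all three are hypothetical placeholders for the learned components and the true feasibility check:

```python
def decode(candidates, project, feasible):
    """Sketch of the decoding loop. `candidates` yields cost-weighted
    representation-space paths (step 1), `project` maps such a path to a
    motion-space path (step 2), and `feasible` is the true feasibility
    check (step 3). All three are assumed given."""
    for rep_path in candidates:
        motion_path = project(rep_path)
        if feasible(motion_path):
            return motion_path
    return None  # candidate budget exhausted without a feasible plan
```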
<p>At every step that involves computing a path (whether in <script type="math/tex">r</script>-space or
<script type="math/tex">s</script>-space), we can use a wide range of possible techniques, whether
optimization-based (TrajOpt), search-based (RRT, though probably not in high
dimensions), or by learning a policy parameterized by the goal state.</p>
<h2 id="learning-directly-from-task-feedback">Learning directly from task feedback</h2>
<p>What if we don’t have good traces to learn from? Just interleave the above two
steps—starting from random initialization,
roll out to a sequence of predicted <script type="math/tex">r</script> and <script type="math/tex">s</script>, then treat this as
supervision, and again update <script type="math/tex">k</script> to reflect observed costs.</p>
<h2 id="informed-search">Informed search</h2>
<p>So far we’re assuming we can just brute-force our way through representation
space until we get close to the goal. There’s nothing to enforce that closeness
in representation space corresponds to closeness in motion space (other than the
possible smoothness of <script type="math/tex">r</script>). We might want to add an additional constraint
that if <script type="math/tex">r_i</script> is definitely three hops from <script type="math/tex">r_n</script>, then <script type="math/tex">||r_i - r_n|| >
||r_{i+1} - r_n||</script> or something similar. This immediately provides a useful
heuristic for the search in representation space.</p>
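One way to exploit that constraint: treat distance to the goal representation as the heuristic in a best-first search over representation space. A toy sketch, in which the neighbor generator and the Euclidean distance are stand-ins for the learned <script type="math/tex">k</script> and <script type="math/tex">r</script>:

```python
import heapq
import math

def rep_distance(r1, r2):
    # Euclidean distance in representation space (stand-in for ||r_i - r_n||)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))

def heuristic_search(start, goal, neighbors, eps=1e-6):
    """Best-first search: priority = cost so far + distance to the goal
    representation. `neighbors` maps a point to (cost, next_point) pairs."""
    frontier = [(rep_distance(start, goal), 0.0, start, [start])]
    seen = set()
    while frontier:
        _, g, point, path = heapq.heappop(frontier)
        if rep_distance(point, goal) < eps:
            return path
        key = tuple(point)
        if key in seen:
            continue
        seen.add(key)
        for cost, nxt in neighbors(point):
            heapq.heappush(frontier,
                           (g + cost + rep_distance(nxt, goal),
                            g + cost, nxt, path + [nxt]))
    return None
```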
<p>We can also use side-information at this stage—maybe advice provided in the
form of language or a video demonstration. (Then we need to learn another
mapping from advice space to representation space.)</p>
<h2 id="modularity">Modularity</h2>
<p>It’s common to define several different primitive operations in the STRIPS
domain—e.g. just “move” and “grasp”. We might analogously want to give our
agent access to a discrete inventory of different policies, with associated
transition costs <script type="math/tex">k_1, k_2, \ldots</script>. Now the search problem involves both
(continuously) choosing a set of points, and (discretely) choosing cost
functions / motion primitives for moving between them. The associated motions of
each of these primitives might be confined to some (hand-picked) sub-manifold of
configuration space (e.g. only move the end effector, only move the first
joint).</p>
<hr />
<p>Thanks to Dylan Hadfield-Menell for useful discussions about task and motion
planning.</p>
Sun, 17 Jan 2016 00:00:00 +0000
http://blog.jacobandreas.net/planning-representations.html
http://blog.jacobandreas.net/planning-representations.htmlPrograms made of neural networks<p>A crude history of applied machine learning:</p>
<blockquote>
<p>Whenever we have a low-capacity model with hand-engineered structural
constraints, and replace it with a high-capacity model with simple features
and few structural constraints, model quality improves [models are smaller,
take less time to develop, and generalize better to unseen data].</p>
</blockquote>
<p>See (in NLP): linear models replacing decision lists, Jelinek’s infamous
linguists, statistical machine translation, the recent flurry of papers whose
entire substance is “replace this log-linear model (a two-layer neural net) with
a three-layer neural net”.</p>
<p>A crude history of programming languages:</p>
<blockquote>
<p>Whenever we have a programming language with lots of simple constructs, and we
replace them with a few high-level constructs, program quality
improves [programs of equivalent complexity are shorter, take less time to
develop, and are less likely to contain bugs].</p>
</blockquote>
<p>First everyone stopped writing assembly, then everyone stopped writing C.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></p>
<p>Notice: in each case we’re moving the same slider—just moving it in different
directions. Machine learning and programming language design ultimately have the
same goal: making it as easy as possible for certain problem-solving machines
(whether those are humans or optimization algorithms) to produce correct code
according to some specification. Out in the real world, we don’t favor machine
learning because it is inherently more pure or beautiful than writing code by
hand—we use it because it’s <em>effective</em>. If somebody released a library today
with a bunch of composable vision primitives, and suddenly Facebook could
solve all their image-labeling problems more effectively with interns rather
than neural nets, then neural nets would be out the door tomorrow.</p>
<p>Actually, can we write this library now?</p>
<p>To be clear, I don’t mean something like OpenCV, where you take a bunch of
pre-implemented models for particular tasks and then do whatever
stitching-together you want in postprocessing. Instead, again, some notion of
little vision primitives from which it would be possible to write a classifier
as</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"> <span class="nf">load</span><span class="o">(</span><span class="n">image</span><span class="o">)</span> <span class="n">andThen</span>
<span class="n">detectObjects</span> <span class="n">andThen</span>
<span class="nf">orderBy</span><span class="o">(</span><span class="n">salience</span><span class="o">)</span> <span class="n">andThen</span>
<span class="n">head</span> <span class="n">andThen</span>
<span class="n">name</span></code></pre></figure>
<p>or a captioner as</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"> <span class="nf">load</span><span class="o">(</span><span class="n">image</span><span class="o">)</span> <span class="n">andThen</span>
<span class="n">detectObjects</span> <span class="n">andThen</span>
<span class="n">describeAll</span></code></pre></figure>
<p>or a face detector as</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"> <span class="nf">load</span><span class="o">(</span><span class="n">image</span><span class="o">)</span> <span class="n">andThen</span>
<span class="n">detectObjects</span> <span class="n">andThen</span>
<span class="nf">filter</span><span class="o">(</span><span class="nf">name</span><span class="o">(</span><span class="k">_</span><span class="o">)</span> <span class="o">==</span> <span class="nc">Face</span><span class="o">)</span> <span class="n">andThen</span>
<span class="n">drawBoundaries</span></code></pre></figure>
<p>What do the functions <code class="language-plaintext highlighter-rouge">detectObjects</code>, <code class="language-plaintext highlighter-rouge">describeAll</code>, etc. look like? Current
experience suggests that they should be neural nets, but neural nets of a very
particular kind: rather than being trained to accomplish some particular task
(like image captioning), they’re trained to be freely composable: <code class="language-plaintext highlighter-rouge">describeAll</code>
promises to take anything “like a list of detections” (whether directly from
<code class="language-plaintext highlighter-rouge">detectObjects</code> or subsequently filtered) and produce a string. Note in
particular that the inputs and outputs to these functions are all real
vectors. There is no way to structurally enforce that a thing “like a list of
detections” actually has the desired semantics, and instead we rely entirely on
the training procedure.</p>
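The composition pattern itself is easy to sketch in Python, mirroring the Scala <code>andThen</code> snippets above; the “networks” here are stubs with hard-coded outputs, purely to show the plumbing:

```python
class Module:
    """A trained network viewed as a composable function. The inputs and
    outputs would really be vectors with only conventional "types";
    `and_then` mirrors Scala's `andThen`."""
    def __init__(self, fn):
        self.fn = fn
    def __call__(self, x):
        return self.fn(x)
    def and_then(self, other):
        return Module(lambda x: other(self(x)))

# Stub "networks" -- in a real library these would be trained modules.
detect_objects = Module(lambda img: [("Face", 0.9), ("Tree", 0.4)])
keep_faces = Module(lambda dets: [d for d in dets if d[0] == "Face"])
describe_all = Module(lambda dets: " and ".join(d[0] for d in dets))

# A "caption the people" pipeline assembled from the same primitives,
# with no task-specific training data.
face_captioner = detect_objects.and_then(keep_faces).and_then(describe_all)
```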
<p>In current real-world implementations, there’s a notion of <em>layers</em> as modular,
pre-specified units, but <em>networks</em> as monolithic models customized for specific
tasks (and requiring end-to-end training). Once we move to modular networks,
though, we can start to perform tasks for which no training data exists. For
example, “write a caption about the people in this image”:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"> <span class="nf">load</span><span class="o">(</span><span class="n">image</span><span class="o">)</span> <span class="n">andThen</span>
<span class="n">detectObjects</span> <span class="n">andThen</span>
<span class="nf">filter</span><span class="o">(</span><span class="nf">name</span><span class="o">(</span><span class="k">_</span><span class="o">)</span> <span class="o">==</span> <span class="nc">Face</span><span class="o">)</span>
<span class="n">andThen</span> <span class="n">describeAll</span></code></pre></figure>
<p>using the same primitives specified above.</p>
<p>Steps we are already taking in this direction: the fact that people use a prefix
of an image classification network to initialize models for basically every
other vision task; the fact that “attention” is suddenly being treated as a
primitive in model descriptions even though it’s a complicated sequence of
operations for combining multiple layers. Roger Grosse’s beautiful <a href="http://www.cs.toronto.edu/~rgrosse/uai2012-matrix.pdf">paper on
grammars over matrix factorization
models</a> also kind of
looks like this, and Christopher Olah has a <a href="http://colah.github.io/posts/2015-09-NN-Types-FP/">discussion of the type-theoretic
niceties</a> of neural nets
understood as collections of reusable modules (though to me this seems largely
secondary to the practical question of what these types are).</p>
<p>To bring this back to the earlier programming language discussion, we observe
that:</p>
<ol>
<li>
<p>It’s hard for a person to write down a person-detector by hand, but easy for
a neural net.</p>
</li>
<li>
<p>Given appropriate functional vision primitives, it’s easy for a person to
write down a person <em>describer</em>. But training a neural net to do this from
scratch requires a lot of example descriptions of people. (We
might then say it’s “easy” for people but “hard” for a neural net.)</p>
</li>
</ol>
<p>To take this yet a step further, we can note that there are lots of machine
learning techniques that are more human-like than neural-net-like, in the sense
that they do well with tiny data sets and a good pre-specified inventory of
primitives (e.g. program induction, semantic parsing). If we really just care
about minimal human intervention, we can figure out our vision primitives and
then hand them off to a machine learning subsystem of an entirely different
kind.</p>
<p>So let’s write this library! There are research questions here: First, what is
the right set of functional primitives to give people (or models for program
induction)? Next, can these shared representations actually be learned? How do
we find parameter settings for these modules using the kinds of labeled data
currently available?</p>
<p>Disclosure: I already have a model like this working on a bunch of simple
question-answering tasks about images—I think it’s a really exciting
proof-of-concept, and I’ll hopefully be able to show it off soon. But it’s not a
comprehensive solution (esp. if we want to interface between vision / language /
control applications), and I think there’s a really interesting systems problem
here too.</p>
<hr />
<p>Followup:</p>
<ul>
<li><a href="http://arxiv.org/abs/1511.02799">Deep compositional question answering with neural networks</a></li>
<li><a href="http://arxiv.org/abs/1601.01705">Learning to compose neural networks for question answering</a></li>
</ul>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>Obviously this is a gross overstatement, since lots of people do continue to
write assembly and C. But I think it’s less controversial to say that <em>fewer</em>
people write in low-level languages, and that it’s harder to do so correctly. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Sun, 06 Sep 2015 00:00:00 +0000
http://blog.jacobandreas.net/programming-with-nns.html
http://blog.jacobandreas.net/programming-with-nns.html