Notes on language, machine learning, etc.
http://blog.jacobandreas.net/
Wed, 28 Jun 2017 23:07:06 +0000

Learning to reason with neural module networks
Wed, 17 May 2017 00:00:00 +0000
http://blog.jacobandreas.net/nmns.html

A neural network is a monference, not a model

<p>The distinction between <em>models</em> and <em>inference procedures</em> is central to most
introductory presentations of artificial intelligence. For example: HMMs are a
class of model; the Viterbi algorithm is one associated inference procedure,
the forward–backward algorithm is another, and particle filtering is a third.</p>
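To make the separation concrete, here is a minimal numpy sketch (with hypothetical toy parameters, not drawn from any real data) in which the HMM model is just a bag of parameters, and Viterbi is one inference procedure that consumes them:

```python
import numpy as np

# A fixed HMM "model": hypothetical toy parameters.
init = np.array([0.6, 0.4])                  # p(y_0)
trans = np.array([[0.7, 0.3], [0.4, 0.6]])   # p(y_t | y_{t-1})
emit = np.array([[0.9, 0.1], [0.2, 0.8]])    # p(x_t | y_t)

def viterbi(obs, init, trans, emit):
    """One inference procedure for the HMM: the most likely hidden sequence."""
    T, K = len(obs), len(init)
    delta = np.zeros((T, K))            # best log-prob of any path ending in each state
    back = np.zeros((T, K), dtype=int)  # backpointers
    delta[0] = np.log(init) + np.log(emit[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(trans)  # [prev state, current state]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(emit[:, obs[t]])
    # Follow backpointers from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

The forward–backward algorithm or a particle filter would be other functions over the same three parameter arrays; that interchangeability is exactly what the model/inference distinction buys.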
<p>Many people describe (and presumably think of) neural networks as a class of
models. I want to argue that this view is misleading, and that it is more useful
to think of neural networks as hopelessly entangled model–inference pairs.
“Model–inference pair” is a mouthful, and there doesn’t seem to be good
existing shorthand, so I will henceforth refer to such objects as “monferences”.
My claim is that we should think of a neural network as an example of a
monference. (An implementation of the Viterbi algorithm, equipped with the
parameters of some <em>fixed</em> HMM, is also a monference.)</p>
<p>I’m about to cite a bunch of existing papers that follow naturally from the
neural-nets-as-monferences perspective—it seems like this idea is already
obvious to a lot of people. But I don’t think it’s been given a name or a
systematic treatment, and I hope others will find what follows as useful
(or at least as deconfounding) as I did.</p>
<hr />
<p>What are the consequences of regarding a neural net as a <em>model</em>?
A personal example is illustrative:</p>
<p>The first time I saw a recurrent neural network, I thought “this is an
interesting model with a broken inference procedure”. A recurrent net looks like
an HMM. An HMM has a discrete hidden state, and a recurrent net has a
vector-valued hidden state. When we do inference in an HMM, we maintain a
distribution over hidden states consistent with the output, but when we do
inference in a recurrent net, we maintain only a single vector—a single
hypothesis, and a greedy inference procedure. Surely things would be better if
there were some way of modeling uncertainty? Why not treat RNN inference like
Kalman filtering?</p>
<p>This complaint is <em>wrong</em>. Our goal in the remainder of this post is to explore
why it’s wrong.</p>
<hr />
<p>Put simply, there is no reason to regard the hidden state of a
recurrent network as a single hypothesis. After all, a sufficiently large hidden
vector can easily represent the whole table of probabilities we use in the
forward algorithm—or even represent the state of a particle filter. The
analogy “HMM hidden state = RNN hidden state” is bad; a better analogy is “HMM
<em>decoder</em> state = RNN hidden state”.</p>
<!--
This _temptation to form false analogies_ is especially appealing to those of us
who grew up in the graphical models culture, and are accustomed to inference
design problems that look algorithmic. But it's only one of a variety of failure
modes associated with the neural-nets-as-models perspective. Another
failure mode seems to preferentially afflict people from the neural nets
culture, who have never needed to think about inference at all: this is a
_failure to reason about computation_.
I don't want to pick on anyone individually. But there seems to be a recent
trend of papers that start with a basic RNN, observe that it can't solve some
simple algorithmic or reasoning problem, and conclude that some crazy new
architecture is necessary—when often it would have been enough to let the
RNN run for more steps, or make a minor change to the kind of recurrent unit used.
I think people get in the habit of saying "everything is a function
approximator, and all function approximators are basically comparable". Whereas
if we say "everything is a program", these fair comparison issues become more
complicated: we have to start worrying about equal runtimes, availability of the
right floating point operations, etc. But when building inference procedures,
these are exactly the things we should worry most about!
***
-->
<p>Let’s look at this experimentally. (Code for
this section can be found in the accompanying <a href="https://github.com/jacobandreas/blog/blob/gh-pages/notebooks/monference.ipynb">Jupyter
notebook</a>.)</p>
<p>If we think about the classical inference procedure with the same structure as a
(uni-directional) recurrent neural network, it’s something like this: for <script type="math/tex">t =
0..n</script>, receive an emission <script type="math/tex">x_t</script> from the HMM, and <em>immediately</em> predict a
hidden state <script type="math/tex">y_t</script>. You should be able to convince yourself that if we’re
evaluated on tagging accuracy, the min-risk monference (if HMM parameters are
known) is to run the forward algorithm, and predict the tag with maximum
marginal probability at each time <script type="math/tex">t</script>.</p>
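A minimal sketch of this min-risk online monference (numpy, with hypothetical toy parameters; the post's actual experiment is in the linked notebook):

```python
import numpy as np

# Hypothetical toy HMM parameters.
init = np.array([0.5, 0.5])
trans = np.array([[0.8, 0.2], [0.3, 0.7]])
emit = np.array([[0.9, 0.1], [0.2, 0.8]])

def online_tags(obs):
    """Forward algorithm: at each t, predict argmax of p(y_t | x_0..x_t)."""
    alpha = init * emit[:, obs[0]]
    alpha /= alpha.sum()                  # normalize: filtering distribution
    preds = [int(alpha.argmax())]
    for x in obs[1:]:
        alpha = (alpha @ trans) * emit[:, x]
        alpha /= alpha.sum()
        preds.append(int(alpha.argmax()))
    return preds
```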
<p>I generated a random HMM, drew a bunch of sequences from it, and
applied this min-risk classical procedure. I obtained the following
“online tagging” accuracy:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">62.8</code></pre></figure>
<p>Another totally acceptable (though somewhat more labor-intensive) way of
producing a monference for this online tagging problem is to take the HMM, draw
many more samples from it, and use the (observed, hidden) sequences as training
data (x, y) for a vanilla RNN of the following form:</p>
<p><img src="figures/monference_rnn.png" style="width: 300px; max-width: 100%" /></p>
<p>(where each arrow is an inner product followed by a ReLU or log-loss). In this
case I obtained the following accuracy:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">62.8</code></pre></figure>
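The shape of this network's computation can be sketched as follows (hypothetical sizes and untrained random weights; this shows only the structure, not the trained monference):

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_tags, d = 3, 2, 16                  # hypothetical sizes, not the post's

# One recurrent cell: h_t = ReLU(Wx x_t + Wh h_{t-1}); a tag is read off at every step.
Wx = rng.normal(0, 0.1, (d, n_obs))
Wh = rng.normal(0, 0.1, (d, d))
Wy = rng.normal(0, 0.1, (n_tags, d))

def rnn_tags(obs):
    h = np.zeros(d)
    preds = []
    for x in obs:
        onehot = np.eye(n_obs)[x]
        h = np.maximum(0, Wx @ onehot + Wh @ h)  # each arrow: inner product + ReLU
        preds.append(int((Wy @ h).argmax()))     # immediate prediction of y_t
    return preds
```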
<p>Is it just a coincidence that these scores are the same? Let’s look at some
predictions:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">true hidden sequence: 1 0 0 1 0
classical monference: 1 0 1 1 0
neural monference: 1 0 1 1 0
true hidden sequence: 1 0 1 2 0
classical monference: 1 0 1 0 0
neural monference: 1 0 1 0 0</code></pre></figure>
<p>So even when our two monferences are wrong, they’re wrong in the same way.</p>
<p>Of course, we know that we can get slightly better results for this problem by
running the full forward-backward algorithm, and again making max-marginal
predictions. This improved classical procedure gave an accuracy of:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">63.3</code></pre></figure>
<p>better than either of the online models, as expected. Training a bidirectional
recurrent net</p>
<p><img src="figures/monference_bdrnn.png" style="width: 300px; max-width: 100%" /></p>
<p>on samples from the HMM gave a tagging accuracy of:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">63.3</code></pre></figure>
<p>A sample prediction:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">true hidden sequence: 0 1 0 0 1
classical monference: 0 1 0 0 1
neural monference: 0 1 0 0 1</code></pre></figure>
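The forward–backward max-marginal procedure used above can be sketched as (numpy, hypothetical toy parameters):

```python
import numpy as np

# Hypothetical toy HMM parameters.
init = np.array([0.5, 0.5])
trans = np.array([[0.8, 0.2], [0.3, 0.7]])
emit = np.array([[0.9, 0.1], [0.2, 0.8]])

def smoothed_tags(obs):
    """Forward-backward: at each t, predict argmax of p(y_t | x_0..x_n)."""
    T = len(obs)
    alpha = np.zeros((T, 2))
    beta = np.ones((T, 2))
    alpha[0] = init * emit[:, obs[0]]
    for t in range(1, T):                      # forward messages
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]
    for t in range(T - 2, -1, -1):             # backward messages
        beta[t] = trans @ (emit[:, obs[t + 1]] * beta[t + 1])
    marginals = alpha * beta                   # unnormalized p(y_t | x_0..x_n)
    return [int(m.argmax()) for m in marginals]
```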
<p>Notice: the neural nets we’ve used here don’t encode anything
about classical message-passing rules, and they definitely don’t encode anything
about the generative process underlying an HMM. Yet in both cases, the neural
net managed to achieve accuracy as good as (but no better than) the classical
message passing procedure with the same structure. Indeed, this neural training
procedure results in a piece of code that makes identical predictions to the
forward–backward algorithm, but it doesn’t know anything about the
forward–backward algorithm!</p>
<hr />
<p>Neural networks are not magic—when our data is actually generated from an HMM,
we can’t hope to beat an (information-theoretically optimal) classical
monference with a neural one. But we can empirically do just as well.
As we augment neural architectures to match the <em>algorithmic structure</em> of more
powerful classical inference procedures, their performance improves.
Bidirectional recurrent nets are better than forward-only ones; bidirectional
networks with <a href="http://arxiv.org/abs/1602.08210">multiple layers between each “real” hidden
vector</a> might be even better for some tasks.</p>
<p>Better yet, we can perhaps worry less about harder cases, when we previously
would have needed to hand-tune some approximate inference scheme. (One example:
suppose our transition matrix is a huge permutation. It might be very expensive
to do repeated multiplications for classical inference, and trying to take a
low-rank approximation to the transition matrix will lose information. But a
neural monference can potentially represent our model dynamics quite
compactly.)</p>
<p>So far we’ve been looking at sequences, but analogues for more structured data
exist as well. For tree-shaped problems, we can run something that looks like
the <a href="http://www.socher.org/uploads/Main/SocherBauerManningNg_ACL2013.pdf">inside algorithm over a fixed
tree</a> or
the <a href="https://aclweb.org/anthology/D/D15/D15-1137.pdf">inside–outside algorithm over a whole sparsified parse
chart</a>. For arbitrary graphs,
we can apply repeated <a href="http://arxiv.org/pdf/1509.09292.pdf">“graph
convolutions”</a> that start to look a lot
like belief propagation.</p>
<p>There’s a general principle here: anywhere you have an inference algorithm that
maintains a distribution over discrete states, instead:</p>
<ol>
<li>replace {chart cells, discrete distributions} with vectors</li>
<li>replace messages between cells with recurrent networks</li>
<li>unroll the “inference” procedure for a suitable number of iterations</li>
<li>train via backpropagation</li>
</ol>
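For a chain, the recipe above might be sketched as follows (hypothetical sizes; step 4 would use any autodiff framework):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_nodes, n_iters = 8, 5, 3                  # hypothetical sizes

# Step 1: chart cells / discrete distributions become vectors.
cells = rng.normal(0, 0.1, (n_nodes, d))

# Step 2: messages between cells become a shared "recurrent" update.
Wl = rng.normal(0, 0.1, (d, d))
Wr = rng.normal(0, 0.1, (d, d))

def message_pass(cells):
    """One round: each cell is recomputed from its left and right neighbors."""
    left = np.vstack([np.zeros(d), cells[:-1]])   # message from the left neighbor
    right = np.vstack([cells[1:], np.zeros(d)])   # message from the right neighbor
    return np.maximum(0, left @ Wl.T + right @ Wr.T)

# Step 3: unroll the "inference" procedure for a fixed number of iterations.
for _ in range(n_iters):
    cells = message_pass(cells)

# Step 4 (not shown): train Wl, Wr end-to-end by backpropagation.
```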
<p>The resulting monference has at least as much capacity as the corresponding
classical procedure. To the extent that approximation is necessary, we can (at
least empirically) <em>learn</em> the right approximation end-to-end from the training
data.</p>
<p>(The version of this that backpropagates through an approximate inference
procedure, but doesn’t attempt to learn the inference function itself, has <a href="http://cs.jhu.edu/~jason/papers/#stoyanov-ropson-eisner-2011">been
around</a>
for <a href="http://www.cs.cmu.edu/~mgormley/papers/gormley+dredze+eisner.tacl.2015.pdf">a while</a>.)</p>
<p>I think there’s at least one more constituency parsing paper to be written
using all the pieces of this framework, and lots more for working with
graph-structured data.</p>
<hr />
<p>I’ve argued that the monference perspective is useful, but is it true? That is,
is there a precise sense in which a neural net is <em>really</em> a monference, and not
a model?</p>
<p>No. There’s a fundamental identifiability problem—we can’t really distinguish
between “fancy model with trivial inference” and “mystery model with complicated
inference”. Thus it also makes no sense to ask, given a trained neural network,
which model it performs inference for. (On the other hand, networks trained via
<a href="http://arxiv.org/abs/1503.02531">distillation</a> seem like good candidates for
“same model, different monference”.) And the networks-as-<em>models</em> perspective
shouldn’t be completely ignored: it’s resulted in a fruitful line of work that
replaces log-linear potentials with neural networks <a href="http://www.eecs.berkeley.edu/~gdurrett/papers/durrett-klein-acl2015.pdf">inside
CRFs</a>.
(Though one of the usual selling points of these methods is that “you get to
keep your dynamic program”, which we’ve argued here is true of suitably
organized recurrent networks as well.)</p>
<p>In spite of all this, as research focus in this corner of the machine learning
community shifts towards <a href="http://nips2015.sched.org/event/4G4h/reasoning-attention-memory-ram-workshop">planning, reasoning, and harder algorithmic
problems</a>,
I think the neural-nets-as-monferences perspective should dominate.</p>
<p>More than that—when we look back on the “deep learning revolution” ten years
from now, I think the real lesson will be the importance of end-to-end training
of decoders and reasoning procedures, even in systems that <a href="http://arxiv.org/abs/1601.01705">barely
look
like neural networks at all</a>. So when building
learning systems, don’t ask: “what is the probabilistic relationship among my
variables?”. Instead ask: “how do I approximate the inference function for my
problem?”, and attempt to learn this approximation directly. To do this
effectively, we can use everything we know about classical inference procedures.
But we should also start thinking of inference as a first-class part of the
learning problem.</p>
<hr />
<p>Thanks to Matt Gormley (whose EMNLP talk got me thinking about these issues), and
Robert Nishihara and Greg Durrett for feedback.</p>
<p>Also Jason Eisner for this gem: “An awful portmanteau, since monference should be a count
noun like model, but you took the suffix from the mass noun. Not that I can claim that infedel
is much better…”</p>
Thu, 18 Feb 2016 00:00:00 +0000
http://blog.jacobandreas.net/monference.html

Planning in representation space

<p>Agents parameterized by neural nets (Atari players etc.) seem to universally
suffer from an inability to plan. This is obvious in the case of Markov reflex
agents like vanilla deep Q learners, and seems to be true even of agents with
some amount of hidden state (like the MemN2N paper at NIPS). Nevertheless
planning-like behaviors have been successfully applied to other deep models,
most notably text generation—beam decoding, and even beam-aware training, seem
to be essential for both MT and captioning. And of course, real planning is
ubiquitous among people working on non-toy control problems.</p>
<p>Task and motion planning is a good example. At the end of the day, we need to
solve a continuous control problem, but attempting to solve this directly
(either with a generic control policy or a TrajOpt-like procedure) is too hard.
Instead we come up with some highly simplified, hand-specified encoding of
the problem—perhaps a STRIPS representation that discards geometry. We solve
the (comparatively easy) STRIPS planning problem, and then project back down
into motion planning space. This projection might not correspond to a feasible
policy! (But we want things feasible in task space to be feasible in motion
space as much as possible.) We keep searching in planning space until we find a
solution that also works in task space.</p>
<p>This is really just a coarse-to-fine pruning scheme—we want a cheap way to
discard plans that are obviously infeasible, so we can devote all of our
computational resources to cases that really require simulation.</p>
<p>We can represent this schematically:</p>
<p><img src="figures/planning_representations_diagram.png" style="width: 30%" /></p>
<p>Here we have a representation function <script type="math/tex">r</script>, a true cost function <script type="math/tex">c</script> (which
we may want to think of as a 0–1 feasibility judgment), and a “representation
cost” <script type="math/tex">k</script>. We want to ensure that <script type="math/tex">r</script> is “close to an isomorphism” from
motion costs to task costs, in the sense that <script type="math/tex">c(s_1, s_2) \approx k(r(s_1),
r(s_2))</script>.</p>
<p>For the STRIPS version, we assume that <script type="math/tex">r</script> and <script type="math/tex">k</script> are given to us by hand.
But can we <em>learn</em> a better representation than STRIPS for solving task and
motion planning problems?</p>
<h2 id="learning-from-example-plans">Learning from example plans</h2>
<p>First suppose that we have training data in the form of successful sequences
with motion-space waypoints <script type="math/tex">(s_1, s_2, \ldots, s^*)</script>. Then we can directly
minimize an objective of the form</p>
<script type="math/tex; mode=display">L(\theta) = \sum_i \left[c(s_i, s_{i+1}) - k_\theta(r_\theta(s_i),
r_\theta(s_{i+1}))\right]^2</script>
<p>for <script type="math/tex">r</script> and <script type="math/tex">k</script> parameterized by <script type="math/tex">\theta</script>. This is easiest if the representation space (the codomain
of <script type="math/tex">r</script>) is <script type="math/tex">\mathbb{R}^d</script>; then we can manipulate <script type="math/tex">d</script> to control the tradeoff
between representation quality and the cost of searching in representation space.</p>
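A sketch of this objective in numpy (the linear form of r and the particular form of k here are placeholders chosen only to make the objective concrete, not proposals for the right architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
s_dim, d = 6, 2                        # motion-space and representation dims (hypothetical)

# Placeholder parameterizations.
Wr = rng.normal(0, 0.1, (d, s_dim))    # r(s) = Wr @ s
wk = rng.normal(0, 0.1, d)             # k(r1, r2) = |wk . (r2 - r1)|

def r(s):
    return Wr @ s

def k(r1, r2):
    return abs(wk @ (r2 - r1))

def loss(waypoints, costs):
    """Sum over i of [c(s_i, s_{i+1}) - k(r(s_i), r(s_{i+1}))]^2."""
    return sum((c - k(r(s1), r(s2))) ** 2
               for (s1, s2), c in zip(zip(waypoints, waypoints[1:]), costs))
```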
<p>Problem: if we only ever observe constant <script type="math/tex">c</script> (which might be the
case if we only see good solutions), there’s no pressure to learn a nontrivial
<script type="math/tex">k</script>. So we also want examples of unsuccessful attempts.</p>
<h2 id="decoding">Decoding</h2>
<p>Given a trained model, we can solve new instances by:</p>
<ol>
<li>Sample a cost-weighted path through representation space <script type="math/tex">(r_1, r_2, ..., r_n)</script>
such that <script type="math/tex">r(s^*) \approx r_n</script>.</li>
<li>Map each representation space transition <script type="math/tex">r_1 \to r_2</script> onto a motion space
transition <script type="math/tex">s_1 \to s_2</script> such that <script type="math/tex">r(s_2) \approx r_2</script>. (Easily
expressed as an opt problem if <script type="math/tex">r</script> is differentiable, but harder as a
policy.)</li>
<li>Repeat until one of the motion-space solutions is feasible.</li>
</ol>
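When r is differentiable, step 2 can be sketched as gradient descent on the squared representation error (numpy, with a linear placeholder for r):

```python
import numpy as np

rng = np.random.default_rng(0)
s_dim, d = 6, 2                            # hypothetical dims
Wr = rng.normal(size=(d, s_dim))

def r(s):
    """A differentiable representation function (linear placeholder)."""
    return Wr @ s

def invert(r_target, steps=5000, lr=0.01):
    """Find a motion-space point s whose representation is close to r_target."""
    s = np.zeros(s_dim)
    for _ in range(steps):
        grad = 2 * Wr.T @ (r(s) - r_target)  # gradient of ||r(s) - r_target||^2
        s -= lr * grad
    return s

r_target = np.array([1.0, -0.5])
s = invert(r_target)
```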
<p>At every step that involves computing a path (whether in <script type="math/tex">r</script>-space or
<script type="math/tex">s</script>-space), we can use a wide range of techniques, whether
optimization-based (TrajOpt), search-based (RRT, though probably not in high
dimensions), or by learning a policy parameterized by the goal state.</p>
<h2 id="learning-directly-from-task-feedback">Learning directly from task feedback</h2>
<p>What if we don’t have good traces to learn from? Just interleave the above two
steps—starting from random initialization,
roll out to a sequence of predicted <script type="math/tex">r</script> and <script type="math/tex">s</script>, then treat this as
supervision, and again update <script type="math/tex">k</script> to reflect observed costs.</p>
<h2 id="informed-search">Informed search</h2>
<p>So far we’re assuming we can just brute-force our way through representation
space until we get close to the goal. There’s nothing to enforce that closeness
in representation space corresponds to closeness in motion space (other than the
possible smoothness of <script type="math/tex">r</script>). We might want to add an additional constraint
that if <script type="math/tex">r_i</script> is definitely three hops from <script type="math/tex">r_n</script>, then <script type="math/tex">||r_i - r_n|| >
||r_{i+1} - r_n||</script> or something similar. This immediately provides a useful
heuristic for the search in representation space.</p>
<p>We can also use side-information at this stage—maybe advice provided in the
form of language or a video demonstration. (Then we need to learn another
mapping from advice space to representation space.)</p>
<h2 id="modularity">Modularity</h2>
<p>It’s common to define several different primitive operations in the STRIPS
domain—e.g. just “move” and “grasp”. We might analogously want to give our
agent access to a discrete inventory of different policies, with associated
transition costs <script type="math/tex">k_1, k_2, \ldots</script>. Now the search problem involves both
(continuously) choosing a set of points, and (discretely) choosing cost
functions / motion primitives for moving between them. The associated motions of
each of these primitives might be confined to some (hand-picked) sub-manifold of
configuration space (e.g. only move the end effector, only move the first
joint).</p>
<hr />
<p>Thanks to Dylan Hadfield-Menell for useful discussions about task and motion
planning.</p>
Sun, 17 Jan 2016 00:00:00 +0000
http://blog.jacobandreas.net/planning-representations.html

Programs made of neural networks

<p>A crude history of applied machine learning:</p>
<blockquote>
<p>Whenever we have a low-capacity model with hand-engineered structural
constraints, and replace it with a high-capacity model with simple features
and few structural constraints, model quality improves [models are smaller,
take less time to develop, and generalize better to unseen data].</p>
</blockquote>
<p>See (in NLP): linear models replacing decision lists, Jelinek’s infamous
linguists, statistical machine translation, the recent flurry of papers whose
entire substance is “replace this log-linear model (a two-layer neural net) with
a three-layer neural net”.</p>
<p>A crude history of programming languages:</p>
<blockquote>
<p>Whenever we have a programming language with lots of simple constructs, and we
replace them with a few high-level constructs, program quality
improves [programs of equivalent complexity are shorter, take less time to
develop, and are less likely to contain bugs].</p>
</blockquote>
<p>First everyone stopped writing assembly, then everyone stopped writing C.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></p>
<p>Notice: in each case we’re moving the same slider—just moving it in different
directions. Machine learning and programming language design ultimately have the
same goal: making it as easy as possible for certain problem-solving machines
(whether those are humans or optimization algorithms) to produce correct code
according to some specification. Out in the real world, we don’t favor machine
learning because it is inherently more pure or beautiful than writing code by
hand—we use it because it’s <em>effective</em>. If somebody released a library today
with a bunch of composable vision primitives, and suddenly Facebook could
solve all their image-labeling problems more effectively with interns rather
than neural nets, then neural nets would be out the door tomorrow.</p>
<p>Actually, can we write this library now?</p>
<p>To be clear, I don’t mean something like OpenCV, where you take a bunch of
pre-implemented models for particular tasks and then do whatever
stitching-together you want in postprocessing. Instead, again, some notion of
little vision primitives from which it would be possible to write a classifier
as</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"> <span class="n">load</span><span class="o">(</span><span class="n">image</span><span class="o">)</span> <span class="n">andThen</span>
<span class="n">detectObjects</span> <span class="n">andThen</span>
<span class="n">orderBy</span><span class="o">(</span><span class="n">salience</span><span class="o">)</span> <span class="n">andThen</span>
<span class="n">head</span> <span class="n">andThen</span>
<span class="n">name</span></code></pre></figure>
<p>or a captioner as</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"> <span class="n">load</span><span class="o">(</span><span class="n">image</span><span class="o">)</span> <span class="n">andThen</span>
<span class="n">detectObjects</span> <span class="n">andThen</span>
<span class="n">describeAll</span></code></pre></figure>
<p>or a face detector as</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"> <span class="n">load</span><span class="o">(</span><span class="n">image</span><span class="o">)</span> <span class="n">andThen</span>
<span class="n">detectObjects</span> <span class="n">andThen</span>
<span class="n">filter</span><span class="o">(</span><span class="n">name</span><span class="o">(</span><span class="k">_</span><span class="o">)</span> <span class="o">==</span> <span class="nc">Face</span><span class="o">)</span> <span class="n">andThen</span>
<span class="n">drawBoundaries</span></code></pre></figure>
<p>What do the functions <code class="highlighter-rouge">detectObjects</code>, <code class="highlighter-rouge">describeAll</code>, etc. look like? Current
experience suggests that they should be neural nets, but neural nets of a very
particular kind: rather than being trained to accomplish some particular task
(like image captioning), they’re trained to be freely composable: <code class="highlighter-rouge">describeAll</code>
promises to take anything “like a list of detections” (whether directly from
<code class="highlighter-rouge">detectObjects</code> or subsequently filtered) and produce a string. Note in
particular that the inputs and outputs to these functions are all real
vectors. There is no way to structurally enforce that a thing “like a list of
detections” actually has the desired semantics, and instead we rely entirely on
the training procedure.</p>
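In Python terms, the contract these primitives obey might be sketched like this (every name and size is hypothetical, and `mlp` is a stand-in for a trained module):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32    # the shared "like a list of detections" vector size (hypothetical)

def mlp(in_dim, out_dim):
    """A stand-in for a trained module: one linear layer plus ReLU."""
    W = rng.normal(0, 0.1, (out_dim, in_dim))
    return lambda x: np.maximum(0, W @ x)

# Each primitive maps real vectors to real vectors, so composition is just
# function composition; only training enforces the intended semantics.
detect_objects = mlp(64, d)    # "image" vector -> detections-like vector
filter_faces = mlp(d, d)       # detections-like -> detections-like
describe_all = mlp(d, 16)      # detections-like -> description vector

def and_then(*fs):
    def run(x):
        for f in fs:
            x = f(x)
        return x
    return run

captioner = and_then(detect_objects, describe_all)
face_captioner = and_then(detect_objects, filter_faces, describe_all)
```

`face_captioner` performs a task (caption only the people) that none of the modules was individually trained on, which is the point of the composability claim.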
<p>In current real-world implementations, there’s a notion of <em>layers</em> as modular,
pre-specified units, but <em>networks</em> as monolithic models customized for specific
tasks (and requiring end-to-end training). Once we move to modular networks,
though, we can start to perform tasks for which no training data exists. For
example, “write a caption about the people in this image”:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"> <span class="n">load</span><span class="o">(</span><span class="n">image</span><span class="o">)</span> <span class="n">andThen</span>
<span class="n">detectObjects</span> <span class="n">andThen</span>
<span class="n">filter</span><span class="o">(</span><span class="n">name</span><span class="o">(</span><span class="k">_</span><span class="o">)</span> <span class="o">==</span> <span class="nc">Face</span><span class="o">)</span>
<span class="n">andThen</span> <span class="n">describeAll</span></code></pre></figure>
<p>using the same primitives specified above.</p>
<p>Steps we are already taking in this direction: the fact that people use a prefix
of an image classification network to initialize models for basically every
other vision task; the fact that “attention” is suddenly being treated as a
primitive in model descriptions even though it’s a complicated sequence of
operations for combining multiple layers. Roger Grosse’s beautiful <a href="http://www.cs.toronto.edu/~rgrosse/uai2012-matrix.pdf">paper on
grammars over matrix factorization
models</a> also kind of
looks like this, and Christopher Olah has a <a href="http://colah.github.io/posts/2015-09-NN-Types-FP/">discussion of the type-theoretic
niceties</a> of neural nets
understood as collections of reusable modules (though to me this seems largely
secondary to the practical question of what these types are).</p>
<p>To bring this back to the earlier programming language discussion, we observe
that:</p>
<ol>
<li>
<p>It’s hard for a person to write down a person-detector by hand, but easy for
a neural net.</p>
</li>
<li>
<p>Given appropriate functional vision primitives, it’s easy for a person to
write down a person <em>describer</em>. But training a neural net to do this from
scratch requires a lot of examples of people descriptions. (We
might then say it’s “easy” for people but “hard” for a neural net.)</p>
</li>
</ol>
<p>To take this yet a step further, we can note that there are lots of machine
learning techniques that are more human-like than neural-net-like, in the sense
that they do well with tiny data sets and a good pre-specified inventory of
primitives (e.g. program induction, semantic parsing). If we really just care
about minimal human intervention, we can figure out our vision primitives and
then hand them off to a machine learning subsystem of an entirely different
kind.</p>
<p>So let’s write this library! There are research questions here: First, what is
the right set of functional primitives to give people (or models for program
induction)? Next, can these shared representations actually be learned? How do
we find parameter settings for these modules using the kinds of labeled data
currently available?</p>
<p>Disclosure: I already have a model like this working on a bunch of simple
question-answering tasks about images—I think it’s a really exciting
proof-of-concept, and I’ll hopefully be able to show it off soon. But it’s not a
comprehensive solution (esp. if we want to interface between vision / language /
control applications), and I think there’s a really interesting systems problem
here too.</p>
<hr />
<p>Followup:</p>
<ul>
<li><a href="http://arxiv.org/abs/1511.02799">Deep compositional question answering with neural networks</a></li>
<li><a href="http://arxiv.org/abs/1601.01705">Learning to compose neural networks for question answering</a></li>
</ul>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>Obviously this is a gross overstatement, since lots of people do continue to
write assembly and C. But I think it’s less controversial to say that <em>fewer</em>
people write in low-level languages, and that it’s harder to do so correctly. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Sun, 06 Sep 2015 00:00:00 +0000
http://blog.jacobandreas.net/programming-with-nns.html

R&P notes: Theories that are Indirectly Self-Defeating

<h2 id="the-self-interest-theory">The self-interest theory</h2>
<p>First we distinguish between <strong>formal</strong> and <strong>substantive</strong> aims. Formal aims
are “act morally” and “act rationally”, and are essentially meta-ethical
principles; substantive aims are the concrete realizations of the formal aims
according to particular theories of morality or rationality.</p>
<blockquote>
<p>It strikes me as somewhat suspect to put “act rationally” in parallel with
“act morally”. It ought to be sufficient to say “act as morally as possible”;
<em>as possible</em> is the entire criterion for rational action. If I have no moral
aims at all, there is no sense in which I can act irrationally; rational
behavior is defined with respect to some objective.</p>
</blockquote>
<p>One possible substantive aim is that people should act in their self-interest.
This might mean various things (the details don’t matter). In summary we say
simply that each person’s “supremely rational ultimate aim” is that their life
go as well as possible (for some definition of well).</p>
<blockquote>
<p>I don’t understand why the modifier “rational” is necessary to describe this
ultimate aim. It seems like it should be sufficient to say that preferences
over world states are well-ordered, and there is some dominating set of world
states which a person hopes to bring about.</p>
</blockquote>
<h2 id="how-s-can-be-indirectly-self-defeating">How S can be indirectly self-defeating</h2>
<p>A theory T is <strong>indirectly self-defeating</strong> if, when someone tries to achieve
their T-given aims, those aims are worse achieved (compared to a world in which
they made no effort at all). Self-defeat is defined with respect to an
individual (more properly an individual and an environment), and is not a
universal property of moral theories. Indirect self-defeat might happen because
an actor is incompetent (and unable to effectively achieve their own
aims)—this case is uninteresting (here DP asserts that the self-interest
theory is “not too difficult to follow”).</p>
<blockquote>
<p>The claim that choosing the self-interest-maximizing action is <em>easy</em> seems
outrageous—often, such choices are PSPACE-complete! Am I missing
something? We can design optimization problems that are arbitrarily hard;
recognizing a good course of action is easier than constructing one (though
not always itself easy).</p>
</blockquote>
<p>The more interesting case is where the
actor comes to a worse outcome by effectively pursuing a moral theory. A couple
of examples are given, but the prototype here is the prisoner’s dilemma. In
particular, the self-interest theory is indirectly self-defeating for an agent
who always defects, and who advertises to all partners that he will defect.</p>
<blockquote>
<p>W/r/t the prisoner’s dilemma, it is possible (though quite strange) to imagine
an agent that is constitutionally a dominant-strategy player—we have to
assert that all kinds of pre-commitment mechanisms (like hiring someone to
murder them if they ever defect) are totally off-limits. So to the extent that
we are ultimately concerned with <em>human</em> morals, this example seems unhelpful.
I don’t think there are any healthy humans that are totally incapable of
cooperating under any circumstances.</p>
</blockquote>
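<p>The arithmetic behind this kind of indirect self-defeat can be made concrete
with a small sketch. The payoff numbers and the "reciprocating partner" are my
own assumptions for illustration, not from the text; the point is only that a
per-round dominant strategy, once advertised, can yield a worse outcome:</p>

```python
# Illustrative prisoner's-dilemma payoffs (higher is better); the numbers
# are assumed for this sketch. (my_move, partner_move) -> my payoff.
PAYOFF = {
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def reciprocating_partner(advertised_move):
    # A partner who conditions on the advertised disposition:
    # cooperate with known cooperators, defect against known defectors.
    return advertised_move

def outcome(my_disposition):
    # Payoff for an agent who advertises (and then plays) a disposition.
    partner = reciprocating_partner(my_disposition)
    return PAYOFF[(my_disposition, partner)]

# Round by round, defection dominates: against either fixed partner move,
# D pays more than C.
assert all(PAYOFF[("D", p)] > PAYOFF[("C", p)] for p in "CD")

# Yet the advertised obligate defector ends up worse off than an
# advertised cooperator: the theory is indirectly self-defeating for him.
assert outcome("D") < outcome("C")   # 1 < 3
```

<p>The sketch depends entirely on the partner conditioning on the advertisement;
against a partner who ignores dispositions, defection remains best.</p>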
<h2 id="does-s-tell-us-to-be-never-self-denying">Does S tell us to be never self-denying?</h2>
<p>Under the self-interest theory, rationality is precisely the condition of always
acting in one’s self-interest. A rational agent ought to maintain only those
beliefs and goals that further their self-interest. These beliefs might be
irrational! If for some reason I am happier believing in Russell’s teapot than
not, I should do whatever is necessary to believe in the teapot. Likewise, an
agent for whom (rationally) always following self-interest leads to worse
outcomes ought not to behave rationally.</p>
<blockquote>
<p>It seems that we’ve misread DP w/r/t the prisoner’s dilemma
discussion. Previously we imagined an obligate dominant strategy player, and
observed that it achieved a worse outcome than it might if it were not
obligated to play the dominant strategy. Now this agent is apparently
choosing between (“rationally”) playing the dominant strategy and
(“irrationally”) cooperating. Obviously if cooperating leads to better
outcomes, it is rational. A few explanations of what might be going on here:</p>
<ol>
<li>DP is wrong about game theory</li>
<li>DP imagines a (more-and-more strangely-constructed) agent who wrongly
understands what it means to be rational, and thinks that choosing to be
rational requires defection. (We previously dismissed such agents as
uninteresting.)</li>
</ol>
<p>More generally, we seem to be making a distinction between “meta-level”
rationality (being rational w/r/t choice of decision-making procedures) and
“ordinary” rationality (being rational w/r/t non-meta decisions). This
distinction seems arbitrary and unhelpful, but also necessary to explain why
it might be “rational to make myself irrational”.</p>
</blockquote>
<p>Recall that rationality is a formal, not substantive aim; it is a means to
achieve the goals of self-interest, but not (necessarily) itself part of those
goals. S, coupled with a particular theory of rationality, says that the
supremely rational disposition is to be never self-denying, but that the aims
of S are better achieved by not holding this disposition.</p>
<blockquote>
<p>If this is the only point of the whole preceding discussion, then maybe the
problems we’ve raised don’t much matter. The confusion is that DP means
by “rational” something other than the standard economic definition. From now
on I will denote DP’s rationality “P-rationality” to distinguish it from the
ordinary kind.</p>
</blockquote>
<h2 id="why-s-does-not-fail-in-its-own-terms">Why S does not fail in its own terms</h2>
<p>So does S, by being indirectly self-defeating, fail in its own terms? No;
only directly self-defeating theories fail in their own terms. Because
rationality is a formal aim, avoiding indirect self-defeat does not require
that we behave P-rationally—in fact, it requires that we don’t!</p>
<p>But can we actually choose whether or not to behave P-rationally? There are
actually two pieces to this:</p>
<ol>
<li>Do I believe that “rational” means “P-rational”?</li>
<li>Must I act in a way that I believe to be rational? (Can I change my
disposition?)</li>
</ol>
<p>Suppose I cannot change my disposition. Then it is the case that my disposition
tells me to act rationally. So I should simply change my belief about
rationality to be something other than P-rationality, and there’s no problem.
Suppose instead that I can change my disposition without changing my belief
about rationality. Then I should simply change my disposition, and there’s no
problem. The final possibility is that I can change neither my belief nor my
disposition; we will return to this case.</p>
<blockquote>
<p>My difficulty up to here has been the notion of a belief about rationality:
rationality should come before belief, and is a framework for producing true
beliefs. But of course this is not true in the real world! Otherwise we
wouldn’t have whole internet communities devoted to changing people’s belief
about rationality. The fact that DP takes P-rationality to be the default
belief is a little strange, but at this point it’s a sociological claim rather
than a philosophical one, so there’s no trouble yet.</p>
</blockquote>
<h2 id="could-it-be-rational-to-cause-oneself-to-act-irrationally">Could it be rational to cause oneself to act irrationally?</h2>
<p>Now we get an example of a third belief about rationality. Suppose a criminal
breaks into my house, and threatens to harm me if I don’t give him my gold. It
is rational to give him my gold, but even better if I can make myself immune to
his demands. One way to do this is to temporarily render myself completely
irrational. Here rationality tells me that I should (at least temporarily)
render myself non-rational rather than P-rational or rational.</p>
<blockquote>
<p>Note that this is equivalent to a pre-commitment scheme, in which I pre-commit
to not changing my behavior in response to torture. It is nullified if the
criminal says something like “I will harm you if you fail to give me your gold
<em>or</em> you adopt any pre-commitment scheme.” Are all changes in one’s belief
about rationality equivalent to pre-commitment schemes?</p>
<p>Here’s an equivalent perspective: given a fixed objective function (a “theory”
to DP), rationality tells me how to best maximize that objective function. In
the robber case, it is useful for me to temporarily force myself to maximize a
different objective function (we can take “irrational behavior” to correspond
to rationality with respect to a constant objective). (A pre-commitment scheme is a
special case of this, where I assign negative utility to some outcomes.) Is it
the case that for any <em>strategy</em>, there is always some <em>objective</em> that will
cause me to exhibit that strategy when behaving rationally? This seems like
the sort of thing that economists have already proved; if it’s true, then I
can always behave rationally and rely on pre-commitment / objective changes to
obtain the same result that DP discusses.</p>
<p>All of this assumes that I am totally free to adopt pre-commitment schemes,
which may not be true in practice.</p>
</blockquote>
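<p>The closing question above can be made concrete in at least one trivial
setting. The construction below is my own toy assumption (one-shot choice,
full observability, no beliefs or uncertainty): for any fixed strategy, an
indicator objective makes that strategy the uniquely rational one:</p>

```python
def rationalize(strategy):
    # Build an objective that assigns utility 1 to the strategy's chosen
    # action in each state, and 0 to every other action.
    def utility(state, action):
        return 1 if action == strategy(state) else 0
    return utility

def rational_action(utility, state, actions):
    # Behave "rationally": pick the action that maximizes the objective.
    return max(actions, key=lambda a: utility(state, a))

# Example: a strategy that refuses the robber regardless of the threat.
strategy = lambda state: "refuse"
utility = rationalize(strategy)
for state in ["threatened", "not threatened"]:
    assert rational_action(utility, state, ["comply", "refuse"]) == strategy(state)
```

<p>So in this degenerate setting the answer is yes; whether it survives
beliefs, uncertainty, and dynamic consistency is exactly the economists’
question raised above.</p>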
<h2 id="how-s-implies-that-we-cannot-avoid-acting-irrationally">How S implies that we cannot avoid acting irrationally</h2>
Sun, 06 Sep 2015 00:00:00 +0000
http://blog.jacobandreas.net/parfit-1.html
reading