Is there a role for fake language datasets in the AI ecosystem? (By “fake language” I mean things like bAbI, CLEVR, Karthik’s “Home” world [pdf link], and this DeepMind paper.) Claims that various learning architectures can “do language processing” based on results from these datasets have caused a lot of hand-wringing in the NLP community. While a lot of this is due to good old-fashioned overclaiming, I’ve become persuaded some of it is miscommunication between two groups that mean totally different things by “language data”.

I’m going to focus here on instruction following, but I think there’s a similar story to tell about lots of other grounded tasks like question answering, generation, etc.1 For a long time, an AI researcher’s view of an instruction following problem was something like this:

Language -> Abstraction -> Behavior

That is, we start with whatever utterances people are generating, map them to some kind of clean, structured representation, and then make decisions about how to act on the basis of that structure. Because this whole pipeline was too hard to tackle all at once, the community mostly started working on it from different ends.2
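To make the factorization concrete, here's a toy sketch (the function names, the mini logical language, and the hard-coded lexicon are all invented for exposition, not anyone's actual system):

```python
# Toy sketch of the classical pipeline. Everything here is hypothetical
# scaffolding invented for illustration, not a real system.
from typing import List

def parse(utterance: str) -> str:
    """Language -> Abstraction: map free-form text into a designed formalism."""
    lexicon = {
        "go to the kitchen": "goto(kitchen)",
        "pick up the mug": "grasp(mug)",
    }
    return lexicon[utterance.lower()]  # a real parser must handle the long tail

def plan(goal: str) -> List[str]:
    """Abstraction -> Behavior: turn a structured goal into primitive actions."""
    plans = {
        "goto(kitchen)": ["turn_left", "forward", "forward"],
        "grasp(mug)": ["reach", "close_gripper"],
    }
    return plans[goal]

def follow_instruction(utterance: str) -> List[str]:
    """The full pipeline that was too hard to build all at once."""
    return plan(parse(utterance))

print(follow_instruction("Go to the kitchen"))  # ['turn_left', 'forward', 'forward']
```

Roughly, the two communities below split along this seam: one builds parse, the other builds plan.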


“Language people” worked on this:

Language -> Abstraction

In this picture, language comes to you from the outside world—you don’t control the distribution. You get to design the language of abstractions, but it had better be able to handle (or at least fail gracefully on) whatever utterances the world throws at you. Linguists give us a nice abstraction formalism in the form of logic, and tell us that the way to get from abstraction to behavior is just logical interpretation. So it’s very easy to say “abstraction = formal semantics” and treat the Abstraction -> Behavior edge as someone else’s problem.

Data is collected from human speakers who don’t necessarily know anything about logical forms. Indeed, decisions about details of the logical language are typically worked out after collecting initial annotations. What distinguishes “language data” from other kinds of data is precisely the fact that it was generated by human users. (If we’re generating data from a fake grammar and mapping it onto logical forms, we generally haven’t learned anything about language that we didn’t write down in the first place.)
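For flavor, here’s a made-up miniature of what such a dataset looks like (the pairs and the logical language are invented, not drawn from any real corpus): the utterances come from people, and the formalism is something the annotators settled on afterward.

```python
# A made-up miniature of "language data" on this view: uncontrolled human
# utterances paired with logical forms in a formalism designed after the
# fact. Purely illustrative.
dataset = [
    ("put the red block on the table", "put(red_block, on(table))"),
    ("could you stack up the blocks please?", "stack(all(blocks))"),
    ("blocks on table plz", "put(all(blocks), on(table))"),
]

# The learning problem: fit a parser to pairs like these and hope it fails
# gracefully on whatever else people type.
for utterance, logical_form in dataset:
    print(f"{utterance!r:45} -> {logical_form}")
```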


“Policy people” (broadly understood to include everything from RL to planning to classical control) worked on this:

Abstraction -> Behavior

In this picture, the scope of acceptable abstractions is up to the system designer—it’s in behavior where details of the real world (physics, etc.) intervene. Abstraction languages range from “do one of these 10 specific things” to “satisfy this STRIPS goal”. In particular, an abstraction language that doesn’t support all possible goals is no more problematic than a remote control that doesn’t operate all appliances at once.3 The thing that I originally found difficult to appreciate is just how hard some of these problems are even when we have complete control over the input distribution. Reinforcement learning is hard. Planning is hard. There’s still lots of room for interesting compositionality in these abstraction languages—if I have some kind of structured representation of the goal, and I train on a subset of structures, do I generalize to the rest? There’s lots we still can’t do.

To come up with problems that are within reach of current methods, data is generated rather than collected. The distribution over abstractions and their induced behaviors is hand-designed. There’s no language data here; what distinguishes “language data” from the data that does get used is that natural language has no precise execution semantics, while e.g. STRIPS does.
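Here’s an equally made-up miniature of the generated setting (my own toy, not any particular benchmark): the designer writes down the distribution over abstractions, and every sampled abstraction comes with a precise execution semantics in the form of a goal test.

```python
# Toy of the "policy people" setting: the abstraction language and the
# distribution over goals are hand-designed, and every goal has a precise
# execution semantics. All names are invented for illustration.
import random

OBJECTS = ["mug", "block", "apple"]
PLACES = ["table", "shelf", "sink"]

def sample_goal():
    """Sample an abstraction like ('on', 'mug', 'shelf')."""
    return ("on", random.choice(OBJECTS), random.choice(PLACES))

def goal_test(goal, state):
    """Execution semantics: does this world state satisfy the goal?"""
    _, obj, place = goal
    return state.get(obj) == place

goal = sample_goal()
print(goal, goal_test(goal, {"mug": "shelf", "block": "table"}))
```

A compositional-generalization experiment here is just a decision about which (object, place) combinations are allowed to appear in the training sample.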


In the last few years these two communities have run together, because the world looks like this:

Language -> Behavior

Everything is end-to-end all the time. Abstraction isn’t gone, but now it lives in some uninterpretable representation space rather than a formalism we’ve designed by hand. This is a great thing! Language people no longer have to limit themselves to worlds where they’re clever enough to construct a good enough logical language.

And policy people (here’s where the trouble starts) no longer have to describe their task inventory in terms of any particular formalism: they just need some way of generating reward functions / goal tests synchronously with some kind of (compositional?) identifier that describes them. So they generate interpretable strings made of sequences of words. No execution semantics, uses English words: natural language. Whence the confusion.
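Here’s a caricature of how those strings get minted (mine, not from any of the papers linked above): the word sequence and the reward function it names are generated together from the same template.

```python
# Caricature of synthetic "instruction" generation: the word-sequence
# identifier and the reward function are minted together. No human speaker
# is involved anywhere. Names are invented for exposition.
import random

COLORS = ["red", "green", "blue"]
SHAPES = ["ball", "key", "box"]

def sample_task():
    color, shape = random.choice(COLORS), random.choice(SHAPES)
    instruction = f"go to the {color} {shape}"  # looks like English
    def reward(state):  # precise execution semantics, unlike real language
        return float(state.get("agent_at") == (color, shape))
    return instruction, reward

instruction, reward = sample_task()
print(instruction, reward({"agent_at": ("red", "ball")}))
```

The strings use English words and even compose, but the distribution over them belongs entirely to the designer.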

I think there’s a lesson in this for people in both communities:

  • For “policy people” as researchers: please please please signpost explicitly when your input strings were generated synthetically. The word language is hopelessly overloaded at this point, but the bigram natural language is not: avoid using the word natural unless people were involved. (A couple of the papers linked in the first paragraph are guilty of this.)

  • For “language people” as reviewers: respond to appropriately qualified fake language datasets by asking “Is an interesting Abstraction -> Behavior problem being solved? Do strings index the target class of behaviors in an interesting way?” There are lots of problems out there for which this is an appropriate standard.

I think we’re still at a stage where there’s something to learn from fake language, even those of us who ultimately care about the distribution of sentences produced by humans.


  1. Explicitly excluded from this discussion is work on doing linguistic analysis of fake language data. Such work is not about language at all; it is either about analyzing the formal expressive capacity of certain model classes, or it is garbage.

  2. Not everyone! Both Stefanie Tellex and Branavan had work that was trying to tackle the complete pipeline well before the current end-to-end craze. I think this research is a model for what the field should be trying to accomplish, but for lots of problems our techniques just aren’t there yet. 

  3. There’s a sense in which Jonathan Berant and Percy Liang’s work on paraphrasing for semantic parsing [pdf link] actually belongs in this category, rather than Language -> Abstraction. But they still hold themselves to a “real language” evaluation standard.