Making the Elixir Parser atom-safe

When I gave my talk about the Elixir parser on Elixirconf EU, I was introduced by the host as: “the guy who does things with the Elixir parser that nobody ever thought of”. I’ll take that as a compliment, I guess :-) This post is about one part of that talk: how the Elixir parser used to create atoms while parsing and how I contributed to fixing that.

TL;DR

The upcoming Elixir version (1.9) will include an static_atoms_encoder option which allows you to override the parser’s atom-creation behaviour.
Contributing to Elixir is super easy and immensely rewarding.

Introduction

At Botsquad, we let people program chatbots by writing code in a specialized programming language, Bubblescript, which looks a bit like this:

Bubblescript example code

If you want to understand how this technically works, you could read an earlier blog post that I wrote about it. However for the purpose of this post, it is sufficient to understand that we use Elixir’s parser to parse the code and then execute it using the Bubblescript interpreter.

Elixir’s parser takes a string and transforms it into a data structure called the AST, an Abstract Syntax Tree. The parser can be run using the Code.string_to_quoted/2 function.

Given the following, “Elixir-ish” code:

dialog main do
  say "Hello world"
end

Will be transformed by the parser into the following AST data structure:

{:dialog, [line: 1],
 [
   {:main, [line: 1], nil},
   [
     do: {:say, [line: 2],
      ["Hello world!"]}
   ]
 ]}

To execute this as a chat script, the Bubblescript interpreter looks at this AST and “walks” over each line in the dialog to output a string to the chat client. This works all very well. …but, I hear you think, “Is It Safe?”

The Atom issue

Elixir’s documentation clearly warns us for it, you should not take user input and create atoms from it, in any way. This is because the VM’s atom table has a fixed size, and atoms that are allocated are never garbage collected. So if your atom count exceeds 1M atoms, the VM will just crash, and not in a friendly way.

For this reason, using the Elixir parser on user input is unsafe: Code.string_to_quoted/2 creates atoms while parsing code into an AST. It creates atoms for each identifier that it encounters: names of dialogs, functions, variables, et cetera.

For instance, the line my_variable = 42 translates to the following AST:

{:=, [line: 1],
 [{:my_variable, [line: 1], nil}, 42]}

You see that a new atom, :my_variable, is returned by the parser. So is it safe to let users type their own code this way? No.

Sidetrack: allowing existing atoms only

The parser has an option, existing_atoms_only: true, and if you pass it, it will refuse to parse a string when it encounters an unknown atom (basically using String.to_existing_atom/2). That may sound like a possible solution, until you realise that because we allow users to write their own scripts, in a programming language, we do not want them to be restricted in any naming conventions. They need to be able to choose their own variable names, dialog names et cetera. So we cannot use the existing atoms option while parsing their code, unfortunately.

The quickfix

Looking into Elixir’s parser, I discovered that the atoms were created as part of the tokenization process; the process of splitting the text into a list of identifiers, strings, et cetera. This is done in the elixir_tokenizer.erl file. There is a function there, unsafe_to_atom/4, which takes a string and returns an atom. Or, when this existing_atoms_only option is passed, it throws an error when the atom does not yet exist.

My quickfix was to piggyback on this existing_atoms_only option, to add a :safe flag, which, instead of creating atoms, tags each would-be atom as a tuple, indicating that it should really be an atom.

The patch was fairly trivial and looked something like this:

The first fix

And I presented that existing_atoms_only: :safe “patch” in my talk at Elixirconf EU:

The quickfix

While it wasn’t pretty, it did the job, and you can actually still test this on the AST Ninja if you add the Tokenizer panel and check the ‘safe’ option.

The final fix

Before the conference I had already started a parser-related thread on the elixir-lang-core mailing list, so I found myself lucky because mr. José Valim himself was attending my talk :-)

Afterwards, he came up to me and we discussed my solution. We settled on using a callback-style parameter which handles the encoding of atoms. That way, it is left to the user how to deal with unknown atoms. Over the next couple of days, we discussed the exact implementation details on the mailing list. And again a few days later I submitted a pull request, introducing the static_atoms_encoder option to the Code.string_to_quoted/2 function. After a few back-and-forths with the core team members it got merged. From the docs:

[the static_atoms_encoder function] is supposed to create an atom from the given string. It is required to return either {:ok, term}, where term is an atom. It is possible to return something else than an atom, however, in that case the AST is no longer “valid” in that it cannot be used to compile or evaluate Elixir code. A use case for this is if you want to use the Elixir parser in a user-facing situation, but you don’t want to exhaust the atom table.

As a usage example, in the following snipped, the given encoder function returns {:atom, …} instead of returning real atoms:

encoder = fn string, _meta -> {:ok, {:atom, string}} end
Code.string_to_quoted("hello world", static_atoms_encoder: encoder)

Which gives us:

{:ok, {{:atom, "hello"}, [line: 1],
       [{{:atom, "world"}, [line: 1], nil}]}}

So, this way we have now a safe way of parsing any Elixir code into an AST! Problem solved! 🎉

Conclusion

This seemingly small issue has been in the back of my head for a long time. I am glad that as soon as we upgrade our platform to the next Elixir version, we don’t have to worry about it anymore.

Besides that, going through this experience of contributing to Elixir has been a great learning experience for me. The core team members do a great job of managing all of this and I have the utmost respect of them, at the way they manage issues, expecations from users, delegate tasks, and set boundaries for what should go into the Elixir core and what not. And this all while being very approachable: polite, friendly and open minded to new and experienced developers alike. And did I mention super-responsive?

As I have hopefully shown, contributing to Elixir can be easy and immensely rewarding. So if you have a nagging issue, or even if you just spot a typo in the docs somewhere, you know what to do: read the contributing guildelines, maybe discuss it first, on IRC, the Elixir forum or the core mailinglist, and create a PR. It’s as simple as that :-)

Arjan Scherpenisse

Co-founder Botsquad

email me