Regular vs Context-Free Languages
Alpaca splits language processing into two stages — lexer and parser — because the two stages handle fundamentally different classes of languages. The lexer recognizes regular languages; the parser recognizes context-free languages. This page explains what that distinction means and why it matters.
The Chomsky Hierarchy (at a Glance)
Formal languages are classified by the power of the grammar needed to describe them:
| Type | Name | Recognizer | Example |
|---|---|---|---|
| 3 | Regular | Finite automaton | Token patterns ([0-9]+, [A-Za-z]+) |
| 2 | Context-free | Pushdown automaton | Nested structures ([...], {...}) |
| 1 | Context-sensitive | Linear-bounded automaton | Type checking, scope rules |
| 0 | Recursively enumerable | Turing machine | Arbitrary computation |
Each level strictly contains the one below it. Every regular language is context-free, but not every context-free language is regular.
What Makes a Language Regular?
A language is regular if it can be described by a regular expression (or equivalently, recognized by a finite automaton with no stack or memory).
Regular languages can express:
- Fixed strings:
"if","while" - Character classes:
[0-9]+,[A-Za-z_]+ - Alternation:
"true" | "false" - Repetition:
\s+,[a-z]*
Regular languages cannot express:
- Balanced nesting: matched
[and]pairs - Counting: "exactly as many
as asbs" - Cross-references: "the string before
=must match the string after="
Why Balanced Brackets Are Not Regular
Consider the language of balanced square brackets: [], [[]], [[[]]], etc. To recognize this language, a machine must count how many opening brackets it has seen and verify the same number of closing brackets follows. A finite automaton has a fixed number of states — it cannot count to an arbitrary depth.
The BrainFuck lexer can track bracket depth (using ctx.squareBrackets) and reject some malformed inputs — an unmatched ] or leftover [ at end of input. But that is still not full context-free parsing: the lexer emits a flat token stream and does not build nested structure. The parser's grammar rules and parse stack turn [ ... ] sequences into the nested While nodes in the AST.
The pumping lemma (informal)
The pumping lemma for regular languages states: for any regular language L, there exists a length p such that any string s in L with |s| >= p can be split into s = xyz where |xy| <= p, |y| > 0, and xy^i z is in L for all i >= 0.
For balanced brackets, pumping the opening brackets produces strings like [[[]] — unbalanced, and not in the language. This contradicts the lemma, proving balanced brackets are not regular.
Why the Lexer/Parser Split?
Given that context-free grammars are strictly more powerful, why not use one grammar for everything — including token recognition?
Performance. Regular language recognition (finite automata) runs in O(n) with no backtracking. Context-free parsing (LR) also runs in O(n) but with a larger constant factor due to the parse stack. Processing characters one at a time through a DFA is faster than pushing and popping a parse stack for every character.
Simplicity. Token patterns are naturally described by regex. Expressing [0-9]+(\.[0-9]+)? as a context-free grammar requires multiple productions — the regex is both shorter and clearer.
Separation of concerns. The lexer handles character-level details (whitespace, comments, escape sequences) and produces a clean token stream. The parser handles structural details (nesting, precedence, associativity). Neither needs to know about the other's concerns.
BrainFuck Mapping
In BrainFuck>:
- Regular (lexer handles): recognizing
>,<,+,-,.,,,[,],!, function names ([A-Za-z]+), and counting brackets viactx - Context-free (parser handles): matching
[with]to formWhileloops, matchingname(body)to formFunctionDef, sequencing operations into lists via.List
The lexer emits a flat stream of tokens. The parser gives them structure.
Cross-links
- See The Lexer: Regex to Finite Automata for how Alpaca compiles regex into automata.
- See Context-Free Grammars for the formal definition of CFGs and how Alpaca maps them to parser rules.
