# Lexer Error Recovery
The Alpaca lexer provides two layers of error feedback: compile-time validation that catches many problems before your program runs, and runtime error strategies for input that does not match any pattern.
> **Under the hood: compile-time validation.** The `lexer` macro validates token definitions at compile time. Pattern shadowing (`ShadowException`), invalid regex syntax, and unsupported guards are all caught during compilation. The macro performs pairwise regex inclusion checks using the dregex library to ensure every pattern is reachable.
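The inclusion check can be pictured with a brute-force approximation: sample short strings over a small alphabet and test whether everything the later pattern matches, the earlier one matches too. This is an illustrative sketch only (the name `approxShadows` is made up here, and sampling can report false positives); the real macro uses dregex for an exact, alphabet-independent inclusion test:

```scala
/** Illustrative only: approximate "pattern `second` is shadowed by `first`"
  * by checking that every sampled string `second` matches, `first` matches too.
  * The actual compile-time check uses dregex for an exact language-inclusion test. */
def approxShadows(first: String, second: String, alphabet: Seq[Char], maxLen: Int): Boolean =
  def strings(n: Int): Seq[String] =
    if n == 0 then Seq("")
    else for s <- strings(n - 1); c <- alphabet yield s + c
  val samples = (0 to maxLen).flatMap(strings)
  val (a, b) = (first.r, second.r)
  samples.filter(b.matches).forall(a.matches)

@main def shadowCheck(): Unit =
  // Mirrors the ID/WORD example below: over letter-only samples,
  // the second pattern is fully covered by the first.
  println(approxShadows("[a-c]+", "[a-c][a-c0-9]*", 'a' to 'c', 3)) // true
```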
## Compile-Time Errors
### ShadowException
A `ShadowException` occurs when one pattern can never match because an earlier pattern always matches first. If every string that pattern B can match is also matched by pattern A, and A appears before B, then B is unreachable:
```scala
import alpaca.*

// This does NOT compile -- ShadowException
val Lexer = lexer:
  case "[A-Za-z]+" => Token["ID"]              // general: any letters
  case "[A-Za-z][A-Za-z0-9]*" => Token["WORD"] // ERROR: shadowed by ID
```
The fix: move the more specific pattern before the more general one, or remove the duplicate.
### Invalid Regex
Malformed Java regex patterns -- unmatched parentheses, invalid quantifiers, bad character class syntax -- produce a compile-time error. The message identifies the pattern and includes the underlying regex engine error.
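The flavor of that underlying message is easy to see outside the macro: compiling a malformed pattern with `java.util.regex` directly throws a `PatternSyntaxException` carrying the engine's description (a plain-JVM illustration, not Alpaca-specific):

```scala
import java.util.regex.{Pattern, PatternSyntaxException}

@main def showRegexError(): Unit =
  try
    Pattern.compile("[A-Za-z") // unterminated character class
    ()
  catch case e: PatternSyntaxException =>
    println(e.getDescription) // prints "Unclosed character class"
```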
### Guards Not Supported
Pattern guards (`case "regex" if condition =>`) are not supported in lexer rules. The workaround is to move the condition into the rule body:
```scala
import alpaca.*

case class BrainLexContext(
  var squareBrackets: Int = 0,
) extends LexerCtx

val BrainLexer = lexer[BrainLexContext]:
  case "\\[" =>
    ctx.squareBrackets += 1
    Token["jumpForward"]
  case "\\]" =>
    // Validate inside the body, not as a guard
    require(ctx.squareBrackets > 0, "Mismatched brackets")
    ctx.squareBrackets -= 1
    Token["jumpBack"]
  case "\\+" => Token["inc"]
  case "." => Token.Ignored
```
## Pattern Ordering
Patterns are tried in the order they appear. The first match wins. The general rule: more specific patterns before more general ones.
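The matching loop can be modeled outside Alpaca with an ordered list of `java.util.regex` patterns tried at the start of the input; the first one that matches a prefix wins (a simplified model for illustration, not the library's implementation):

```scala
import java.util.regex.Pattern

// Simplified model: try patterns in order; first prefix match wins.
def firstMatch(patterns: Seq[(String, String)], input: String): Option[(String, String)] =
  patterns.iterator.flatMap { (name, regex) =>
    val m = Pattern.compile(regex).matcher(input)
    if m.lookingAt() then Some(name -> m.group()) else None
  }.nextOption()

@main def orderDemo(): Unit =
  // Specific literal dot before the catch-all: the literal wins on "."
  println(firstMatch(Seq("print" -> "\\.", "ignored" -> "."), ".x"))
  // Reversed, the catch-all always wins -- the shadowing the macro rejects
  println(firstMatch(Seq("ignored" -> ".", "print" -> "\\."), ".x"))
```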
In the BrainFuck lexer, this matters for the print command vs the catch-all:
```scala
import alpaca.*

// RIGHT -- literal dot before catch-all dot
val BrainLexer = lexer:
  case "\\." => Token["print"] // specific: literal dot (BF print)
  case "." => Token.Ignored    // general: any character (catch-all)
```
If you reverse the order, `"."` shadows `"\\."` and you get a `ShadowException`.
The same applies to keywords vs identifiers. Function names in the extended BrainFuck lexer must come after command tokens:
```scala
import alpaca.*

// RIGHT -- single-char commands before the general name pattern
val BrainLexer = lexer:
  case "\\+" => Token["inc"]
  case "-" => Token["dec"]
  // ... other single-char commands ...
  case name @ "[A-Za-z]+" => Token["functionName"](name) // general: after commands
  case "." => Token.Ignored
```
## Runtime Error Handling
When `tokenize()` hits a character that matches no pattern, the lexer consults the `ErrorHandling` strategy for the context type.
### Default Behavior
The default strategy throws a `RuntimeException`:

```text
Unexpected character at line 1, position 5: '@'
```
With `LexerCtx.Default`, the error message includes line and position. With `LexerCtx.Empty` or a custom context without tracking, it shows only the character.
### Error Handling Strategies
You can provide a custom `ErrorHandling` instance for your context type. Four strategies are available:
| Strategy | Behavior |
|---|---|
| `Throw(ex)` | Throw the given exception, aborting tokenization immediately |
| `IgnoreChar` | Skip the single unmatched character and continue |
| `IgnoreToken` | Skip to the next successful match and continue |
| `Stop` | Stop tokenization gracefully, returning the lexemes collected so far |
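The four behaviors can be sketched with a toy single-pattern tokenizer. The `Recovery` enum below only mirrors the table; it is plain JVM regex code for illustration, not the Alpaca API:

```scala
import java.util.regex.Pattern

// Toy model of the table above: one token pattern plus a recovery strategy.
enum Recovery:
  case Throw(ex: Exception)
  case IgnoreChar
  case IgnoreToken
  case Stop

def toyTokenize(pattern: String, input: String, onError: Recovery): List[String] =
  val p = Pattern.compile(pattern)
  def matchAt(i: Int): Option[(String, Int)] =
    val m = p.matcher(input).region(i, input.length)
    if m.lookingAt() then Some(m.group() -> m.end) else None
  def loop(i: Int, acc: List[String]): List[String] =
    if i >= input.length then acc.reverse
    else matchAt(i) match
      case Some((tok, next)) => loop(next, tok :: acc)
      case None => onError match
        case Recovery.Throw(ex)   => throw ex         // abort immediately
        case Recovery.IgnoreChar  => loop(i + 1, acc) // skip one character
        case Recovery.IgnoreToken =>                  // skip to the next match
          val j = (i + 1 to input.length).find(k => matchAt(k).isDefined)
          loop(j.getOrElse(input.length), acc)
        case Recovery.Stop        => acc.reverse      // keep what we have
  loop(0, Nil)

@main def recoveryDemo(): Unit =
  println(toyTokenize("[a-z]+", "ab!!cd", Recovery.IgnoreChar)) // List(ab, cd)
  println(toyTokenize("[a-z]+", "ab!!cd", Recovery.Stop))       // List(ab)
```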
```scala
import alpaca.*
import alpaca.internal.lexer.ErrorHandling

case class BrainLexContext(
  var squareBrackets: Int = 0,
  var position: Int = 1,
  var line: Int = 1,
) extends LexerCtx with PositionTracking with LineTracking

// Custom error handler: report position and throw
given ErrorHandling[BrainLexContext] = ctx =>
  ErrorHandling.Strategy.Throw:
    RuntimeException(s"Unexpected character at line ${ctx.line}, position ${ctx.position}: '${ctx.text.charAt(0)}'")
```
To silently skip unknown characters (useful for BrainFuck where non-command characters are comments):
```scala
import alpaca.*
import alpaca.internal.lexer.ErrorHandling

// Skip unrecognized characters instead of throwing
given ErrorHandling[LexerCtx.Default] = _ =>
  ErrorHandling.Strategy.IgnoreChar
```
Note that the BrainFuck lexer from Getting Started already handles this more explicitly with a `case "." => Token.Ignored` catch-all pattern, which is the recommended approach when you want to ignore unknown input.
## Limitations
- **No skip-and-continue by default.** The default strategy aborts on the first unmatched character. Use a custom `ErrorHandling` or a catch-all pattern for resilience.
- **Guards are not supported.** Pattern guards in lexer rules are a compile-time error. Move conditions into rule bodies.
- **Error position is only available with tracking traits.** Without `PositionTracking` or `LineTracking`, the error message shows only the character, not its location.
