Skip to content

Multi-character Tokens

We can extend the concept of reading two-character symbols to those which contain more than two characters. Tokens like keywords and identifiers, e.g. import, true or false, contain varying number of characters and need to be handled in a special way. Let’s use the default block of the switch statement to implement this logic.

Identifiers and keywords

src/lexer/index.ts
switch (this.currentChar) {
default: {
// Identifier/Keyword/Primitive type
if (/[A-Za-z_]/.test(this.currentChar)) {
let value = this.currentChar;
this.pos++;
while (
this.src[this.pos] !== undefined &&
/[A-Za-z0-9_]/.test(this.src[this.pos])
) {
value += this.src[this.pos];
this.pos++;
}
if (GOM_KEYWORDS.has(value)) {
return {
type: getKeywordType(value),
value,
start: this.pos - value.length,
end: this.pos - 1,
};
}
if (GOM_PRIMITIVE_TYPES.has(value)) {
return {
type: getPrimitiveType(value),
value,
start: this.pos - value.length,
end: this.pos - 1,
};
}
return {
type: GomToken.IDENTIFIER,
value,
start: this.pos - value.length,
end: this.pos - 1,
};
}
throw new SyntaxError({
start: this.pos,
message: `Unidentified character '${this.currentChar}'`,
});
}
}

To match an identifier or a keyword, we first test if the starting character matches the regular expression /[A-Za-z_]/ , i.e. letters or underscore. Then, we run a second while loop to keep matching characters until the next character is not one of /[A-Za-z0-9_]/. This way we can match multi-character tokens that contain alphanumeric characters (plus underscore).

Once we have the token, we check if it matches any Gom keyword or primitive type name and return the corresponding token. If it doesn’t, we return an GomType.IDENTIFIER token. Identifiers denote any user-defined values like variable and function names, function arguments, custom types and package names.

Literals

The Gom syntax states that the language has two types of literals — string and numeric. These can be handled similar to keywords and identifiers. The characteristic of a string literal is that it is always enclosed in " " , and that of numeric literal is that all of its characters are digits. This information helps in writing the logic to scan them.

src/lexer/index.ts
switch(this.currentChar) {
...
default: {
...
// String literal
if (this.currentChar === '"') {
let value = this.currentChar;
let start = this.pos;
this.pos++;
while (this.src[this.pos] !== '"') {
value += this.src[this.pos];
this.pos++;
}
value += this.src[this.pos];
this.pos++;
return {
type: GomToken.STRLITERAL,
value,
start,
end: this.pos - 1,
};
}
}
}

Similarly, number literals can be matched using /[0-9]/ expression.