Lexical Analyzer C Program
Designing a lexical analyzer (lexer) in C involves creating a program that reads source code and identifies tokens while skipping whitespace and comments and keeping identifiers to a reasonable length. Here’s a simplified version of such a lexer.
Lexical Analyzer in C
- Token Types: Define the types of tokens you will recognize (e.g., keywords, identifiers, numbers, operators).
- Ignore Whitespace and Comments: Handle spaces, tabs, newlines, and comments.
- Identifier Length Limit: Set a maximum length for identifiers.
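For example, given the input line int count = 42;, a lexer that follows these rules would emit the tokens KEYWORD(int), IDENTIFIER(count), OPERATOR(=), NUMBER(42), and OPERATOR(;).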
Example Code
Below is a basic implementation of a lexical analyzer in C.
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <string.h>
#define MAX_IDENTIFIER_LENGTH 32
#define MAX_TOKEN_LENGTH 64
typedef enum {
    TOKEN_IDENTIFIER,
    TOKEN_NUMBER,
    TOKEN_OPERATOR,
    TOKEN_KEYWORD,
    TOKEN_EOF,
    TOKEN_UNKNOWN
} TokenType;

typedef struct {
    TokenType type;
    char value[MAX_TOKEN_LENGTH];
} Token;
const char *keywords[] = {"if", "else", "int", "while", "return", NULL};
int is_keyword(const char *identifier) {
    for (int i = 0; keywords[i] != NULL; i++) {
        if (strcmp(identifier, keywords[i]) == 0) {
            return 1;
        }
    }
    return 0;
}
// Skip whitespace and comments; return the first character of the next
// token, or EOF if the input is exhausted.
int skip_whitespace_and_comments(FILE *source) {
    int ch;
    while ((ch = fgetc(source)) != EOF) {
        if (isspace(ch)) {
            continue;
        }
        if (ch == '/') {
            int next = fgetc(source);
            if (next == '/') {
                // Skip single-line comment up to the end of the line
                while ((ch = fgetc(source)) != '\n' && ch != EOF)
                    ;
            } else if (next == '*') {
                // Skip multi-line comment up to the closing "*/"
                int prev = 0;
                while ((ch = fgetc(source)) != EOF) {
                    if (prev == '*' && ch == '/') {
                        break;
                    }
                    prev = ch;
                }
            } else {
                // A lone '/' is an operator, not a comment
                if (next != EOF) {
                    ungetc(next, source);
                }
                return ch;
            }
        } else {
            return ch;
        }
    }
    return EOF;
}
Token get_next_token(FILE *source) {
    Token token;
    token.type = TOKEN_UNKNOWN;
    token.value[0] = '\0';
    int ch = skip_whitespace_and_comments(source);
    if (ch == EOF) {
        token.type = TOKEN_EOF;
        return token;
    }
    if (isalpha(ch) || ch == '_') {
        // Identifier or keyword: letters, digits, and underscores
        int index = 0;
        do {
            if (index < MAX_IDENTIFIER_LENGTH - 1) {
                token.value[index++] = (char)ch;
            }
            ch = fgetc(source);
        } while (isalnum(ch) || ch == '_');
        token.value[index] = '\0';
        if (ch != EOF) {
            ungetc(ch, source); // Push back the first character after the identifier
        }
        token.type = is_keyword(token.value) ? TOKEN_KEYWORD : TOKEN_IDENTIFIER;
    } else if (isdigit(ch)) {
        // Integer literal
        int index = 0;
        do {
            if (index < MAX_TOKEN_LENGTH - 1) {
                token.value[index++] = (char)ch;
            }
            ch = fgetc(source);
        } while (isdigit(ch));
        token.value[index] = '\0';
        if (ch != EOF) {
            ungetc(ch, source); // Push back the first non-digit character
        }
        token.type = TOKEN_NUMBER;
    } else if (strchr("+-*/=<>;,(){}", ch)) {
        // Single-character operators and punctuation
        token.value[0] = (char)ch;
        token.value[1] = '\0';
        token.type = TOKEN_OPERATOR;
    } else {
        token.value[0] = (char)ch;
        token.value[1] = '\0';
        token.type = TOKEN_UNKNOWN;
    }
    return token;
}
void print_token(Token token) {
    switch (token.type) {
        case TOKEN_IDENTIFIER:
            printf("IDENTIFIER: %s\n", token.value);
            break;
        case TOKEN_NUMBER:
            printf("NUMBER: %s\n", token.value);
            break;
        case TOKEN_OPERATOR:
            printf("OPERATOR: %s\n", token.value);
            break;
        case TOKEN_KEYWORD:
            printf("KEYWORD: %s\n", token.value);
            break;
        case TOKEN_EOF:
            printf("EOF\n");
            break;
        case TOKEN_UNKNOWN:
        default:
            printf("UNKNOWN: %s\n", token.value);
            break;
    }
}
int main() {
    FILE *source = fopen("source.txt", "r");
    if (!source) {
        perror("Could not open source file");
        return EXIT_FAILURE;
    }
    Token token;
    do {
        token = get_next_token(source);
        print_token(token);
    } while (token.type != TOKEN_EOF);
    fclose(source);
    return EXIT_SUCCESS;
}
Explanation
- Token Structure: A simple structure to represent tokens.
- Keyword Checking: The is_keyword function checks whether an identifier matches one of the reserved keywords.
- Skipping Whitespace and Comments: The skip_whitespace_and_comments function skips whitespace, single-line (//) comments, and multi-line (/* */) comments, then returns the first character of the next token (or EOF).
- Token Generation: The get_next_token function reads the next token and categorizes it as an identifier, number, operator, keyword, or unknown.
- Main Function: The main function opens a source file, retrieves tokens in a loop, and prints them until the end of the file is reached.
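If you want to test the lexer without creating a file, one option is to open an in-memory stream over a string and run the same token loop. The sketch below assumes a POSIX system (fmemopen is POSIX, not standard C, and is available on Linux/glibc and macOS) and assumes the lexer code above is in the same file; lex_string_demo is just an illustrative helper name, not part of the listing.

/* Hypothetical helper: tokenize a hard-coded string using the lexer above.
   Requires fmemopen, which is POSIX-only. */
void lex_string_demo(void) {
    char program[] = "int x = 10; // a comment\nreturn x + 1;";
    FILE *source = fmemopen(program, sizeof(program) - 1, "r");
    if (!source) {
        perror("fmemopen");
        return;
    }
    Token token;
    do {
        token = get_next_token(source);
        print_token(token);
    } while (token.type != TOKEN_EOF);
    fclose(source);
}

Because the rest of the lexer only ever sees a FILE *, it makes no difference whether the stream comes from fopen or fmemopen.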
Usage
- Save the code in a file, e.g., lexer.c.
- Create a source.txt file containing some code to analyze.
- Compile and run the program:
gcc lexer.c -o lexer
./lexer
Example Input
Suppose you have the following code in source.txt:
// This is a sample program
int main() {
int x = 10; // Variable declaration
if (x > 5) {
return x; /* Return x */
}
return 0;
}
Expected Output
When you run the lexer on this input, the output should look like this:
KEYWORD: int
IDENTIFIER: main
OPERATOR: (
OPERATOR: )
OPERATOR: {
KEYWORD: int
IDENTIFIER: x
OPERATOR: =
NUMBER: 10
OPERATOR: ;
KEYWORD: if
OPERATOR: (
IDENTIFIER: x
OPERATOR: >
NUMBER: 5
OPERATOR: )
OPERATOR: {
KEYWORD: return
IDENTIFIER: x
OPERATOR: ;
OPERATOR: }
KEYWORD: return
NUMBER: 0
OPERATOR: ;
OPERATOR: }
EOF
Explanation of the Output
- Each line represents a token recognized by the lexer.
- Tokens are categorized as KEYWORD, IDENTIFIER, NUMBER, or OPERATOR.
- Comments and whitespace are ignored, meaning you won’t see any tokens for them in the output.
- The lexer correctly identifies keywords like int and return, identifiers like main and x, and operators like = and >.
This output provides a clear view of the different components of the source code, which is the primary purpose of a lexical analyzer. You can modify the source.txt file with different code snippets to see how the lexer behaves with various inputs!
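If you want to experiment further, one natural extension is recognizing two-character operators such as ==, !=, <= and >=. The helper below is only a sketch of that idea (read_operator is a hypothetical name, not part of the listing above): after reading =, <, > or !, it peeks one character ahead and pushes the character back with ungetc if it does not complete a two-character operator. You would call it from get_next_token in place of the single-character operator branch and add ! to the set of characters treated as operator starts.

/* Sketch: build an operator token, allowing ==, !=, <= and >=.
   ch is the operator character that get_next_token has already read. */
void read_operator(FILE *source, int ch, Token *token) {
    token->value[0] = (char)ch;
    token->value[1] = '\0';
    if (strchr("=<>!", ch)) {
        int next = fgetc(source);      /* look one character ahead */
        if (next == '=') {             /* forms ==, <=, >= or != */
            token->value[1] = '=';
            token->value[2] = '\0';
        } else if (next != EOF) {
            ungetc(next, source);      /* not a two-character operator */
        }
    }
    token->type = TOKEN_OPERATOR;
}

The single character of pushback used here is the same look-ahead technique the identifier and number branches already rely on, so it fits the existing design without any extra buffering.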