COMPILER DESIGN (CST-012) UTU

Lexical Analyzer C Program

Designing a lexical analyzer (lexer) in C means writing a program that reads source code and splits it into tokens, while skipping whitespace and comments and capping the length of identifiers. A simplified version of such a lexer is developed below.

Lexical Analyzer in C

  1. Token Types: Define the types of tokens you will recognize (e.g., keywords, identifiers, numbers, operators).
  2. Ignore Whitespace and Comments: Handle spaces, tabs, newlines, and comments.
  3. Identifier Length Limit: Set a maximum length for identifiers.
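The identifier length limit in step 3 can be sketched in isolation. The helper below, copy_identifier, is a hypothetical name used only for illustration and is not part of the lexer listing. It assumes the input is an in-memory string rather than a FILE, and it assumes identifiers may start with a letter or an underscore. It keeps at most cap - 1 characters but still consumes the rest of an over-long identifier, so the excess characters do not become a separate token.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (illustration only): copies the identifier at the
   start of src into out, keeping at most cap - 1 characters.
   Returns how many characters of src belong to the identifier, including
   any that were dropped because of the limit. Assumes cap >= 1. */
size_t copy_identifier(const char *src, char *out, size_t cap) {
    size_t consumed = 0; /* characters read from src */
    size_t stored = 0;   /* characters actually kept in out */

    /* Assumption: an identifier starts with a letter or an underscore. */
    if (!(isalpha((unsigned char)src[0]) || src[0] == '_')) {
        out[0] = '\0';
        return 0;
    }
    while (isalnum((unsigned char)src[consumed]) || src[consumed] == '_') {
        if (stored < cap - 1) {
            out[stored++] = src[consumed]; /* keep while there is room */
        }
        consumed++; /* always advance, even when the character is dropped */
    }
    out[stored] = '\0';
    return consumed;
}
```

With a 5-byte buffer, the input "counter_1 = 0" would be stored as "coun" while all nine identifier characters are consumed.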

Example Code

Below is a basic implementation of a lexical analyzer in C.

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <string.h>

#define MAX_IDENTIFIER_LENGTH 32
#define MAX_TOKEN_LENGTH 64

typedef enum {
    TOKEN_IDENTIFIER,
    TOKEN_NUMBER,
    TOKEN_OPERATOR,
    TOKEN_KEYWORD,
    TOKEN_EOF,
    TOKEN_UNKNOWN
} TokenType;

typedef struct {
    TokenType type;
    char value[MAX_TOKEN_LENGTH];
} Token;

const char *keywords[] = {"if", "else", "while", "return", NULL};

int is_keyword(const char *identifier) {
    for (int i = 0; keywords[i] != NULL; i++) {
        if (strcmp(identifier, keywords[i]) == 0) {
            return 1;
        }
    }
    return 0;
}

// Skips whitespace and comments, then returns the first character of the
// next token (or EOF). Returning the character means at most one character
// is ever pushed back, which is all that ungetc guarantees.
int skip_whitespace_and_comments(FILE *source) {
    int ch; // int, not char, so that EOF can be told apart from real data
    while ((ch = fgetc(source)) != EOF) {
        if (isspace(ch)) {
            continue;
        }
        if (ch != '/') {
            return ch;
        }
        int next = fgetc(source);
        if (next == '/') {
            // Skip single-line comment
            while ((ch = fgetc(source)) != '\n' && ch != EOF);
        } else if (next == '*') {
            // Skip multi-line comment; an unterminated comment ends at EOF
            int prev = 0;
            while ((ch = fgetc(source)) != EOF) {
                if (prev == '*' && ch == '/') {
                    break;
                }
                prev = ch;
            }
        } else {
            // A lone '/' is the division operator, not a comment
            if (next != EOF) {
                ungetc(next, source);
            }
            return '/';
        }
    }
    return EOF;
}

Token get_next_token(FILE *source) {
    Token token;
    token.type = TOKEN_UNKNOWN;
    token.value[0] = '\0';

    int ch = skip_whitespace_and_comments(source);
    if (ch == EOF) {
        token.type = TOKEN_EOF;
        return token;
    }

    if (isalpha(ch) || ch == '_') {
        int index = 0;
        do {
            // Characters beyond the length limit are read but discarded
            if (index < MAX_IDENTIFIER_LENGTH - 1) {
                token.value[index++] = (char)ch;
            }
            ch = fgetc(source);
        } while (isalnum(ch) || ch == '_');
        token.value[index] = '\0';
        if (ch != EOF) {
            ungetc(ch, source); // Push back the character that ended the token
        }
        token.type = is_keyword(token.value) ? TOKEN_KEYWORD : TOKEN_IDENTIFIER;
    } else if (isdigit(ch)) {
        int index = 0;
        do {
            if (index < MAX_TOKEN_LENGTH - 1) {
                token.value[index++] = (char)ch;
            }
            ch = fgetc(source);
        } while (isdigit(ch));
        token.value[index] = '\0';
        if (ch != EOF) {
            ungetc(ch, source); // Push back the character that ended the token
        }
        token.type = TOKEN_NUMBER;
    } else if (strchr("+-*/=", ch)) {
        // Single-character token: nothing to push back
        token.value[0] = (char)ch;
        token.value[1] = '\0';
        token.type = TOKEN_OPERATOR;
    } else {
        token.value[0] = (char)ch;
        token.value[1] = '\0';
        token.type = TOKEN_UNKNOWN;
    }

    return token;
}

void print_token(Token token) {
    switch (token.type) {
        case TOKEN_IDENTIFIER:
            printf("IDENTIFIER: %s\n", token.value);
            break;
        case TOKEN_NUMBER:
            printf("NUMBER: %s\n", token.value);
            break;
        case TOKEN_OPERATOR:
            printf("OPERATOR: %s\n", token.value);
            break;
        case TOKEN_KEYWORD:
            printf("KEYWORD: %s\n", token.value);
            break;
        case TOKEN_EOF:
            printf("EOF\n");
            break;
        case TOKEN_UNKNOWN:
        default:
            printf("UNKNOWN: %s\n", token.value);
            break;
    }
}

int main() {
    FILE *source = fopen("source.txt", "r");
    if (!source) {
        perror("Could not open source file");
        return EXIT_FAILURE;
    }

    Token token;
    do {
        token = get_next_token(source);
        print_token(token);
    } while (token.type != TOKEN_EOF);

    fclose(source);
    return EXIT_SUCCESS;
}

Explanation

  • Token Structure: A simple structure to represent tokens.
  • Keyword Checking: The is_keyword function checks if an identifier is a keyword.
  • Skipping Whitespace and Comments: The skip_whitespace_and_comments function handles whitespace, single-line (//) and multi-line (/* */) comments.
  • Token Generation: The get_next_token function reads the next token, categorizing it into identifiers, numbers, operators, keywords, or unknown.
  • Main Function: The main function opens a source file, retrieves tokens in a loop, and prints them until the end of the file is reached.
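The comment-skipping logic described above can also be tried on in-memory strings, which makes it easy to test without a file. The following is a minimal sketch under that assumption; skip_blank_and_comments is a hypothetical name, not a function from the listing.

```c
#include <ctype.h>

/* Hypothetical helper (illustration only): returns a pointer to the first
   character in p that is neither whitespace nor part of a comment.
   An unterminated block comment is skipped up to the end of the string. */
const char *skip_blank_and_comments(const char *p) {
    for (;;) {
        if (isspace((unsigned char)*p)) {
            p++;
        } else if (p[0] == '/' && p[1] == '/') {
            /* Single-line comment: runs to the end of the line. */
            while (*p != '\0' && *p != '\n') {
                p++;
            }
        } else if (p[0] == '/' && p[1] == '*') {
            /* Block comment: runs to the closing star-slash pair. */
            p += 2;
            while (*p != '\0' && !(p[0] == '*' && p[1] == '/')) {
                p++;
            }
            if (*p != '\0') {
                p += 2; /* step over the closing delimiter */
            }
        } else {
            return p; /* also reached for a lone '/', which is an operator */
        }
    }
}
```

Because the string version can look at p[1] without consuming it, a lone '/' needs no push-back. A FILE-based lexer has to read the second character with fgetc and return it with ungetc when it does not start a comment.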

Usage

  1. Save the code in a file, e.g., lexer.c.
  2. Create a source.txt file containing some code to analyze.
  3. Compile and run the program:
gcc lexer.c -o lexer
./lexer

Example Input

Suppose you have the following code in source.txt:

// This is a sample program
int main() {
    int x = 10; // Variable declaration
    if (x > 5) {
        return x; /* Return x */
    }
    return 0;
}

Expected Output

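When you run the lexer on this input, the output looks like this. Because int is not in the keywords array it is reported as an identifier, and characters outside the operator set + - * / = are reported as UNKNOWN:

IDENTIFIER: int
IDENTIFIER: main
UNKNOWN: (
UNKNOWN: )
UNKNOWN: {
IDENTIFIER: int
IDENTIFIER: x
OPERATOR: =
NUMBER: 10
UNKNOWN: ;
KEYWORD: if
UNKNOWN: (
IDENTIFIER: x
UNKNOWN: >
NUMBER: 5
UNKNOWN: )
UNKNOWN: {
KEYWORD: return
IDENTIFIER: x
UNKNOWN: ;
UNKNOWN: }
KEYWORD: return
NUMBER: 0
UNKNOWN: ;
UNKNOWN: }
EOF

Explanation of the Output

  • Each line represents a token recognized by the lexer.
  • Tokens are categorized as KEYWORD, IDENTIFIER, NUMBER, OPERATOR, or UNKNOWN; the final EOF line marks the end of the input.
  • Comments and whitespace are ignored, so no tokens appear for them in the output.
  • Only if, else, while, and return are listed in the keywords array, so int is reported as an identifier alongside main and x.
  • Only the characters + - * / = are treated as operators, so parentheses, braces, semicolons, and > are reported as UNKNOWN. Adding entries to the keywords array or to the operator string changes how these are classified.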

This output provides a clear view of the different components of the source code, which is the primary purpose of a lexical analyzer. You can modify the source.txt file with different code snippets to see how the lexer behaves with various inputs!
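One way to experiment is to widen the operator set. The listing treats only + - * / = as operators, so a character such as > is not classified as one by the code. The sketch below shows how one character of lookahead could recognize comparison operators such as >= and ==. The name match_operator and the operator table are assumptions chosen for illustration, and the helper works on an in-memory string rather than a FILE.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (illustration only): if an operator starts at p,
   copies it into out (which must hold at least 3 bytes) and returns its
   length, 1 or 2. Returns 0 when p does not start with a known operator. */
size_t match_operator(const char *p, char *out) {
    /* Assumed operator set; extend as needed. */
    static const char *two_char[] = {"==", "!=", "<=", ">=", NULL};

    /* Try the longer operators first so ">=" is not split into '>' and '='. */
    for (size_t i = 0; two_char[i] != NULL; i++) {
        if (strncmp(p, two_char[i], 2) == 0) {
            memcpy(out, two_char[i], 3); /* two characters plus terminator */
            return 2;
        }
    }
    if (*p != '\0' && strchr("+-*/=<>", *p) != NULL) {
        out[0] = *p;
        out[1] = '\0';
        return 1;
    }
    out[0] = '\0';
    return 0;
}
```

In a FILE-based lexer the same longest-match idea would mean reading one extra character with fgetc after an operator character and returning it with ungetc when it does not complete a two-character operator.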

Exercise Files
Compiler Practical 1.pdf