Lexical Analyzer C Program
Designing a lexical analyzer (lexer) in C involves creating a program that reads source code and identifies tokens while skipping whitespace and comments and keeping identifiers to a reasonable length. Here’s a simplified version of such a lexer.
Lexical Analyzer in C
- Token Types: Define the types of tokens you will recognize (e.g., keywords, identifiers, numbers, operators).
- Ignore Whitespace and Comments: Handle spaces, tabs, newlines, and comments.
- Identifier Length Limit: Set a maximum length for identifiers.
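For example, given the input line int count = 42;, a lexer that follows these rules would emit the tokens KEYWORD(int), IDENTIFIER(count), OPERATOR(=), NUMBER(42), and OPERATOR(;).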
Example Code
Below is a basic implementation of a lexical analyzer in C.
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <string.h>
#define MAX_IDENTIFIER_LENGTH 32
#define MAX_TOKEN_LENGTH 64
typedef enum {
    TOKEN_IDENTIFIER,
    TOKEN_NUMBER,
    TOKEN_OPERATOR,
    TOKEN_KEYWORD,
    TOKEN_EOF,
    TOKEN_UNKNOWN
} TokenType;

typedef struct {
    TokenType type;
    char value[MAX_TOKEN_LENGTH];
} Token;
const char *keywords[] = {"if", "else", "int", "while", "return", NULL};
int is_keyword(const char *identifier) {
    for (int i = 0; keywords[i] != NULL; i++) {
        if (strcmp(identifier, keywords[i]) == 0) {
            return 1;
        }
    }
    return 0;
}
// Skip whitespace and comments; return the first character of the next
// token, or EOF if the input is exhausted.
int skip_whitespace_and_comments(FILE *source) {
    int ch;
    while ((ch = fgetc(source)) != EOF) {
        if (isspace(ch)) {
            continue;
        }
        if (ch == '/') {
            int next = fgetc(source);
            if (next == '/') {
                // Skip single-line comment up to the end of the line
                while ((ch = fgetc(source)) != '\n' && ch != EOF)
                    ;
            } else if (next == '*') {
                // Skip multi-line comment up to the closing "*/"
                int prev = 0;
                while ((ch = fgetc(source)) != EOF) {
                    if (prev == '*' && ch == '/') {
                        break;
                    }
                    prev = ch;
                }
            } else {
                // A lone '/' is an operator, not a comment
                if (next != EOF) {
                    ungetc(next, source);
                }
                return ch;
            }
        } else {
            return ch;
        }
    }
    return EOF;
}
Token get_next_token(FILE *source) {
    Token token;
    token.type = TOKEN_UNKNOWN;
    token.value[0] = '\0';
    int ch = skip_whitespace_and_comments(source);
    if (ch == EOF) {
        token.type = TOKEN_EOF;
        return token;
    }
    if (isalpha(ch) || ch == '_') {
        // Identifier or keyword: letters, digits, and underscores
        int index = 0;
        do {
            if (index < MAX_IDENTIFIER_LENGTH - 1) {
                token.value[index++] = (char)ch;
            }
            ch = fgetc(source);
        } while (isalnum(ch) || ch == '_');
        token.value[index] = '\0';
        if (ch != EOF) {
            ungetc(ch, source); // Push back the first character after the identifier
        }
        token.type = is_keyword(token.value) ? TOKEN_KEYWORD : TOKEN_IDENTIFIER;
    } else if (isdigit(ch)) {
        // Integer literal
        int index = 0;
        do {
            if (index < MAX_TOKEN_LENGTH - 1) {
                token.value[index++] = (char)ch;
            }
            ch = fgetc(source);
        } while (isdigit(ch));
        token.value[index] = '\0';
        if (ch != EOF) {
            ungetc(ch, source); // Push back the first non-digit character
        }
        token.type = TOKEN_NUMBER;
    } else if (strchr("+-*/=<>;,(){}", ch)) {
        // Single-character operators and punctuation
        token.value[0] = (char)ch;
        token.value[1] = '\0';
        token.type = TOKEN_OPERATOR;
    } else {
        token.value[0] = (char)ch;
        token.value[1] = '\0';
        token.type = TOKEN_UNKNOWN;
    }
    return token;
}
void print_token(Token token) {
    switch (token.type) {
        case TOKEN_IDENTIFIER:
            printf("IDENTIFIER: %s\n", token.value);
            break;
        case TOKEN_NUMBER:
            printf("NUMBER: %s\n", token.value);
            break;
        case TOKEN_OPERATOR:
            printf("OPERATOR: %s\n", token.value);
            break;
        case TOKEN_KEYWORD:
            printf("KEYWORD: %s\n", token.value);
            break;
        case TOKEN_EOF:
            printf("EOF\n");
            break;
        case TOKEN_UNKNOWN:
        default:
            printf("UNKNOWN: %s\n", token.value);
            break;
    }
}
int main() {
    FILE *source = fopen("source.txt", "r");
    if (!source) {
        perror("Could not open source file");
        return EXIT_FAILURE;
    }
    Token token;
    do {
        token = get_next_token(source);
        print_token(token);
    } while (token.type != TOKEN_EOF);
    fclose(source);
    return EXIT_SUCCESS;
}
Explanation
- Token Structure: A simple structure to represent tokens.
- Keyword Checking: The is_keyword function checks whether an identifier matches one of the reserved keywords.
- Skipping Whitespace and Comments: The skip_whitespace_and_comments function skips whitespace, single-line (//) comments, and multi-line (/* */) comments, then returns the first character of the next token (or EOF).
- Token Generation: The get_next_token function reads the next token and categorizes it as an identifier, number, operator, keyword, or unknown.
- Main Function: The main function opens a source file, retrieves tokens in a loop, and prints them until the end of the file is reached.
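If you want to test the lexer without creating a file, one option is to open an in-memory stream over a string and run the same token loop. The sketch below assumes a POSIX system (fmemopen is POSIX, not standard C, and is available on Linux/glibc and macOS) and assumes the lexer code above is in the same file; lex_string_demo is just an illustrative helper name, not part of the listing.

/* Hypothetical helper: tokenize a hard-coded string using the lexer above.
   Requires fmemopen, which is POSIX-only. */
void lex_string_demo(void) {
    char program[] = "int x = 10; // a comment\nreturn x + 1;";
    FILE *source = fmemopen(program, sizeof(program) - 1, "r");
    if (!source) {
        perror("fmemopen");
        return;
    }
    Token token;
    do {
        token = get_next_token(source);
        print_token(token);
    } while (token.type != TOKEN_EOF);
    fclose(source);
}

Because the rest of the lexer only ever sees a FILE *, it makes no difference whether the stream comes from fopen or fmemopen.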
Usage
- Save the code in a file, e.g., lexer.c.
- Create a source.txt file containing some code to analyze.
- Compile and run the program:
gcc lexer.c -o lexer
./lexer
Example Input
Suppose you have the following code in source.txt:
// This is a sample program
int main() {
int x = 10; // Variable declaration
if (x > 5) {
return x; /* Return x */
}
return 0;
}
Expected Output
When you run the lexer on this input, the output should look like this:
KEYWORD: int
IDENTIFIER: main
OPERATOR: (
OPERATOR: )
OPERATOR: {
KEYWORD: int
IDENTIFIER: x
OPERATOR: =
NUMBER: 10
OPERATOR: ;
KEYWORD: if
OPERATOR: (
IDENTIFIER: x
OPERATOR: >
NUMBER: 5
OPERATOR: )
OPERATOR: {
KEYWORD: return
IDENTIFIER: x
OPERATOR: ;
OPERATOR: }
KEYWORD: return
NUMBER: 0
OPERATOR: ;
OPERATOR: }
EOF
Explanation of the Output
- Each line represents a token recognized by the lexer.
- Tokens are categorized as KEYWORD, IDENTIFIER, NUMBER, or OPERATOR.
- Comments and whitespace are ignored, meaning you won’t see any tokens for them in the output.
- The lexer correctly identifies keywords like int and return, identifiers like main and x, and operators like = and >.
This output provides a clear view of the different components of the source code, which is the primary purpose of a lexical analyzer. You can modify the source.txt file with different code snippets to see how the lexer behaves with various inputs!
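If you want to experiment further, one natural extension is recognizing two-character operators such as ==, !=, <= and >=. The helper below is only a sketch of that idea (read_operator is a hypothetical name, not part of the listing above): after reading =, <, > or !, it peeks one character ahead and pushes the character back with ungetc if it does not complete a two-character operator. You would call it from get_next_token in place of the single-character operator branch and add ! to the set of characters treated as operator starts.

/* Sketch: build an operator token, allowing ==, !=, <= and >=.
   ch is the operator character that get_next_token has already read. */
void read_operator(FILE *source, int ch, Token *token) {
    token->value[0] = (char)ch;
    token->value[1] = '\0';
    if (strchr("=<>!", ch)) {
        int next = fgetc(source);      /* look one character ahead */
        if (next == '=') {             /* forms ==, <=, >= or != */
            token->value[1] = '=';
            token->value[2] = '\0';
        } else if (next != EOF) {
            ungetc(next, source);      /* not a two-character operator */
        }
    }
    token->type = TOKEN_OPERATOR;
}

The single character of pushback used here is the same look-ahead technique the identifier and number branches already rely on, so it fits the existing design without any extra buffering.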