Compilers and interpreters are, in essence, the same kind of program. Both have to make sense of your code, execute its instructions, and find ways to detect errors.
The first step : Lexical Analysis or Lexical Scanning
Lexical analysis or scanning is the process whereby the written code is analysed character by character, line by line.
This process is straightforward: you are mapping everything, i.e. recording what each character is and where it is.
Let us say you have the code below :
x = 3+1
output “x :”+3
The lexical analysis will return each character along with its position, and so on. With some brushing up, it returns:
line 1 column 1 : x
line 1 column 2 : //or you can make it write SPACE
line 1 column 3 : =
line 1 column 4 : //or you can make it write SPACE
line 1 column 5 : 3
line 1 column 6 : +
line 1 column 7 : 1
line 1 column 8 : \n //or you can make it write NEWLINE
and so on (yes, space and newline are characters).
i.e. you know exactly where everything is. There are two reasons for that:
1. To recognise expressions, i.e. to know what is in the code
2. To find errors
You may have encountered a return like :
error line 3 string has no attr…
How did it know that the error was on line three? Well, while trying to make sense of your code it could not, so it returned the location of the error. It knows the location because it scanned it.
You need 3 variables: position, line and column.
Using a loop, you increase position with each character.
Each time there is a newline character, the line variable increases by one.
The column variable increases with position but resets at the start of every new line.
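As a rough sketch (the function and variable names here are my own), the scanning loop above might look like this in Python:

```python
def scan(source):
    """Walk the source char by char, recording (line, column, char)."""
    position, line, column = 0, 1, 1
    chars = []
    for ch in source:
        chars.append((line, column, ch))
        position += 1
        if ch == "\n":
            line += 1    # newline: bump the line counter...
            column = 1   # ...and reset the column
        else:
            column += 1
    return chars

for line, column, ch in scan('x = 3+1\noutput "x :"+3'):
    label = {" ": "SPACE", "\n": "NEWLINE"}.get(ch, ch)
    print(f"line {line} column {column} : {label}")
```

Running it on the example code prints the same kind of listing as above, one entry per character.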
In the coming posts, we'll insha'Allah build lexical analysers.
The second step : Tokenisation
Now that you have exactly what is in the file, it is time to identify the pieces of the code: you classify the elements of the code.
The reason it is called tokenisation is that we now break the code into parts, or tokens (think of the tokens used in board games).
We can have numbers: a number token can look like NUM 3.
An addition operator might be written as
ADD + , or we might just leave it as + since it is unique.
If there is something like 1+2, we can brand it as EXPRESSION 1+2.
For strings: STRING i am here.
Of course, we must first have defined how our code identifies the tokens, so that when our expression comes in, we get a second output like this:
source code : x=1+3
tokenised code : VARIABLE x = EXPRESSION NUM 1 + NUM 3
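Here is a rough sketch of such a tokeniser in Python. The token names follow the example above; the regular-expression patterns and the function name are my own assumptions:

```python
import re

# Each token kind is a named regex group; order matters (NUM is tried first).
TOKEN_SPEC = [
    ("NUM",      r"\d+"),
    ("VARIABLE", r"[a-zA-Z_]\w*"),
    ("ASSIGN",   r"="),
    ("OP",       r"[+\-*/]"),
    ("SKIP",     r"\s+"),   # whitespace: scanned but not kept
]
PATTERN = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenise(code):
    """Break the source into (kind, text) pairs, dropping whitespace."""
    tokens = []
    for match in PATTERN.finditer(code):
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
    return tokens

print(tokenise("x=1+3"))
# [('VARIABLE', 'x'), ('ASSIGN', '='), ('NUM', '1'), ('OP', '+'), ('NUM', '3')]
```

The output is just the tokenised form of `x=1+3` from the example, as data instead of text.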
Tokenisation is thus just an intermediary form, preparing the code to be fed to what is called the parser. The parser does the real job, but the tokenisation step makes the code easier to process.
The third step : Parsing
In this step the tokens are evaluated. The parser checks the validity of the code:
It begins a scan of the tokens:
"x is a variable, then we have the assignment symbol, then we have an expression. Ok, we have a number, then the addition symbol, the next one is an integer. Good, I have to add them and assign the result to x."
Now, if the parser encounters something invalid, like a string being added to an integer, it stops scanning immediately and raises an error.
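A minimal sketch of this scan in Python, working on the (kind, text) token pairs from the previous step (the expected-sequence rule and error message are my own simplifications):

```python
def parse_assignment(tokens):
    """Expect: VARIABLE ASSIGN NUM (OP NUM)* — raise on anything else."""
    def expect(i, kind):
        if i >= len(tokens) or tokens[i][0] != kind:
            found = tokens[i][1] if i < len(tokens) else "end of input"
            raise SyntaxError(f"expected {kind}, got {found!r} at token {i}")
        return i + 1

    i = expect(0, "VARIABLE")
    i = expect(i, "ASSIGN")
    i = expect(i, "NUM")        # the expression must start with a number
    while i < len(tokens):
        i = expect(i, "OP")     # then alternate operator / number
        i = expect(i, "NUM")
    return True

tokens = [("VARIABLE", "x"), ("ASSIGN", "="),
          ("NUM", "1"), ("OP", "+"), ("NUM", "3")]
print(parse_assignment(tokens))  # True: the token stream is valid
```

Feed it a stream that mixes a string into the arithmetic and `expect` raises a SyntaxError at the offending token instead of returning, which is exactly the "stop scanning and pull up an error" behaviour described above.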
The fourth step : Precedence of expressions
Now, after everything is verified and in place, our code can safely be executed. But first we need to define precedence, else our execution flow will be left confused. There are two kinds of precedence to take care of:
1. Precedence in the face of similarity
Let us say that we have 3+2+1.
Our execution cannot evaluate everything at once; it can only evaluate two operands at a time. So we build in something called left precedence, where we begin from the left: evaluating 3+2 first, then 5+1. Putting it another way: (3+2)+1.
2. Operator precedence
Say we have 3+2*6.
What do we evaluate first, 3+2 or 2*6?
That's why a definite rule was laid down: multiplication (and division) are evaluated before addition (and subtraction), so 3+2*6 = 3+(2*6) = 15.
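Both rules can be sketched in a few lines of Python. The flat list-of-tokens representation and the two-pass approach are my own simplification:

```python
def evaluate(tokens):
    """tokens: alternating numbers and operators, e.g. [3, "+", 2, "*", 6]."""
    # Pass 1: collapse * and / first (operator precedence)
    out = [tokens[0]]
    for op, num in zip(tokens[1::2], tokens[2::2]):
        if op == "*":
            out[-1] *= num
        elif op == "/":
            out[-1] /= num
        else:
            out += [op, num]
    # Pass 2: apply + and - left to right (left precedence)
    result = out[0]
    for op, num in zip(out[1::2], out[2::2]):
        result = result + num if op == "+" else result - num
    return result

print(evaluate([3, "+", 2, "+", 1]))  # (3+2)+1 = 6
print(evaluate([3, "+", 2, "*", 6]))  # 3+(2*6) = 15
```

The first pass enforces operator precedence, the second enforces left precedence; real parsers fold both rules into the grammar instead of making two passes.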
About syntax trees
We can represent precedence of expressions in tree-like form
For 1 + 2 + 3, i.e. (1+2)+3:

      +
     / \
    +   3
   / \
  1   2

The deepest nodes are evaluated first: 1+2, then its result plus 3.
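The same tree can be sketched in code; nested tuples are my own lightweight stand-in for real tree node classes:

```python
# The tree for (1+2)+3: the left child is itself a subtree, so it is
# evaluated first, matching left precedence.
tree = ("+", ("+", 1, 2), 3)

def eval_tree(node):
    """Evaluate a tree bottom-up: children first, then the operator."""
    if isinstance(node, int):
        return node
    op, left, right = node
    l, r = eval_tree(left), eval_tree(right)
    return l + r if op == "+" else l * r

print(eval_tree(tree))  # 6
```

Precedence is baked into the shape of the tree: once the tree is built, evaluation no longer needs any precedence rules at all.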
The fifth step : Code Execution
Now we can execute our code. For an interpreter, the code is immediately executed, since part of the job is already done.
A compiler has to go further down until the code runs on the operating system, often translating first into assembly code, since many assembly-to-OS tools already exist.
The whole process above can be optimised. Take for example an interpreter that creates tokens for whitespace: an optimisation might be to ignore whitespace entirely. Years are spent on optimisations like that!