Compilers and interpreters are, in essence, the same thing. Both have to make sense of your code, get its instructions executed and find ways to detect errors.

The first step : Lexical Analysis or Lexical Scanning

Lexical analysis, or scanning, is the process whereby the written code is read character by character, line by line.

This process is very easy. You are mapping everything, i.e. recording what each character is and where it sits.

EXAMPLE

Let us say you have the code below :

x = 3+1

output “x :”+3

The lexical analysis will return

x

 

=

 

3

+

1

 

o

u

t

and so on ….

with some brushing up it returns

line 1 column 1 : x

line 1 column 2 :  //or you can make it write SPACE

line 1 column 3 : =

line 1 column 4 :   //or you can make it write SPACE

line 1 column 5 : 3

line 1 column 6 : +

line 1 column 7 : 1

line 1 column 8 : \n            //or you can make it write NEWLINE

and so on. (yes space and new line are characters)

In other words, you know exactly where everything is. There are two reasons for doing that :

1.. To recognise expressions

To know what is there in the code

2.. To find errors

You may have encountered a return like :

error line 3 string has no attr…

How did it know that the error was on line three? While trying to make sense of your code it hit something it could not understand, so it returned the location of the problem. It knows the location because it recorded it while scanning.

You need 3 variables :

position

column

line

Using a loop you’ll increase position

Each time there is a new line character, the line variable increases by one.

The column variable increases with position but returns to 0 each time there is a new line character.
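The three variables above can be sketched in a few lines of Python. This is only a minimal illustration, not a full lexical analyser: the function name scan is made up for this post, and lines and columns are counted from 1 (an assumption, chosen to match the "line 1 column 1 : x" style of the example).

```python
def scan(source):
    """Return a (line, column, character) triple for every character in source."""
    position = 0   # index into the whole source string
    line = 1       # assumption: lines are counted from 1
    column = 1     # assumption: columns are counted from 1
    result = []
    while position < len(source):
        char = source[position]
        result.append((line, column, char))
        if char == "\n":
            # a new line character: next line, column starts over
            line += 1
            column = 1
        else:
            column += 1
        position += 1
    return result


for line, column, char in scan('x = 3+1\noutput "x :"+3'):
    # {char!r} shows spaces as ' ' and new lines as '\n'
    print(f"line {line} column {column} : {char!r}")
```

Because every character carries its line and column, a later step can report exactly where an error sits.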

In the coming posts we’ll insha allah build lexical analysers

The second step : Tokenisation

Now that you have exactly what is in the file, it is time that you identify the pieces of the code. You now classify the elements of the code.

The reason why it is called tokenisation is that now we break the code into parts or tokens (remember playing tokens).

We can have numbers. A number token can look like this

NUM 1

An addition operator might look like

ADD + or we might just leave it as + as it is unique

If we meet something like 1+2, we can brand it as EXPRESSION 1+2

For strings, STRING i am here

Of course, we must first have defined in our code how to identify the tokens, so that when our expression comes in we get a second output like this

source code : x=1+3

tokenised code : VARIABLE x = EXPRESSION NUM 1 + NUM 3

Tokenisation thus just produces an intermediate form of the code, ready to be fed to what is called the parser. The parser does the real job, but the tokenisation step makes its input easier to process.
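Here is one possible way to sketch a tokeniser in Python, using regular expressions. The names NUM, STRING, VARIABLE and ADD follow the examples above. ASSIGN and SPACE, as well as the function name tokenise, are hypothetical additions for this sketch. It only understands straight double quotes, and it leaves the EXPRESSION grouping to the parser.

```python
import re

# Hypothetical token definitions: (token name, regular expression).
TOKEN_SPEC = [
    ("NUM", r"\d+"),
    ("STRING", r'"[^"\n]*"'),
    ("VARIABLE", r"[A-Za-z_]\w*"),
    ("ADD", r"\+"),
    ("ASSIGN", r"="),
    ("SPACE", r"[ \t]+"),
]
# One big pattern with a named group per token type.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenise(source):
    """Turn a line of source code into a list of (kind, text) tokens."""
    tokens = []
    position = 0
    while position < len(source):
        match = MASTER.match(source, position)
        if match is None:
            raise SyntaxError(
                f"unexpected character {source[position]!r} at position {position}"
            )
        kind = match.lastgroup  # name of the group that matched
        text = match.group()
        if kind == "STRING":
            text = text[1:-1]  # drop the surrounding quotes
        if kind != "SPACE":    # spaces are recognised but not kept
            tokens.append((kind, text))
        position = match.end()
    return tokens


print(tokenise("x=1+3"))
```

The order of TOKEN_SPEC matters when two patterns could match at the same place, since the first alternative that matches wins.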

The third step : Parsing

In this step the tokens are evaluated. The parser checks the validity of the code, reasoning like this :

it begins a scan of the tokens

So x is a variable, then we have the assignment symbol, then an expression. OK, we have a number, then the addition symbol, and the next one is an integer. Good, I have to add them and assign the result to x.

Now, if the parser encounters something like a string being added to an integer, it stops scanning immediately and raises an error.
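The walk through the tokens can be sketched as below. This is a toy that only understands one statement shape, VARIABLE = operand + operand + ..., and the function names and the ASSIGN token name are hypothetical. The tokens are written by hand so that the example stands on its own.

```python
def read_operand(token):
    """Return (kind, value) for a NUM or STRING token, else complain."""
    kind, text = token
    if kind == "NUM":
        return kind, int(text)
    if kind == "STRING":
        return kind, text
    raise SyntaxError(f"expected a number or a string, got {text!r}")


def parse_assignment(tokens):
    """Check the tokens of one assignment and return (variable name, value)."""
    if len(tokens) < 3 or tokens[0][0] != "VARIABLE" or tokens[1][0] != "ASSIGN":
        raise SyntaxError("expected: VARIABLE = expression")
    name = tokens[0][1]
    result_kind, value = read_operand(tokens[2])
    index = 3
    while index < len(tokens):
        if tokens[index][0] != "ADD":
            raise SyntaxError(f"expected '+', got {tokens[index][1]!r}")
        if index + 1 >= len(tokens):
            raise SyntaxError("expression ends with '+'")
        kind, operand = read_operand(tokens[index + 1])
        if kind != result_kind:
            # e.g. a string being added to an integer: stop immediately
            raise TypeError(f"cannot add {kind} to {result_kind}")
        value = value + operand
        index += 2
    return name, value


good = [("VARIABLE", "x"), ("ASSIGN", "="), ("NUM", "1"), ("ADD", "+"), ("NUM", "3")]
print(parse_assignment(good))

bad = [("VARIABLE", "x"), ("ASSIGN", "="), ("STRING", "x :"), ("ADD", "+"), ("NUM", "3")]
try:
    parse_assignment(bad)
except TypeError as error:
    print("error :", error)
```

A real parser would usually build a tree first and evaluate later. Doing both in one pass just keeps the sketch short.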

The fourth step : Precedence of expressions

Now, after everything is verified and in place, our code can safely be executed. But first we need to define precedence, otherwise our execution flow will be left confused. There are two kinds of precedence to take care of :

1.. Precedence in the face of similarity

Let us say that we have 3+2+1

So, our execution cannot evaluate everything at once; it can only evaluate two operands at a time. We therefore build in something called left precedence (more commonly known as left associativity), where we begin from the left, i.e. evaluating 3+2 then 5+1, or putting it another way : (3+2)+1

2.. Operator precedence

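Say we have 3+2*6

What do we evaluate first? 3+2 or 2*6 ?

That’s why a definite rule was laid down saying we’ll evaluate in this order :

Parentheses, then

Exponentiation

Division

Multiplication

Addition

Subtraction

In most languages division and multiplication actually share one level, as do addition and subtraction, and ties within a level are resolved from the left.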
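Both kinds of precedence can be handled by one small routine. The sketch below uses a technique known as precedence climbing. The numbers in the PRECEDENCE table, the use of ^ for exponentiation and treating ^ as right-associative are assumptions made for this example, and there is no error handling.

```python
import operator
import re

# Assumed binding strengths: a higher number is evaluated first.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2, "^": 3}
RIGHT_ASSOCIATIVE = {"^"}  # assumption: 2^3^2 means 2^(3^2)
APPLY = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
    "^": operator.pow,
}


def evaluate(text):
    """Evaluate an arithmetic expression made of integers, + - * / ^ and ( )."""
    tokens = re.findall(r"\d+|[-+*/^()]", text)
    position = 0

    def primary():
        # a number, or a whole expression wrapped in parentheses
        nonlocal position
        token = tokens[position]
        position += 1
        if token == "(":
            value = expression(1)
            position += 1  # skip the closing ")"
            return value
        return int(token)

    def expression(min_precedence):
        nonlocal position
        left = primary()
        while (position < len(tokens)
               and tokens[position] in PRECEDENCE
               and PRECEDENCE[tokens[position]] >= min_precedence):
            op = tokens[position]
            position += 1
            # left-associative operators must not pick up an equal-strength
            # operator on their right side, hence the + 1
            bump = 0 if op in RIGHT_ASSOCIATIVE else 1
            right = expression(PRECEDENCE[op] + bump)
            left = APPLY[op](left, right)
        return left

    return expression(1)


print(evaluate("3+2+1"))    # (3+2)+1
print(evaluate("3+2*6"))    # 3+(2*6)
print(evaluate("(1+2)*4"))
```

Changing the numbers in PRECEDENCE is enough to change the order of evaluation, which is why the rule is kept in a table rather than hard-coded.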

About syntax trees

We can represent precedence of expressions in tree-like form

e.g. 1+2+3

1 + 2 + 3
|___|   |
  |_____|
     |

(1 + 2) * 4
 |___|    |
   |______|
      |
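In code, such a tree is often stored as nested data. The sketch below writes the two trees above by hand as nested tuples of the form (operator, left, right). That layout, and the function names render and evaluate, are just one possible choice made for this post.

```python
# Each node is either a plain number or a tuple (operator, left, right).
tree_a = ("+", ("+", 1, 2), 3)   # 1+2+3, grouped from the left
tree_b = ("*", ("+", 1, 2), 4)   # (1+2)*4


def render(node, depth=0):
    """Return the tree as a list of indented lines, one node per line."""
    indent = "  " * depth
    if not isinstance(node, tuple):
        return [f"{indent}{node}"]
    op, left, right = node
    return [f"{indent}{op}"] + render(left, depth + 1) + render(right, depth + 1)


def evaluate(node):
    """Evaluate the deepest nodes first, then work back up to the root."""
    if not isinstance(node, tuple):
        return node
    op, left, right = node
    a, b = evaluate(left), evaluate(right)
    if op == "+":
        return a + b
    if op == "*":
        return a * b
    raise ValueError(f"unknown operator {op!r}")


print("\n".join(render(tree_b)))
print(evaluate(tree_a), evaluate(tree_b))
```

The deeper a node sits in the tree, the earlier it is evaluated, so the shape of the tree already encodes the precedence.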

The fifth step : Code Execution

Now we can execute our code. For an interpreter, it will be executed immediately, as part of the job is already done.

A compiler has to go further down until the code can run on the operating system, sometimes translating first into assembly code, as many tools exist that turn assembly into OS-specific programs.

About Optimisation

The same process above can be optimised. Take for example an interpreter that creates tokens for white spaces. An optimisation might be to ignore white spaces. In ways like that, years are spent on optimisation!
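To give a rough feel for the white space example, here are two toy token producers, one keeping white space and one dropping it. Both function names and the CHAR and SPACE labels are made up for the illustration.

```python
def tokens_keeping_spaces(source):
    """One token per character, white space included."""
    return [("SPACE" if ch.isspace() else "CHAR", ch) for ch in source]


def tokens_skipping_spaces(source):
    """One token per character, white space thrown away."""
    return [("CHAR", ch) for ch in source if not ch.isspace()]


source = "x = 3 + 1"
print(len(tokens_keeping_spaces(source)), "tokens with spaces")
print(len(tokens_skipping_spaces(source)), "tokens without spaces")
```

Every later step has fewer tokens to walk through. A real implementation would still have to keep the white space that sits inside strings.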

Stay tuned!
