- Why NASM? -
2014-11-18

-----------------------------
= 1 - What is an assembler? =
-----------------------------

Grossly simplified:

-------------------------------------------------------
   Wuss                Badass 
    |                    |                             
    |                    |                             
 ___v___              ___v____               _________ 
|       |  compiler  |        |  assembler  |  .COM   |
|C, C++,| ---------> |Assembly| ----------> |(machine |
|PASCAL | (COMPLEX!) |language|  (SIMPLE!)  |language)|
|_______|            |________|             |_________|
                                                 |     
                                                 |     
                                                _v_    
                                               |   |   
                                               |CPU|
                                               |___|
-------------------------------------------------------

Let's say we want to write a program that beeps the speaker and then immediately quits.

C:
    #include <stdio.h>
    int main()
    {
        printf("%c", 0x7);  /* ASCII character 7 = 'BELL' */
        return 0;
    }

C is relatively divorced from the hardware.  This means that programs are easier to port to other operating systems and other CPUs.  If someone has already ported the C compiler, you can just recompile your program on the new platform and the compiler spits out assembly language compatible with the new machine's hardware and OS.  The above program will compile and work on MS-DOS, Windows and Linux without modifications.

But there's a cost too.  A typical C toolchain will turn that simple program into the equivalent of thousands of lines of assembly, most of it printf and C runtime startup code.  It's kind of disgusting really.  Because we are not wusses, we only care about running our program on DOS.  Therefore we are free to implement the code directly in assembly language as follows.

Assembly language:
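    mov     ax, 0E07h       ; AH = 0Eh (BIOS teletype output), AL = 07h (BELL)
    int     10h             ; call the BIOS video services
    mov     ah, 4Ch         ; AH = 4Ch (DOS "terminate program")
    int     21h             ; call DOS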

Machine code:
    B8 07 0E
    CD 10
    B4 4C
    CD 21

This program is 9 bytes.  That's about 1,000 times smaller than the .EXE you get from compiling the C program in Watcom C!

As the program runs, the CPU fetches machine code from memory a few bytes at a time.  It decodes these numbers to determine what it should do next.

Some of the numbers in the machine code will look familiar.  The constants 0E07, 10, 4C, and 21 from the assembly language also show up in the machine code.  The two-byte constant 0E07 has its bytes flipped around (07 0E) because x86 CPUs store the low byte first.  There are some new numbers too.  These are the numeric op codes for the instructions 'mov' and 'int'.  Notice 'mov ax' becomes B8, while 'mov ah' becomes B4, because the destination register is folded into the op code.  'int', which is used twice, shows up the same both times (CD).
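
To see the decoding from the CPU's side, here is a rough sketch in Python that walks the nine bytes and turns them back into mnemonics.  It only understands the two instruction forms used in this program ('mov register, constant' and 'int'), and the function names are made up for illustration.

```python
# Register names indexed by the 3-bit register number folded into the op code.
REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]
REG8 = ["al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"]


def hex_literal(value, digits):
    """Format a number the way the listing writes it, e.g. 0E07h."""
    text = f"{value:0{digits}X}"
    if text[0].isalpha():
        text = "0" + text  # assemblers want hex literals to start with a digit
    return text + "h"


def disassemble(code):
    """Decode machine code that uses only mov-immediate and int."""
    lines = []
    i = 0
    while i < len(code):
        opcode = code[i]
        if 0xB8 <= opcode <= 0xBF:      # mov r16, imm16
            value = code[i + 1] | (code[i + 2] << 8)  # low byte comes first
            lines.append(f"mov {REG16[opcode - 0xB8]}, {hex_literal(value, 4)}")
            i += 3
        elif 0xB0 <= opcode <= 0xB7:    # mov r8, imm8
            lines.append(f"mov {REG8[opcode - 0xB0]}, {hex_literal(code[i + 1], 2)}")
            i += 2
        elif opcode == 0xCD:            # int imm8
            lines.append(f"int {hex_literal(code[i + 1], 2)}")
            i += 2
        else:
            raise ValueError(f"unsupported opcode {opcode:02X}")
    return lines


program = bytes.fromhex("B8 07 0E CD 10 B4 4C CD 21")
for line in disassemble(program):
    print(line)
```

B4 is B0 plus 4, and AH is 8-bit register number 4.  B8 is B8 plus 0, and AX is 16-bit register number 0.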

Assembly language and machine code are not very different.  You can pretty much implement an assembler using find-and-replace to translate from the 'mnemonic' CPU instruction names into the numeric op codes in the machine code.  
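
Here is roughly what that "find-and-replace" assembler could look like, as a Python sketch.  It is a toy, not a real assembler: it handles only 'mov register, constant' and 'int', and every name in it is invented for this example.

```python
# Register numbers that get added to the base op code.
REG16 = {"ax": 0, "cx": 1, "dx": 2, "bx": 3, "sp": 4, "bp": 5, "si": 6, "di": 7}
REG8 = {"al": 0, "cl": 1, "dl": 2, "bl": 3, "ah": 4, "ch": 5, "dh": 6, "bh": 7}


def parse_number(text):
    """Accept hex with an 'h' suffix (4Ch) or plain decimal (76)."""
    text = text.strip().lower()
    return int(text[:-1], 16) if text.endswith("h") else int(text)


def assemble_line(line):
    """Translate one 'mov reg, imm' or 'int imm' line into machine code."""
    mnemonic, operands = line.split(None, 1)
    args = [arg.strip().lower() for arg in operands.split(",")]
    if mnemonic.lower() == "int":
        return bytes([0xCD, parse_number(args[0])])
    if mnemonic.lower() == "mov":
        reg, value = args[0], parse_number(args[1])
        if reg in REG16:
            # op code B8+reg, then the constant with its low byte first
            return bytes([0xB8 + REG16[reg]]) + value.to_bytes(2, "little")
        if reg in REG8:
            return bytes([0xB0 + REG8[reg], value])
    raise ValueError(f"cannot assemble: {line!r}")


def assemble(source):
    """Assemble every non-blank line and glue the bytes together."""
    return b"".join(assemble_line(l) for l in source.splitlines() if l.strip())


SOURCE = """
    mov     ax, 0E07h
    int     10h
    mov     ah, 4Ch
    int     21h
"""

machine_code = assemble(SOURCE)
print(" ".join(f"{b:02X}" for b in machine_code))
print(len(machine_code), "bytes")
```

Writing those bytes to a file with a .COM extension should give you the same program as the machine code listing above.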

-----------------------------------------------------------------
= 2 - If assemblers are so simple, why choose one over another? =
-----------------------------------------------------------------

Assemblers differ in what they do for you, and what they force you to keep track of yourself.  At one extreme, you have the notional "search and replace" assembler I mentioned above, which is very simple.  At the other extreme, you have programs like MASM (Microsoft Macro Assembler) and TASM (Borland Turbo Assembler).  They do a lot of work for you: they keep track of variable types, allow simplified ways of accessing memory, and the like.  This can be nice, but it can also make your programs very hard to understand.

For example, consider the following line of MASM code:

    mov     Points,0

This code could be doing a lot of different things:

    - moving one byte, 00, into the byte of memory at address 'Points'

    - moving two bytes, 0000, into the word at address 'Points'

    - moving four bytes, 00000000, into the dword at address 'Points'

The only way to find out what it's doing is to look up the declaration of the variable 'Points' and see how big it is.
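
The three readings are not just different on paper; they assemble to different machine code.  The Python sketch below builds the bytes for each one, assuming 16-bit code and a made-up address of 0200h for 'Points'.  The function name is invented for this example.

```python
def encode_mov_zero(address, size):
    """Encode 'mov [address], 0' for 16-bit code.

    size is the number of bytes written: 1 (byte), 2 (word) or 4 (dword).
    """
    modrm = 0x06                           # "a plain 16-bit address follows"
    addr = address.to_bytes(2, "little")   # low byte first
    if size == 1:
        return bytes([0xC6, modrm]) + addr + bytes(1)
    if size == 2:
        return bytes([0xC7, modrm]) + addr + bytes(2)
    if size == 4:
        # The 66h prefix switches the operand size to 32 bits in 16-bit code.
        return bytes([0x66, 0xC7, modrm]) + addr + bytes(4)
    raise ValueError("size must be 1, 2 or 4")


POINTS = 0x0200  # hypothetical address of the variable 'Points'

for size in (1, 2, 4):
    encoded = encode_mov_zero(POINTS, size)
    print(size, "->", " ".join(f"{b:02X}" for b in encoded))
```

Same source line, three different instructions, and nothing on the line itself tells you which one you get.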

Confusing matters further, in MASM that line means exactly the same thing with square brackets around the variable name.  So these two lines are equivalent:

    mov     Points,0
    mov     [Points],0

But these two lines are not equivalent, because square brackets have a special meaning when you are using registers:

    mov     ax,bx           ; copy the value in BX into AX
    mov     ax,[bx]         ; Different behavior!  Load the word at the address held in BX

By comparison, NASM (Netwide Assembler) is brain-dead simple, and that is a Good Thing.  Here's how you would write that first statement in NASM:

    mov     byte [Points], 0

We know just by looking at it: this line moves one byte, 00, into the byte at address 'Points'.  The square brackets are mandatory:  they say that Points is an address.  The "byte" says to move just one byte.  NASM refuses to assemble the line without a size, so it is completely unambiguous.

Here's another example of NASM's syntax being way clearer than MASM's:

MASM:
    mov     al,1000h[ebx][edx*2]        ; WHAT!

NASM:
    mov     al,[1000h + ebx + edx*2]    ; Oh. It's arithmetic.
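
To spell out that arithmetic, here is a small Python sketch of how the address inside the brackets gets computed.  The register values are invented for the example, and the function name is hypothetical.

```python
def effective_address(displacement, base, index, scale=1):
    """Compute displacement + base + index*scale, wrapped to 32 bits."""
    return (displacement + base + index * scale) & 0xFFFFFFFF


# Hypothetical register contents at the moment the instruction runs.
ebx = 0x20
edx = 3

address = effective_address(0x1000, ebx, edx, scale=2)
print(f"{address:X}h")
```

With these values the instruction reads the byte at 1000h + 20h + 6.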

The point of assembly language is knowing exactly what each line of the program is doing to the CPU and memory.  Assemblers that do lots of funny business tend to get in the way of that.  If you'd benefit from fancy "smart" features, it's usually a sign that you should implement your program in C, or some other language more powerful than assembly.