CasuallyBlue's Site



Creating a hello world program in machine code

It’s post three of the series and we’re finally ready to write some actual machine code. The first thing we want to do is to test that the program can actually run something simple, so we will just have the program return immediately instead of segmentation faulting.

In a normal program we could just return from the function, but since we are writing a very basic ELF binary we don’t have any of the normal libc features. Because of this we’ll have to manually make the exit system call to end our program.

The Linux system call ABI

To make a system call or “syscall” we load values into specific registers and then trigger interrupt 0x80, which instructs the kernel to service our request. Unlike a normal function call, all the syscall arguments are passed in registers rather than on the stack.
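As a rough sketch of this convention, the Python snippet below maps a syscall number and its arguments onto the registers used by the 32-bit `int 0x80` interface (eax for the number, then ebx, ecx, edx, esi, edi, ebp for the arguments). The function name `syscall_registers` is made up for illustration; it only models the register layout and does not make a real system call.

```python
# Registers that carry syscall arguments for the int 0x80 interface, in order.
ARGUMENT_REGISTERS = ["ebx", "ecx", "edx", "esi", "edi", "ebp"]

def syscall_registers(number, *args):
    """Return the register contents needed before executing int 0x80.

    Hypothetical helper: it only describes the layout, nothing is executed.
    """
    if len(args) > len(ARGUMENT_REGISTERS):
        raise ValueError("int 0x80 syscalls take at most six arguments")
    registers = {"eax": number}  # the syscall number always goes in eax
    registers.update(zip(ARGUMENT_REGISTERS, args))
    return registers

# The exit syscall is number 1 and takes the exit status as its only argument.
print(syscall_registers(1, 0))  # {'eax': 1, 'ebx': 0}
```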

Writing our program

For our program we just want to exit immediately with a return value, so we will use the exit syscall, which is syscall number one. We pass the syscall number in the eax register and the return value in the ebx register. The basic assembly code to do this is:

mov eax, 1 ; We are calling system call 1
mov ebx, 0 ; We want to return 0 from the program
int 0x80 ; Execute the 0x80 processor interrupt

Our next step is converting this assembly code into machine code. We could run it through an assembler and get the code, but we’re trying to write a compiler from scratch, so instead we’ll manually encode these instructions into hex. Make sure you make a copy of your ELF header first: you won’t want to mess it up, and we’ll write several different programs that you may want to keep around.

We first want to mov the immediate value 1 into the eax register, so we need to open up the Intel Software Developer’s Manual and look up the encoding of the mov instruction.

Here is the section of the table that corresponds to the version of the instruction that we want:

Opcode | Instruction | Op/En | 64-Bit Mode | Compat / Leg Mode | Description
B8+ rd id | MOV r32, imm32 | OI | Valid | Valid | Move imm32 to r32

The only important fields for us now are the Opcode field and the Op/En (Operand Encoding) field. We next need to look at the table of operand encodings for mov to see how we need to encode our numbers:

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
OI | opcode + rd (w) | imm8/16/32/64 | NA | NA

The first byte of the mov instruction is 0xb8 ORed with the register number. If we look at the binary representation of the opcode (0b10111000) we can see that the last three bits of the opcode are zero, leaving us enough space to select between the first eight registers. The eax register is at index zero, leaving us with just 0xb8. The next part of the instruction is the “immediate” value, which is encoded directly in the instruction as a 32-bit little-endian number, so 1 becomes the bytes 0x01 0x00 0x00 0x00. This gives us a final instruction encoding of:

0xb8 // Opcode
0x01 0x00 0x00 0x00 // Syscall number

We now need to encode mov ebx, 0, which is pretty similar except that the immediate value is 0 and our register index is 3, or 0b011. This gives us an opcode of 0xbb and an immediate value of 0x00000000, giving us an encoding of:

0xbb // Opcode
0x00 0x00 0x00 0x00 // Return value
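The two hand encodings above follow the same rule, which can be sketched in Python as a check on the arithmetic. The helper name `encode_mov_r32_imm32` and the `REGISTER_INDEX` table are illustrative names, not part of any existing tool; the register numbers are the ones the Intel manual assigns to the first four 32-bit registers.

```python
import struct

# Register numbers used by the "B8+ rd" opcode form.
REGISTER_INDEX = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3}

def encode_mov_r32_imm32(register, value):
    """Encode `mov r32, imm32`: one opcode byte, then a little-endian imm32."""
    opcode = 0xB8 | REGISTER_INDEX[register]  # low three bits select the register
    return bytes([opcode]) + struct.pack("<I", value)

print(encode_mov_r32_imm32("eax", 1).hex(" "))  # b8 01 00 00 00
print(encode_mov_r32_imm32("ebx", 0).hex(" "))  # bb 00 00 00 00
```

Here `struct.pack("<I", value)` produces the four-byte little-endian form of the immediate, which is why the 1 ends up in the first byte after the opcode.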

We now must call the interrupt instruction int with our interrupt number of 0x80:

Opcode | Instruction | Op/En | 64-Bit Mode | Compat / Leg Mode | Description
CD ib | INT imm8 | I | Valid | Valid | Generate software interrupt with vector specified by immediate byte.

We can see that this instruction is just the opcode byte followed by the immediate byte representing the interrupt number, giving us:

0xCD // Opcode
0x80 // Interrupt number

We can now put this code right after our ELF header, giving us a working binary that immediately returns zero. If we want a different return value, we can change the value we put in ebx, encoded as a 32-bit little-endian value.
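To double-check the full byte sequence before typing it into a hex editor, the whole exit program can be generated with a few lines of Python. This is only a sketch: the helper names (`mov_r32_imm32`, `int_imm8`, `exit_program`) are invented for this example.

```python
import struct

EAX, EBX = 0, 3  # register numbers for the "B8+ rd" opcode form

def mov_r32_imm32(register_index, value):
    """mov r32, imm32: opcode 0xB8 | register, then little-endian imm32."""
    return bytes([0xB8 | register_index]) + struct.pack("<I", value)

def int_imm8(vector):
    """int imm8: opcode 0xCD followed by the interrupt number."""
    return bytes([0xCD, vector])

def exit_program(status):
    """Machine code for: mov eax, 1 / mov ebx, status / int 0x80."""
    return mov_r32_imm32(EAX, 1) + mov_r32_imm32(EBX, status) + int_imm8(0x80)

print(exit_program(0).hex(" "))  # b8 01 00 00 00 bb 00 00 00 00 cd 80
```

Calling `exit_program(42)` changes only the bytes after 0xbb. After running the resulting binary, `echo $?` in the shell shows the exit status; the shell only reports the low 8 bits of the value.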

Hello World!

Now that we can return a value, it’s finally time to print “Hello World!” from machine code. It’s going to be a bit more complicated, but it’s still a fairly simple program. Like before, we’ll start with an assembly language program and hand assemble it into machine code. This time we’ll be using the write syscall in addition to the exit syscall to end our program. Here’s the basic assembly:

mov eax, 4 ; 4 is the number of the write syscall
mov ebx, 1 ; We want to write bytes to file descriptor 1 (also known as standard output)
mov ecx, hello ; We want to load the address of the string to print
mov edx, 13 ; The number of bytes we want to print from the string (12 characters plus the newline)
int 0x80 ; Execute syscall

mov eax, 1 ; Exit syscall
mov ebx, 0 ; return 0
int 0x80 ; Execute syscall

hello: db "Hello World!", 0x0A ; Hello world + newline

This is a few more instructions than the previous program, but they’re all instructions that we already know how to assemble, so the only thing we need to do is calculate the address of the string and put it in as an immediate value in ecx. We need to calculate how many bytes after the start of the code the string will be stored, add that to the address where the code starts, and write the result into ecx.
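The calculation can be sketched in Python as a sanity check. The instruction sizes come from the encodings worked out earlier; the code start address of 0x400078 is an assumption about where the ELF header from the earlier posts places the code, so substitute your own value if your header differs.

```python
import struct

MOV_IMM32_SIZE = 5     # one opcode byte + four immediate bytes
INT_SIZE = 2           # opcode 0xCD + interrupt number
CODE_START = 0x400078  # assumption: address of the first code byte in memory

# Six mov instructions and two int instructions come before the string.
offset = 6 * MOV_IMM32_SIZE + 2 * INT_SIZE
string_address = CODE_START + offset

print(hex(offset))                                 # 0x22
print(hex(string_address))                         # 0x40009a
print(struct.pack("<I", string_address).hex(" "))  # 9a 00 40 00
```

The last line is the little-endian byte order in which the address has to appear after the mov ecx opcode.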

Each mov instruction with an immediate takes up five bytes, and there are six of them, giving us 30 bytes, or an offset of 0x1E. The two interrupt calls each take up two bytes, which adds four bytes, for a final offset of 0x22. We now add that offset to the address of the code start, 0x0000000000400078 (stored in little-endian byte order as 78 00 40 00 00 00 00 00). Since the upper half of the number is all zeros, we can truncate the value to 32 bits; the upper bits get cleared anyway when we load the register. Adding the offset to the address, we get 0x0040009A, which we load into ecx as the little-endian bytes 9A 00 40 00. Here’s what the binary code will look like:

// Print Hello World
0xB8 // mov eax,
0x04 0x00 0x00 0x00 // 4 (write syscall)
0xBB // mov ebx,
0x01 0x00 0x00 0x00 // 1 (file descriptor of stdout)
0xB9 // mov ecx,
0x9A 0x00 0x40 0x00 // Address of string in memory
0xba // mov edx,
0x0D 0x00 0x00 0x00 // number of bytes to write
0xCD 0x80 // Execute syscall

// Exit Program
0xB8 // mov eax,
0x01 0x00 0x00 0x00 // 1 (exit syscall)
0xBB // mov ebx,
0x00 0x00 0x00 0x00 // return 0
0xCD 0x80 // execute syscall

0x48 0x65 0x6C 0x6C 0x6F 0x20 0x57 0x6F 0x72 0x6C 0x64 0x21 0x0A // ascii characters for hello world as hex
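As a cross-check on the hand assembly, the listing can be regenerated in Python. This is a sketch under stated assumptions: the helper names are invented, the code start address 0x400078 is assumed from the ELF header used in this series, and the write goes to file descriptor 1 (standard output). Rather than counting instruction sizes by hand, it builds the code once to measure its length and then again with the real string address.

```python
import struct

EAX, ECX, EDX, EBX = 0, 1, 2, 3  # register numbers for "B8+ rd"
CODE_START = 0x400078            # assumption: address of the first code byte
MESSAGE = b"Hello World!\n"

def mov(register_index, value):
    """mov r32, imm32: opcode 0xB8 | register, then little-endian imm32."""
    return bytes([0xB8 | register_index]) + struct.pack("<I", value)

def interrupt(vector):
    """int imm8: opcode 0xCD followed by the interrupt number."""
    return bytes([0xCD, vector])

def build_code(string_address):
    write_call = (mov(EAX, 4) + mov(EBX, 1) + mov(ECX, string_address)
                  + mov(EDX, len(MESSAGE)) + interrupt(0x80))
    exit_call = mov(EAX, 1) + mov(EBX, 0) + interrupt(0x80)
    return write_call + exit_call

# The code length does not depend on the address, so measure it with a dummy.
code_size = len(build_code(0))
string_address = CODE_START + code_size
program = build_code(string_address) + MESSAGE

print(hex(string_address))  # 0x40009a
print(program.hex(" "))
```

The printed bytes can be compared one by one against what was typed into the hex editor.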

Now that we have a working hello world, our next step will be to implement a simple program to read hexadecimal characters from the input and write them to the output as bytes. This will make writing machine code easier, as we will no longer have to use a hex editor to write code into a file. That’s all for this post though; we’ll start the implementation of our hex writer in the next post.