|
#1
| |||
| |||
| Hello, I just wanted to know if anyone knew of any good tutorials that taught how to write an assembler for x86 machine code. Also if there was any good tutorial that taught the 86 machine instructions (how the hex machine code works/is put together etc.). The best I know of is the Intel Software Developer Manuals, but I was wondering if there might be any good references/tutorials etc. online. |
|
#2
| |||
| |||
| On Sep 7, 11:56 am, Nick Mudge <spamt...@crayne.org> wrote:
> Hello,
> I just wanted to know if anyone knew of any good tutorials that taught
> how to write an assembler for x86 machine code. Also if there was any
> good tutorial that taught the 86 machine instructions (how the hex
> machine code works/is put together etc.). The best I know of is the
> Intel Software Developer Manuals, but I was wondering if there might
> be any good references/tutorials etc. online.

If you don't know bits, bytes and different base representations (as might be inferred from "how the hex machine code is put together"), you shouldn't be writing an assembler but should instead focus on learning the basics.

Now, the Intel manuals you mentioned will give you the most info on how instructions are encoded. The AMD ones will work too. In fact, you should consult both whenever either one leaves some uncertainty. And then you may want or need to use tools (disassemblers and other assemblers) to clear up any remaining confusion.

A basic assembler is pretty easy to do. All you need to do is:
1. parse the source file, getting an instruction from each line (if any)
2. encode the instruction and emit its bytes into a binary file

However, to do that you will need to make several passes over the code being assembled. The reason is forward references: you don't know the address of an instruction until you have reached it by generating code for all the previous instructions. Example:

    CMP EAX, EBX
    JNE L1
    MOV EAX, 0
    JMP L2
L1:
    MOV EAX, 1
L2:
    RET

Here you can't fully encode JNE L1 until you know how many bytes the instructions between it and L1 need. The same applies to JMP L2 and the instructions between it and L2.

Your assembler should also support very basic arithmetic expressions. Very often there's a need to calculate the distance between data objects or between instructions. This can be used to determine an object's size. So, your assembler should be able to assemble an instruction like this:
    MOV EAX, L2 - L1
or a data variable declaration like this:
    V DD L2 - L1

In fact, it must be able to assemble any instruction with a purely additive/subtractive expression involving addresses, e.g.:
    LEA EAX, L1 + 5
or
    LEA EAX, V + (L2 - L1)
and any expression involving a difference of addresses (if, of course, it supports more complex expressions):
    MOV EAX, ((L2 - L1) * 5) SHR 1

You will need to support some form of the ORG operator/macro to tell the assembler at what rIP value the following instruction is supposed to be executed, since compiled x86 code generally isn't position independent. I think some special name for the entry point should also be supported if the output binary/object file is supposed to support an arbitrary entry point and it needs to be indicated somewhere in the file.

A rich assembler has the following features:
1. support for expressions
2. support for all CPU modes
3. macros and special symbols
4. support for data structures
5. support for segments
6. code/data alignment
7. support of intermediate object files as the output format
8. symbolic/debug information generation
9. support for listing files
10. support for referencing of external objects
11. inclusion of text source files and binary data files
12. etc

For a good C programmer familiar with the x86 assembly language, a basic assembler (not including most of the above list) would probably take no more than a month to implement and test.

Alex |
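The forward-reference problem described in the post above can be sketched as a two-pass assembler. This is only a minimal illustration in Python, not the poster's design. It assumes 32-bit code, it always uses the 2-byte short form (rel8) for JNE and JMP, and it looks the other instructions up in a hard-coded table instead of really encoding them. The names `FIXED`, `SHORT_JUMPS`, `EXAMPLE`, `parse`, `size_of` and `assemble` are hypothetical.

```python
# Hypothetical, deliberately tiny encoding table (32-bit mode assumed):
# fixed byte sequences for the non-jump instructions in the example.
FIXED = {
    "CMP EAX, EBX": bytes([0x39, 0xD8]),
    "MOV EAX, 0": bytes([0xB8, 0x00, 0x00, 0x00, 0x00]),
    "MOV EAX, 1": bytes([0xB8, 0x01, 0x00, 0x00, 0x00]),
    "RET": bytes([0xC3]),
}
# Short-jump opcodes (opcode byte + signed 8-bit displacement).
SHORT_JUMPS = {"JNE": 0x75, "JMP": 0xEB}

EXAMPLE = """
    CMP EAX, EBX
    JNE L1
    MOV EAX, 0
    JMP L2
L1:
    MOV EAX, 1
L2:
    RET
"""


def parse(source):
    """Return a list of ('label', name) and ('insn', text) items."""
    items = []
    for line in source.splitlines():
        text = " ".join(line.split())  # trim and collapse whitespace
        if not text:
            continue
        if text.endswith(":"):
            items.append(("label", text[:-1]))
        else:
            items.append(("insn", text))
    return items


def size_of(text):
    """Size in bytes of one instruction; short jumps are always 2 bytes."""
    if text.split()[0] in SHORT_JUMPS:
        return 2
    return len(FIXED[text])


def assemble(source, origin=0):
    """Assemble the source; origin plays the role of an ORG directive."""
    items = parse(source)

    # Pass 1: assign an address to every label without emitting anything.
    labels = {}
    address = origin
    for kind, text in items:
        if kind == "label":
            labels[text] = address
        else:
            address += size_of(text)

    # Pass 2: all labels are known now, so jumps can be fully encoded.
    code = bytearray()
    address = origin
    for kind, text in items:
        if kind == "label":
            continue
        mnemonic = text.split()[0]
        if mnemonic in SHORT_JUMPS:
            target = labels[text.split()[1]]
            # Displacement is relative to the address of the NEXT instruction.
            rel = target - (address + 2)
            if not -128 <= rel <= 127:
                raise ValueError(f"{text}: target out of short-jump range")
            code += bytes([SHORT_JUMPS[mnemonic], rel & 0xFF])
        else:
            code += FIXED[text]
        address += size_of(text)
    return bytes(code), labels


code, labels = assemble(EXAMPLE)
print(code.hex(" "))  # 39 d8 75 07 b8 00 00 00 00 eb 05 b8 01 00 00 00 c3
print(labels)         # {'L1': 11, 'L2': 16}
```

Two passes are enough here only because every instruction size is known in advance. Once the assembler chooses between short and near jump forms depending on the distance, instruction sizes depend on label addresses and the sizing pass may have to be repeated until nothing changes, which is presumably why the post speaks of "several passes".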
|
#3
| |||
| |||
| On Sep 7, 5:03 pm, "Alexei A. Frounze" <spamt...@crayne.org> wrote:
> If you don't know bits, bytes and different base representations (as
> might be inferred from "how the hex machine code is put together"),
> you shouldn't be writing an assembler but instead focus on learning
> the basics.

Please don't insult me. By hex numbers, what I meant was I want to learn how the machine instructions are put together and how they are used together etc. I mentioned hex numbers to differentiate assembly code, meaning that I didn't want people to tell me to learn an assembly language, I want to understand it at a lower level, meaning the numeric values that represent the instructions. Of course you'd need to know this in order to write an assembler. I haven't found much that describes the anatomy of the numeric machine instructions.

I didn't think of reading AMD's manuals. I checked it out. Looks pretty good.

Thanks for the other info. |
|
#4
| |||
| |||
| On Sep 8, 8:23 am, Nick Mudge <spamt...@crayne.org> wrote:
> Please don't insult me.

I'm sorry. I must have misinterpreted your words.

> By hex numbers, what I meant was I want to
> learn how the machine instructions are put together

They're encoded numerically as described in the manual. They are then usually placed one after another and executed sequentially in that order unless there are jumps, calls, rets or exceptions.

> and how they are used together etc.

Just like C operators. The instructions are tiny building blocks, each of them doing very little work. You use many of them to do something more significant and useful than just moving data around memory and doing arithmetic and bit operations on it.

> I mentioned hex numbers to differentiate assembly
> code, meaning that I didn't want people to tell me to learn an
> assembly language, I want to understand it at a lower level, meaning
> the numeric values that represent the instructions. Of course you'd
> need to know this in order to write an assembler. I haven't found much
> that describes the anatomy of the numeric machine instructions.

I'm not sure what you're looking for. If it's instruction encoding, it's described in the manuals. Despite needing a lot of text and tables to describe it, it's actually pretty straightforward. Every distinct instruction has a unique numerical code that's encoded with a number of whole or fractional bytes. Many instructions share a common structure in the encoding, for example the ModR/M byte, which says what operands the instruction operates on.

Example: ADD Eb, Gb. Eb stands for a byte operand that can be (depending on the encoding) a memory location or a general-purpose register (AL, AH, etc.). Gb is a byte-sized general-purpose register.

Encoding: the "opcode" byte (the aforementioned unique number), followed by the ModR/M byte, possibly followed by a SIB byte and/or displacement bytes. The opcode byte value is 00h. The instruction may have optional prefix bytes before the opcode byte.

The complete encoding for ADD AL, AL would be: 00h (opcode byte), 0C0h (ModR/M byte). The 2 top bits ("mod") of 0C0h define whether Eb is a memory or register location (11B is register; 00B, 01B and 10B are memory). Bits 2 through 0 ("r/m") further define Eb's location (in the case of a register, it's just the register's index, 0 for AL/AX/EAX; in the case of memory it's more complicated than that, but the idea is the same: mod is used together with r/m to encode the memory location). Bits 5 through 3 ("reg") define Gb's location (again, 0 for AL/AX/EAX).

Now, for XOR Gv, Ev, where Gv=eAX and Ev=eDI, you'd have this encoding: 33h, 0C7h (mod=11B, reg=000B, r/m=111B).

If for some reason you're trying to understand why it's 00h for ADD, 33h for XOR and 000B for AL/AX/EAX, it's so because many years ago Intel decided it to be so. These numbers are obviously related to the CPU's internal implementation, but you don't need to worry about them much. At least, you shouldn't care about why 000B encodes AL/AX/EAX and not CL/CX/ECX.

See the rest in the manuals. Ask questions if something's unclear.

> I didn't think of reading AMD's manuals. I checked it out. Looks
> pretty good.
> Thanks for the other info.

You're welcome. I'm still unsure if I got your questions and estimated your knowledge correctly.

Alex |
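The ModR/M layout described in the post above (mod in bits 7-6, reg in bits 5-3, r/m in bits 2-0) can be checked with a few lines of Python. This is a sketch only: it covers just the register-to-register form (mod=11B), it assumes 32-bit operand size for the XOR case, and the helper names `modrm`, `add_eb_gb` and `xor_gv_ev` are made up for illustration. With Gv=EAX and Ev=EDI the fields mod=11B, reg=000B, r/m=111B pack to 0C7h.

```python
# Register numbers used in the ModR/M "reg" and "r/m" fields.
REG8 = {"AL": 0, "CL": 1, "DL": 2, "BL": 3, "AH": 4, "CH": 5, "DH": 6, "BH": 7}
REG32 = {"EAX": 0, "ECX": 1, "EDX": 2, "EBX": 3,
         "ESP": 4, "EBP": 5, "ESI": 6, "EDI": 7}


def modrm(mod, reg, rm):
    """Pack the fields: mod in bits 7-6, reg in bits 5-3, r/m in bits 2-0."""
    assert 0 <= mod <= 3 and 0 <= reg <= 7 and 0 <= rm <= 7
    return (mod << 6) | (reg << 3) | rm


def add_eb_gb(dst, src):
    """ADD Eb, Gb (opcode 00h): Eb (destination) goes in r/m, Gb in reg."""
    return bytes([0x00, modrm(0b11, REG8[src], REG8[dst])])


def xor_gv_ev(dst, src):
    """XOR Gv, Ev (opcode 33h): Gv (destination) goes in reg, Ev in r/m."""
    return bytes([0x33, modrm(0b11, REG32[dst], REG32[src])])


print(add_eb_gb("AL", "AL").hex(" "))    # 00 c0
print(xor_gv_ev("EAX", "EDI").hex(" "))  # 33 c7
```

Note how the destination lands in a different field for the two opcodes: the operand order in the opcode map (Eb, Gb versus Gv, Ev) tells you which operand is encoded in r/m and which in reg.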
|
#5
| |||
| |||
| Hi,

On Sep 8, 10:23 am, Nick Mudge <spamt...@crayne.org> wrote:
> On Sep 7, 5:03 pm, "Alexei A. Frounze" <spamt...@crayne.org> wrote:
> > On Sep 7, 11:56 am, Nick Mudge <spamt...@crayne.org> wrote:
>
> > > I just wanted to know if anyone knew of any good tutorials that taught
> > > how to write an assembler for x86 machine code. Also if there was any
> > > good tutorial that taught the 86 machine instructions (how the hex
> > > machine code works/is put together etc.).
>
> Please don't insult me. By hex numbers, what I meant was I want to
> learn how the machine instructions are put together and how they are
> used together etc. I mentioned hex numbers to differentiate assembly
> code, meaning that I didn't want people to tell me to learn an
> assembly language, I want to understand it at a lower level, meaning
> the numeric values that represent the instructions. Of course you'd
> need to know this in order to write an assembler. I haven't found much
> that describes the anatomy of the numeric machine instructions.

Well, there are two files in particular that I think might help you. Granted, they are somewhat outdated, but that's good, right? Less to worry about (no x86-64, etc). ;-)

http://www.eunet.bg/simtel.net/msdos/asmutl.html

In particular, TA980705.ZIP (an assembler in its own right) has OPCODE.TXT (talking about octal codes). And the other good one is DISASM.ZIP (complete 8086/186 disassembly tables).

You won't get more useful info, IMO, unless you look at the sources of other popular assemblers (NASM, GAS, JWasm, FASM, Octasm, Wolfware, etc). It depends on how simple or complex you want to make it. I'm sure you'd get more specific help if you stated what programming language, OS hosts, output formats, and bits (32, I assume) you intend to support. |
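As a side note on the "octal codes" mentioned above: the ModR/M byte splits into fields of 2, 3 and 3 bits, so its three octal digits are exactly mod, reg and r/m. The Python sketch below only illustrates that observation; it makes no claim about what OPCODE.TXT actually contains, and the helper names `split_modrm` and `octal_digits` are hypothetical.

```python
def split_modrm(byte):
    """Split a ModR/M byte into its (mod, reg, r/m) fields."""
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111


def octal_digits(byte):
    """Return the byte as three octal digits, e.g. 0xC7 -> '307'."""
    return f"{byte:03o}"


for value in (0xC0, 0xC7, 0xF8):
    print(f"{value:02X}h = {octal_digits(value)} octal, fields {split_modrm(value)}")
```

Read in octal, 307 is directly mod=3, reg=0, r/m=7, which is harder to see in the hex form C7h.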
|
#6
| |||
| |||
| "Nick Mudge" <spamtrap@crayne.org> wrote in message news:b4055ea7-157a-4e96-a43d-e8e9f04290df@n38g2000prl.googlegroups.com...
> I just wanted to know if anyone knew of any good tutorials that taught
> how to write an assembler for x86 machine code.

No.

The tables are a bit out of date. It's written in C. But I've found it easy to modify:

486dis_c.zip
http://www.eunet.bg/simtel.net/msdos/disasm.html

> Also if there was any
> good tutorial that taught the 86 machine instructions (how the hex
> machine code works/is put together etc.).

For a perspective a bit different from the manuals:

http://groups.google.com/group/alt.l...b864fb18cd8f63

I found a couple of trivial errors in one of his (Mark Hopkins) earlier posts. I don't know about this one. In general, his earlier post was extremely accurate once expanded and converted into a form I could check.

Rod Pemberton |
|
#7
| |||
| |||
| On Sep 7, 12:56 pm, Nick Mudge <spamt...@crayne.org> wrote:
> Also if there was any good tutorial that taught the 86 machine
> instructions (how the hex machine code works/is put together
> etc.)

The NASM manual has a fairly comprehensive listing of instructions in Appendix B, with details on how they're built in machine code, as well as a very short introduction to some of the formatting used.

http://home.comcast.net/~fbkotler/nasmdocb.html |
|
#8
| |||
| |||
| Nick Mudge wrote:
> Hello,
> I just wanted to know if anyone knew of any good tutorials that taught
> how to write an assembler for x86 machine code. Also if there was any
> good tutorial that taught the 86 machine instructions (how the hex
> machine code works/is put together etc.). The best I know of is the
> Intel Software Developer Manuals, but I was wondering if there might
> be any good references/tutorials etc. online.

I haven't seen any tutorials for creating complete tools.

If you want to know about all CPU instructions, AMD64 VOL 1..5 (pdf: 24592..24594, 24568, 24589) may be what you are searching for. VOL 3 contains shortcut lists and opcode maps; VOL 4/5 cover SSE.

HEXTUTOR is just a quick lookup reference and was mainly written to check the output of my disassembler. A DEMO version (windoze-PE with a few known bugs) is available:

http://web.utanet.at/schw1285/KESYS/index.htm [codesnips][HEXTUTOR]

__
wolfgang |