@Bazzy,
The idea of using two passes was to have the lexer simply convert source code into tokens, without having to understand what the tokens mean. The compiler proper then converts the tokens into bytecode, checking as it goes that adjacent tokens make sense next to each other. Alternatively, in the lexer, I could keep a variable that stores which tokens are allowed to follow the current one, but I think my current model is a little easier. It shouldn't be too difficult to split it up into multiple functions; all I was going to do was have each case in the switch call a different function (so that I don't, at first glance, have loops nested in switches nested in a switch nested in a loop, which is what I have currently).
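Splitting that big switch into per-token helpers could look something like this minimal sketch (the function names and token kinds are illustrative, not from my actual code):

```c
#include <ctype.h>
#include <stddef.h>

/* Each helper consumes one token and returns the index one past its end,
 * so the nested loops live in small functions instead of one big switch. */
static size_t lex_number(const char* src, size_t i) {
    while (isdigit((unsigned char)src[i]))
        ++i;
    return i;
}

static size_t lex_identifier(const char* src, size_t i) {
    while (isalnum((unsigned char)src[i]) || src[i] == '_')
        ++i;
    return i;
}

static void lex(const char* src) {
    size_t i = 0;
    while (src[i] != '\0') {
        switch (src[i]) {
        case '0': case '1': case '2': case '3': case '4':
        case '5': case '6': case '7': case '8': case '9':
            i = lex_number(src, i);
            break;
        default:
            if (isalpha((unsigned char)src[i]))
                i = lex_identifier(src, i);
            else
                ++i; /* skip whitespace / unrecognised characters */
            break;
        }
    }
}
```

The outer loop and switch stay flat; each case just hands the current position to a helper and takes back the position after the token.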
[edit]
But your idea of writing it as a sort of state machine is interesting. I might switch to that model. I don't know what you'd call what I have at the moment. It just scans the string for tokens (or the start of a token) and then figures out where the token ends, and tacks the whole thing onto the end of a buffer, along with an integer that says what the token is. It's just an array of these:
/** ttydel token types. */
typedef enum {
    /** Character constant. */
    TTYDEL_TOKEN_CONST_CHAR,
    /** Floating point constant. */
    TTYDEL_TOKEN_CONST_FLOAT,
    /** Integer constant. */
    TTYDEL_TOKEN_CONST_INT,
    /** String constant. */
    TTYDEL_TOKEN_CONST_STRING,
    /** Function definition. */
    TTYDEL_TOKEN_DEFN_FUNC,
    /** Type declaration. */
    TTYDEL_TOKEN_DECL_TYPE,
    /** Variable declaration. */
    TTYDEL_TOKEN_DECL_VAR,
    /** Binary operator. */
    TTYDEL_TOKEN_OPERATOR_BI,
    /** Unary operator. */
    TTYDEL_TOKEN_OPERATOR_UN,
} ttydel_token_type;

/** A ttydel token. */
typedef struct {
    /** The token type. */
    ttydel_token_type type;
    /** The token as a string. */
    char* value;
} ttydel_token;
I have some more token types to add (function calls, for example) but that's all I could think of at the time.
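Tacking tokens onto the end of that buffer could be sketched like this (a growable array; the enum is abridged to two members and all the helper names here are illustrative, not from my actual code):

```c
#include <stdlib.h>
#include <string.h>

/* Abridged versions of the definitions from the post above. */
typedef enum { TTYDEL_TOKEN_CONST_INT, TTYDEL_TOKEN_OPERATOR_BI } ttydel_token_type;
typedef struct { ttydel_token_type type; char* value; } ttydel_token;

/* A growable array of tokens. */
typedef struct {
    ttydel_token* data;
    size_t count;
    size_t capacity;
} token_buffer;

/* Copies `length` characters of token text and appends a token to the
 * buffer, doubling the capacity when it runs out. Returns 0 on success,
 * -1 on allocation failure. */
static int token_buffer_push(token_buffer* buf, ttydel_token_type type,
                             const char* start, size_t length) {
    if (buf->count == buf->capacity) {
        size_t new_cap = buf->capacity ? buf->capacity * 2 : 16;
        ttydel_token* grown = realloc(buf->data, new_cap * sizeof *grown);
        if (!grown)
            return -1;
        buf->data = grown;
        buf->capacity = new_cap;
    }
    char* value = malloc(length + 1);
    if (!value)
        return -1;
    memcpy(value, start, length);
    value[length] = '\0';
    buf->data[buf->count].type = type;
    buf->data[buf->count].value = value;
    ++buf->count;
    return 0;
}
```

The lexer only ever appends, so a flat array like this keeps the whole token stream contiguous for the second pass to walk.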
@EssGeEich,
For the functions, I want to allow function overloading so I'm going to take the name of a function and add its return type and the types of its parameters together (name mangling) like this:
int foo(int bar) => foo_int_int
int foo(float bar) => foo_int_float
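The mangling itself is just joining the pieces with underscores; a minimal sketch (the signature and names are illustrative, not from my actual code):

```c
#include <stdio.h>
#include <string.h>

/* Builds a mangled name from the function name, its return type, and
 * each parameter type in order, joined with underscores:
 * "foo" + "int" + {"float"} -> "foo_int_float". */
static void mangle(char* out, size_t out_size, const char* name,
                   const char* ret_type,
                   const char* const* param_types, size_t n_params) {
    snprintf(out, out_size, "%s_%s", name, ret_type);
    for (size_t i = 0; i < n_params; ++i) {
        size_t len = strlen(out);
        snprintf(out + len, out_size - len, "_%s", param_types[i]);
    }
}
```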
That should make it possible for functions to differ not only in their parameters, but also in their return types. Then, as long as every call to every function is mangled in the same way, there don't have to be any special cases. All the lexer needs to know is which version of the function is being called, which it can figure out quite easily by looking at the symbols immediately before and after the function call, building a mangled function name, and then looking in the symbol table for a match.
The symbol table will be a list of global symbol names and their types, generated by the lexer (every function then has its own local symbol table) and completed by the compiler, which goes through the symbol table(s), finds each symbol's index into the bytecode buffer, and writes it down. Then the whole thing gets sent as a single buffer to the server, which simply runs the program and uses it to figure out what to write and where. It sounds insecure, but the purpose of the bytecode programs is just to generate data to be drawn to the screen; there shouldn't be any way to exploit it (that's not to say no-one will find one, but if they do, it'll be something really obvious like a buffer overrun, and that's assuming anyone else ever looks at the source code or uses the program in any way).
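The symbol-table lookup described above could be as simple as this sketch (a flat list searched linearly; the field and function names are illustrative, not from my actual code):

```c
#include <stddef.h>
#include <string.h>

/* One entry per global symbol: its mangled name and, once the compiler
 * pass has run, its index into the bytecode buffer. */
typedef struct {
    const char* mangled_name;   /* e.g. "foo_int_float" */
    size_t bytecode_index;      /* filled in by the compiler pass */
} symbol_entry;

/* Returns the matching entry, or NULL if the mangled name is unknown. */
static const symbol_entry* symbol_lookup(const symbol_entry* table,
                                         size_t count,
                                         const char* mangled) {
    for (size_t i = 0; i < count; ++i)
        if (strcmp(table[i].mangled_name, mangled) == 0)
            return &table[i];
    return NULL;
}
```

Because overload resolution already happened when the name was mangled, the lookup itself never needs to know about overloading; a failed lookup just means an undefined (or wrongly-typed) call.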