C-style Syntax Highlighting Tutorial

Guest

Everyone,

I have spent weeks looking on the web for a good tutorial on how to use regular expressions and other methods to do FAST C-style syntax highlighting in C#, but I have yet to find anything useful.

I know there are people at MS who know this stuff like the back of their hand, and I know there are many people out on the web who are proficient in this as well, but it seems nobody wants to teach it.

I understand that this is not an easy subject, but does anybody know of any web tutorials (free or paid), books, or anybody who has written a complete tutorial that demonstrates, from start to finish, how to implement C-style syntax highlighting? I am talking about multiline comments, strings, keywords... all of the fun stuff, including not recognizing a single double-quote inside a multi-line /**/ comment as the start of a string, etc.

Any help would be greatly appreciated.
 
Nick Malik

I'm going to answer your question in a round-about way.

There are a couple of steps to building a compiler. The first is lexical
analysis: deciding which characters belong together as a token and turning
the source text into a stream of tokens. For example, in the expression
A = b * 2; the character '=' is a single token, as is the character '*', but
in the expression A *= b; the same two characters form the single
two-character token '*='. This requires a set of interpretation rules that
must be applied in a particular order (here, trying the longer operator
before the shorter ones).
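To make the ordering point concrete, here is a minimal sketch of a hand-written tokenizer. It is written in Python for brevity (the same structure carries over to C#), and the operator list and the function name `tokenize` are made up for illustration. The only thing it is meant to show is that '*=' has to be tried before '*' and '='.

```python
# Operators are listed longest first so that "*=" is tried before "*" or "=".
OPERATORS = ["*=", "=", "*", ";"]

def tokenize(source):
    """Split a tiny expression language into tokens (illustrative only)."""
    tokens = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():
            i += 1
            continue
        if ch.isalnum() or ch == "_":
            # Identifier or number: consume the whole run of word characters.
            start = i
            while i < len(source) and (source[i].isalnum() or source[i] == "_"):
                i += 1
            tokens.append(source[start:i])
            continue
        for op in OPERATORS:
            if source.startswith(op, i):
                tokens.append(op)
                i += len(op)
                break
        else:
            # No operator matched at this position.
            raise ValueError(f"unexpected character {ch!r} at offset {i}")
    return tokens

print(tokenize("A = b * 2;"))  # ['A', '=', 'b', '*', '2', ';']
print(tokenize("A *= b;"))     # ['A', '*=', 'b', ';']
```

If the list were ordered with "*" first, the second call would split '*=' into two tokens, which is exactly the kind of ordering mistake the rules have to avoid.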

The next step is parsing. This means interpreting the code into an ordered
series of expressions and structures. Parsing yields a semantic tree
(usually called a parse tree or abstract syntax tree): a memory structure
that represents the code "as it is".

In compiler development, this semantic tree is traversed by the code
generator to generate the initial object code. That object code is then
processed in repeated passes to optimize, link, reduce, and collect
together other resources and necessary elements (like initialized memory
header blocks and register settings).

So, when you are asking about syntax highlighting... why did I go into
compiler theory? Because the first two steps are nearly identical.

To do syntax highlighting, you have to perform the lexical analysis and the
parsing to create a semantic tree. However, your semantic tree has to be a
little more forgiving than a typical compiler would allow: if you are doing
this to create an add-in to a text editor, the code is being written as you
go, so things like variables without a declaration and unterminated quoted
strings cannot be allowed to send your parser off the mathematical deep
end. Also, in syntax highlighting you care about retaining comments, whereas
a compiler discards them immediately.

Also, your syntax highlighter will probably need to be able to detect
methods on Framework objects. For example, it would need to know that the
expression
return sbStuff.ToString();
involves a keyword (return), a variable (sbStuff) of a particular type, and
a method on that type (ToString) which can take many forms, one of which is
the form shown (no parameters). This will require the ability to reflect
through the .NET Framework in an efficient manner, something that is beyond
my experience to help you with.

In order to do this, you will have to get your parsing semantic tree and
"decorate" it with indicators that illustrate the "classification" of the
token... in other words, do you believe the token to be: a reserved word, a
constant, a method or property call, an operator (like the '.' above), a
line terminator (the ';'), etc.

Then with your decorated semantic tree, you can examine your code segment in
your highlighting area and determine what color or highlight to apply to
each object, based upon the decoration applied to the object in your
semantic tree.
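As a rough illustration of that last step, the sketch below takes tokens that have already been decorated with a classification and looks up a color for each one. It is Python rather than C#, the tokens are decorated by hand rather than by a real parser, and the `Token` class, the `STYLE` table and the color names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    kind: str  # e.g. "keyword", "identifier", "method", "operator", "terminator"

# Hypothetical color scheme; any mapping from classification to style works.
STYLE = {
    "keyword": "blue",
    "identifier": "black",
    "method": "teal",
    "operator": "gray",
    "terminator": "gray",
}

def colorize(tokens):
    """Pair each token's text with the color chosen for its classification."""
    return [(t.text, STYLE.get(t.kind, "black")) for t in tokens]

# Hand-decorated tokens for: return sbStuff.ToString();
decorated = [
    Token("return", "keyword"),
    Token("sbStuff", "identifier"),
    Token(".", "operator"),
    Token("ToString", "method"),
    Token("(", "operator"),
    Token(")", "operator"),
    Token(";", "terminator"),
]

for text, color in colorize(decorated):
    print(f"{text!r:12} {color}")
```

The hard part is producing the decoration, not applying it; once every token carries a classification, the coloring itself is a table lookup.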

Now you know why no one wanted to answer you.

This is a very brief description of one of the more difficult college
courses I had: compiler theory and the implementation of Finite State
Automata. I loved it (excellent professors... good school... go Vols!).

So if you want to learn how to do what you are doing, you will need to
become pretty good at lexical analysis (not the hardest topic, but something
that does require a good bit of math) and at the data structures needed to
create a working parser. A project of this nature would have been way beyond
what a college course would typically expect. In other words, while most of
the folks who pass this course could probably create the lex and parse
steps, there's no way that there would have been time, in a
three-and-a-half month semester, for the students to learn the material,
complete the assignment, and have the professor grade 20 submissions!

There are some tools that can help. A long time ago, the researchers at
Bell Labs put out two nice utilities: lex and yacc (a lexical analyzer
generator, and 'yet another compiler compiler'). From their inspiration on
Unix, hundreds of utilities have been written over the years to do similar
things. The input to tools like these is your syntax: the token rules are
written as regular expressions, and the grammar is written in a notation
called BNF (Backus-Naur Form), which drives the parsing. You may be able to
take a "syntax-highlighting" text editor, which does all this for you, and
simply supply the BNF for C# and a component for reflecting on the
framework... that would be nice. Take a look at SourceForge or GotDotNet for
some ideas.

If you can find one of these utilities, that would be a good starting point
for developing your highlighting parser. There may be online courses and
tutorials on lexical analysis and language parsing... I do not know. You
will probably need a tutorial on BNF as well.

Of course, you have to decide, right now, if you want to go this deep. If
you do, many folks here will encourage you (myself included).

Good Luck,
--- Nick Malik
Solutions Architect

P.S. Regular expressions are NOT going to do this for you. Set that notion
aside. You can use Regex for some simple lexical analysis, that's it.
Parsing cannot be reasonably done (and debugged) with regex.
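For the "simple lexical analysis" case, a sketch along the following lines may help. It uses Python's `re` module as a stand-in (the same idea should carry over to the .NET Regex class, though that is untested here), and the keyword list, group names and the `lex` function are assumptions made for the example. One compiled pattern tries the comment alternative first, so a double-quote inside /* */ is never seen as the start of a string, and unclosed comments or strings simply run to the end of the text or line instead of failing.

```python
import re

KEYWORDS = {"select", "from", "where"}  # hypothetical keyword list

TOKEN_RE = re.compile(
    r"""
      (?P<comment>/\*.*?(?:\*/|\Z))   # block comment, or to end of text if unclosed
    | (?P<string>"[^"\n]*(?:"|$))     # quoted string, or to end of line if unclosed
    | (?P<word>[A-Za-z_]\w*)          # keyword or identifier
    | (?P<other>\S)                   # any other non-blank character
    """,
    re.VERBOSE | re.DOTALL | re.MULTILINE,
)

def lex(text):
    """Yield (kind, start, end) spans in one pass; whitespace is skipped."""
    for m in TOKEN_RE.finditer(text):
        kind = m.lastgroup
        if kind == "word" and m.group().lower() in KEYWORDS:
            kind = "keyword"
        yield kind, m.start(), m.end()

sample = 'select /* a " quote */ "name" from t'
for kind, start, end in lex(sample):
    print(kind, repr(sample[start:end]))
```

This only classifies spans of text. It does no parsing, which is the part regular expressions are not suited for.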


 
Guest

Nick,

Thank you very much for the answer. However, I think I was a little broad in what I was asking for. I do, eventually, want to get into the nuts and bolts of compilers and lexical analysis, but I need to start a little lighter. Let me explain where I am and why I asked the question.

I am simply trying to highlight a simplistic language which only has keywords, multiline comments and strings (something like TSQL). I have already created a working syntax highlighter: it has a lexical analyzer (though it is clunky) that produces a string of tokens. My problem is that it is extremely slow. It currently takes about 17 seconds to parse about 35 printed pages of code. I am under the impression that regular expressions can make this process much faster but, as you stated, I am probably wrong. I asked about doing the highlighting with regular expressions because, in my understanding, their main purpose in programming is to scan large amounts of text and make matches, replacements, etc.

How would you suggest I go about learning how to make the syntax highlighting of something as simplistic as what I described "fast as lightning"?
 
phoenix

Hello,

There are a couple of open source IDEs which offer the things you're looking
for. Maybe you should check them out to see how they do it.

For C# : #develop (http://www.icsharpcode.net/OpenSource/SD/Default.aspx)
For C/C++ : CodeMax (somewhere on the yahoo groups there is an open source
version)

I would think that a lot of the IDEs don't keep track of everything but
only what's visible, and just color whatever is inside the client area. So
they probably don't color 35 pages at once.
One other thing is that the RegEx implementation in .NET is very slow
compared to the implementations in other languages.
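Here is a rough sketch of what coloring only the visible area could look like. It is a guess at the general approach, not how #develop or CodeMax actually do it, and it is written in Python with made-up function names. The idea is to keep one cheap piece of state per line (does this line start inside an open /* comment?) and then run the expensive tokenizing only on the lines currently on screen.

```python
def comment_state_at_line_starts(lines):
    """For each line, record whether it begins inside an open /* comment."""
    states = []
    in_comment = False
    for line in lines:
        states.append(in_comment)
        i = 0
        while i < len(line):
            if in_comment:
                end = line.find("*/", i)
                if end == -1:
                    break  # comment continues onto the next line
                in_comment = False
                i = end + 2
            elif line.startswith("/*", i):
                in_comment = True
                i += 2
            elif line[i] == '"':
                # Skip a string so that /* inside quotes is not a comment.
                close = line.find('"', i + 1)
                i = len(line) if close == -1 else close + 1
            else:
                i += 1
    return states

def lines_to_color(states, first_visible, visible_count):
    """Return (line_index, starts_in_comment) for the visible window only."""
    last = min(first_visible + visible_count, len(states))
    return [(n, states[n]) for n in range(first_visible, last)]

lines = ["select a", "/* start", 'still "comment', "end */ from t", "where x"]
states = comment_state_at_line_starts(lines)
print(states)                        # [False, False, True, True, False]
print(lines_to_color(states, 2, 2))  # [(2, True), (3, True)]
```

In a real editor the state list would presumably be updated only from the edited line downward, and only until the state stops changing, rather than recomputed for the whole file on every keystroke.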

Yves

 
Guest

Thanks for the links and the .NET RegEx info. I will check it out and see what I can muster out of all this. (only 10MB for the source) :)
 
