Need Help with Parsing Primitives

Jonathan Wood

I'm writing a generic parsing class and have been thinking about how to
implement four low-level methods that the class will need. But, I'm kind of
stuck.

[Function 1] Scan for the first occurrence of a character.
I need to support case-insensitive scans so String.IndexOf(char) won't work.
However, it looks like I can convert the character to a string and then use
the String.IndexOf(string, StringComparison) overload.

[Function 2] Scan for the first occurrence of any of several characters.
Again, I need to support case-insensitive scans so String.IndexOfAny() won't
work. I can again convert the character to a string but then the scan
becomes very inefficient as I need to scan for all strings and compare which
one came first.

[Function 3] Scan for the first character that is not equal to a specified
character.
Similar problems trying to support case-insensitive scans.

[Function 4] Scan for the first character that is not equal to any of
several characters.
Similar problems trying to support case-insensitive scans.

I started by just writing a loop that compared each character. But since
there's no satisfactory way to do this ignoring case, I have to convert
target characters to string. And the only way I can think of to find the
first occurrence of multiple strings is to search for every one of them and
see which came first, which is horribly inefficient when the text is long.
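
To show where I'm at, here's roughly my first attempt at [Function 2],
using a char-by-char loop (with ToUpperInvariant standing in for a
comparison I'm not confident is correct across cultures):

// First attempt at [Function 2]: a hand-rolled, case-insensitive
// IndexOfAny. The ToUpperInvariant comparison is the part I don't trust.
static int IndexOfAnyIgnoreCase(string text, char[] chars, int startIndex)
{
    for (int i = startIndex; i < text.Length; i++)
    {
        char c = char.ToUpperInvariant(text[i]);
        foreach (char target in chars)
        {
            if (c == char.ToUpperInvariant(target))
                return i;
        }
    }
    return -1;
}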

Could anyone think of a better approach?

Thanks.

Jonathan
 
Peter Duniho

Jonathan said:
I'm writing a generic parsing class and have been thinking about how to
implement four low-level methods that the class will need. But, I'm kind
of stuck.

[...four proposed functions...]

I started by just writing a loop that compared each character. But since
there's no satisfactory way to do this ignoring case, I have to convert
target characters to string. And the only way I can think of to find the
first occurrence of multiple strings is to search for every one of them
and see which came first, which is horribly inefficient when the text is
long.

Could anyone think of a better approach?

Not off the top of my head. However, it would help if you could
describe the parsing implementation you're following in which this kind
of scanning is required. Parsing does not usually involve searching for
specific characters repeatedly in a string.

At the very least, it would be helpful if you could elaborate on these
stated requirements that you be able to "find the first occurrence that
is a given character(s)" and "to find the first occurrence that's not a
given character(s)".

If you can elaborate on what exactly you're parsing, that could be
helpful too.

Pete
 
Peter Duniho

Peter said:
Jonathan said:
I'm writing a generic parsing class and have been thinking about how
to implement four low-level methods that the class will need. But, I'm
kind of stuck.

[...four proposed functions...]

I started by just writing a loop that compared each character. But
since there's no satisfactory way to do this ignoring case, I have to
convert target characters to string. And the only way I can think of
to find the first occurrence of multiple strings is to search for
every one of them and see which came first, which is horribly
inefficient when the text is long.

Could anyone think of a better approach?

Not off the top of my head. [...]

Actually, maybe I can. How long are the input strings? And how many
times do you need to do this kind of search for each input string?

Is it feasible to generate an index by character? It could be that
scanning the string once to build an index, and then looking up
individual characters to find the lowest position, would be more
efficient than repeatedly searching the entire string over and over for
multiple characters.
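
Something like this is what I have in mind (just a sketch with
hypothetical names, and assuming the usual using directives):

// One pass over the input: for each character, record every position
// where it occurs (the position lists come out in ascending order).
static Dictionary<char, List<int>> BuildIndex(string text)
{
    var index = new Dictionary<char, List<int>>();
    for (int i = 0; i < text.Length; i++)
    {
        List<int> positions;
        if (!index.TryGetValue(text[i], out positions))
            index[text[i]] = positions = new List<int>();
        positions.Add(i);
    }
    return index;
}

// The first occurrence of any of the given characters at or after
// 'start' is then the lowest result of a few binary searches.
static int FirstOfAny(Dictionary<char, List<int>> index, char[] chars,
    int start)
{
    int best = -1;
    foreach (char c in chars)
    {
        List<int> positions;
        if (!index.TryGetValue(c, out positions))
            continue;
        int pos = positions.BinarySearch(start);
        if (pos < 0) pos = ~pos;   // index of first position >= start
        if (pos < positions.Count && (best < 0 || positions[pos] < best))
            best = positions[pos];
    }
    return best;
}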

If I think of anything else, I'll follow up. Still, I think that if
you're just doing parsing, there's something fishy about your chosen
implementation. It would still be helpful if you could explain where
that particular implementation came from and why you feel it's
necessary.

Pete
 
Jonathan Wood

Peter said:
At the very least, it would be helpful if you could elaborate on these
stated requirements that you be able to "find the first occurrence that is
a given character(s)" and "to find the first occurrence that's not a given
character(s)".

Forgive me, as I'm not understanding how these are unclear. If it helps,
here are some examples (of many possible examples):

[Function 1] Find the index of the next 'a' character. (case-insensitive
IndexOf)

[Function 2] Find the index of the next 'a', 'b', or 'c' character.
(case-insensitive IndexOfAny)

[Function 3] Find the index of the next character that is not 'a'.

[Function 4] Find the index of the next character that is not 'a', 'b', or
'c'.

If you can elaborate on what exactly you're parsing, that could be helpful
too.

I thought I was very specific that I'm working on a generic class right now.
It makes no requirements on what is being parsed. In my current project,
I'll be using it to parse HTML, but it could just as easily be used to parse
source code or something else entirely.

Could anyone think of a better approach?

Not off the top of my head. [...]

Actually, maybe I can. How long are the input strings? And how many
times do you need to do this kind of search for each input string?

The strings are potentially very long. Many scans would be done, but they
would be done sequentially. That is, scanning would start from the beginning
of the string and, once a scan finds what it's looking for, the next scan
would continue from that location. One potential use (of many) of this
class would be a programming language parser that needs to identify and
extract each token from the source code.

Is it feasible to generate an index by character? It could be that
scanning the string once to build an index, and then looking up
individual characters to find the lowest position, would be more
efficient than repeatedly searching the entire string over and over for
multiple characters.

I'm not sure if I fully understand this. As pointed out, the string is
always scanned from the "current position" to the end of the file, and the
current position advances as each token is identified. At any rate, how
would you use this to solve the case-insensitive issue? If you find the
index of an 'A' and then later find the index of an 'a', how would you use
those indices to locate 'A' ignoring case?

Thanks.

Jonathan
 
Peter Duniho

Jonathan said:
Peter said:
At the very least, it would be helpful if you could elaborate on these
stated requirements that you be able to "find the first occurrence that
is a given character(s)" and "to find the first occurrence that's not
a given character(s)".

Forgive me, as I'm not understanding how these are unclear. If it helps,
here are some examples (of many possible examples):

[Function 1] Find the index of the next 'a' character. (case-insensitive
IndexOf)

[Function 2] Find the index of the next 'a', 'b', or 'c' character.
(case-insensitive IndexOfAny)

[Function 3] Find the index of the next character that is not 'a'.

[Function 4] Find the index of the next character that is not 'a', 'b',
or 'c'.

It's not that the functions you're describing are unclear themselves.
What's unclear is the source of the requirement that those functions be
implemented at all.

I thought I was very specific that I'm working on a generic class right
now. It makes no requirements on what is being parsed. In my current
project, I'll be using it to parse HTML, but it could just as easily be
used to parse source code or something else entirely.

Unless you are a grad student, I think you might want to rethink this
goal. What you're asking to do is a highly non-trivial problem, and that
is in fact why people write specific parsers for specific tasks.

HTML is relatively simple to parse. If what you need to parse right now
is just HTML, you should probably focus on that. Implementing an HTML
parser will take a fraction of the time it would take to implement
something that is truly general purpose. (Or you could even just use
one of the many HTML parsers available.)

In fact, a truly general purpose parser using current technology will
essentially be a number of specific kinds of parsers in one. The
specific parsing techniques necessary depend so much on the exact
grammar of whatever you're parsing that, AFAIK, no one has come up with
a general-purpose way of describing grammars such that a single parsing
implementation can take care of any arbitrary grammar.

Now, from your general description, it seems like maybe you're not
really even at the point where you're trying to parse the input.
Instead, it sounds like you're dealing with the lexical analysis part.
And that's generally a simpler thing to generalize, I think.

But even there, you need to know what kind of parser you're going to
feed with the lexical output. I'm not aware of many parsers that take
as input individual characters (*), and for tokenizing input, the kind
of repeated scanning for specific characters is not something I've ever
seen.

(*) (an obvious exception being those parsers based on finite state
machines; and these can handle case-insensitivity just fine, as long as
you can represent your token choices in a way that can be translated
into the finite state machine; e.g. provide all possible variations, or
at least some subset that can be used to automatically generate all
possible variations, so that the FSM can be built accordingly).

If you have some specific reference you're following that has suggested
this particular implementation, where you are scanning repeatedly
character-by-character, it would be helpful if you could provide a link
or other pointer to that reference so we can understand what you're doing.

On the other hand, if this technique is something you've invented on
your own, I would strongly suggest you research existing lexical
analysis and parsing technology (a field that's been around practically
as long as computers have) before trying to reinvent the wheel. It
would probably involve a lot fewer headaches. :)

Could anyone think of a better approach?

Not off the top of my head. [...]

Actually, maybe I can. How long are the input strings? And how many
times do you need to do this kind of search for each input string?

The strings are potentially very long. Many scans would be done, but they
would be done sequentially. That is, scanning would start from the
beginning of the string and, once a scan finds what it's looking for, the
next scan would continue from that location. One potential use (of many)
of this class would be a programming language parser that needs to
identify and extract each token from the source code.

There is definitely no need for that kind of scanning for parsing
programming languages. Programming languages have clear rules for
delimiting tokens, and generating the tokens is a simple affair of just
scanning through a string, breaking it up as needed. The String.Split()
method would be nearly all you need, if it weren't for the fact that most
programming languages allow you to omit spaces between tokens where
there's no ambiguity (i.e. in C, you can't write "inti" and have the
lexer figure out the tokens are "int" and "i", but you can write "int
i,j" and it will know your tokens are "int", "i", ",", and "j"…in fact,
some implementations might just use the "," as a delimiter, depending on
how the rest of the analysis is handled).

I'm not sure if I fully understand this. As pointed out, the string is
always scanned from the "current position" to the end of the file, and
the current position advances as each token is identified. At any rate,
how would you use this to solve the case-insensitive issue? If you find
the index of an 'A' and then later find the index of an 'a', how would
you use those indices to locate 'A' ignoring case?

Your index would treat all characters that are equivalent ignoring case
as the same indexed entity. So your index would have a single entry,
corresponding to both 'A' and 'a'. Of course, this means you have to
have some rules for mapping a given character to a specific entity.

Again, if you're willing to make some restrictions on the input – for
example, being case-insensitive only for characters in the ASCII range –
then this mapping is trivial. Otherwise, you need some way of
representing the mapping, and converting individual characters according
to that representation (e.g. a hashtable containing all Unicode
characters that have case, mapping to some specific value that
identifies all characters that map to that value).
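
As a rough sketch of the non-restricted version (with ToUpperInvariant
standing in for whatever mapping rules you settle on):

// Build the mapping once: every character that has case maps to some
// canonical value; characters missing from the map just map to themselves.
static Dictionary<char, char> BuildCaseMap()
{
    var map = new Dictionary<char, char>();
    for (int c = char.MinValue; c <= char.MaxValue; c++)
    {
        char ch = (char)c;
        char canonical = char.ToUpperInvariant(ch);
        if (canonical != ch)
            map[ch] = canonical;
    }
    return map;
}

The index I described earlier would then simply be keyed on the mapped
value, so 'A' and 'a' share a single entry.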

Pete
 
Jonathan Wood

Peter said:
There is definitely no need for that kind of scanning for parsing
programming languages. Programming languages have clear rules for
delimiting tokens, and generating the tokens is a simple affair of just
scanning through a string, breaking it up as needed. The String.Split()
method would be nearly all you need, if it weren't for the fact that most
programming languages allow you to omit spaces between tokens where
there's no ambiguity (i.e. in C, you can't write "inti" and have the lexer
figure out the tokens are "int" and "i", but you can write "int i,j" and
it will know your tokens are "int", "i", ",", and "j"…in fact, some
implementations might just use the "," as a delimiter, depending on how
the rest of the analysis is handled).

String.Split? How would you determine the meaning of text like
"c+=(b-2)*47;" using String.Split?
Your index would treat all characters that are equivalent ignoring case as
the same indexed entity. So your index would have a single entry,
corresponding to both 'A' and 'a'. Of course, this means you have to have
some rules for mapping a given character to a specific entity.

Yeah, that would get a bit complex if I didn't limit it to a particular
language or range, etc.

As I mentioned in the other thread, just being case-sensitive is sounding
better and better. If the caller wants to be case-insensitive, it'll just
have to specify both versions of each character.

Jonathan
 
Peter Duniho

Jonathan said:
String.Split? How would you determine the meaning of text like
"c+=(b-2)*47;" using String.Split?

You don't determine the _meaning_. And I wrote "_nearly_ all you need".
But, your current line of questioning seems to involve the tokenizing,
not the actual parsing. And if the input were "c += ( b - 2 ) * 47",
String.Split() could tokenize that trivially.

Of course, for something like that, you can't count on the expression
having those spaces. Hence, "nearly".
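
That is, purely as an illustration:

string input = "c += ( b - 2 ) * 47";
string[] tokens = input.Split(' ');
// tokens: { "c", "+=", "(", "b", "-", "2", ")", "*", "47" }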

Pete
 
Fred Mellender

Just to elaborate on what Peter said:

Here is a fragment of the "lexer" that I wrote for a parser. It depends on
the fact that no token can span a line. This allows you to read the input a
line at a time. An exception for comments is allowed.

The idea is to have a Regex that identifies each token. Once the token is
identified, the input line is reduced to what is left. You can stop when
the line's length is reduced to 0.

The Regex rules for identifying tokens are put into an array. This is
scanned, one rule at a time, to identify the next token. It is allowed that
a token can have multiple rules that identify it, but only the first rule
that results in a match is used to identify the token. Hence the most
general rules are placed last in the array.

Good luck with your project!

--------------------------------------------------------------------------

private Token getToken(ref string inputLine, out LexRule ruleUsed)
{
    //Given a string (inputLine) match it with a reg expression, and form
    //a token. Then, reduce the inputLine by removing the text that
    //was used to form the token. If the inputLine is exhausted, and
    //we cannot return a token, return null. After token text is
    //removed from inputLine, trim the whitespace from the beginning.
    //If we are processing a rule with a scanTo regex, we do not
    //trim the whitespace, since the caller will capture all
    //characters through the end-delimiter token.

    //N.B. the lex regex are applied in the order they were given in
    //the lexRule file. Hence these must be in an order that disallows
    //a prior rule's match from always preventing a subsequent
    //rule's success.

    ruleUsed = null;
    for (int i = 0; i < rules.Count; i++)
    {
        string ruleRegex = rules[i].regEx;
        Match match;
        Regex regexObject;
        try
        {
            regexObject = new Regex(ruleRegex);
            match = regexObject.Match(inputLine);
        }
        catch (Exception ex)
        {
            //probably, regEx (from rule) was poorly formed:
            //a user error. Should not happen.
            throw new LexException(
                "Lex03: Regex exception: " + ex.Message);
        }

        if (match.Success && match.Index == 0 && match.Length > 0)
        {
            //regEx matches at first character of input.
            Token tok = factory.newToken(rules[i].name,
                inputLine.Substring(0, match.Length), lineNumber,
                nextSeq++);

            if (match.Length >= inputLine.Length)
                inputLine = "";
            else
                inputLine = inputLine.Substring(match.Length);
            //reduce the input line to what's left

            if (rules[i].scanTo == null) //caller handles the scan
                inputLine = inputLine.TrimStart(new char[] { ' ', '\t' });

            ruleUsed = rules[i];
            tok.regex = regexObject;
            if (tok.name == "!")
                //a comment, or other text to skip over
                tok.type = TokenType.SKIP; //caller must handle the skip

            return tok;
        }
    }

    return null;
}
-----------------------------------------------------------------------------------
 
Jonathan Wood

Peter said:
You don't determine the _meaning_. And I wrote "_nearly_ all you need".
But, your current line of questioning seems to involve the tokenizing,
not the actual parsing. And if the input were "c += ( b - 2 ) * 47",
String.Split() could tokenize that trivially.

Obviously. But that wasn't the question, was it?

Jonathan
 
Jonathan Wood

Fred said:
Just to elaborate on what Peter said:

Here is a fragment of the "lexer" that I wrote for a parser. It depends
on the fact that no token can span a line. This allows you to read the
input a line at a time. An exception for comments is allowed.

The idea is to have a Regex that identifies each token. Once the token
is identified, the input line is reduced to what is left. You can stop
when the line's length is reduced to 0.

Thanks for the input. I had seen regular expressions used in some of the
code I was checking out. That would not be my first choice and I'm
frustrated that there isn't a simple way to compare two characters without
regard to case.

But your example is interesting and I'll study it further.

Thanks again.

Jonathan
 
Jesse Houwing

* Jonathan Wood wrote, On 16-1-2010 16:10:
Thanks for the input. I had seen regular expressions used in some of the
code I was checking out. That would not be my first choice and I'm
frustrated that there isn't a simple way to compare two characters
without regard to case.

It is very simple to do, but you need to convert them to string first.
The whole problem with characters is that, depending on the character
set and the encoding of the actual data, an actual character could span
multiple bytes. And one character might or might not be the same
(regardless of case) depending on the current Culture and the current
encoding. So whether a char value is or is not the same as another char
value regardless of case cannot be determined without specifying how the
character is encoded, whether or not it consists of multiple chars, and
so much more.

That is why comparison is done as a string. Because when you have put
the char into a string (with the right encoding and such) the chars of
that string can be compared with all those things in mind.

Coming back to those comparisons:
1) You could use String.Compare with a CompareOptions value, like so:

CompareOptions compareOptions = CompareOptions.IgnoreCase
    | CompareOptions.IgnoreWidth
    | CompareOptions.IgnoreNonSpace;
string.Compare(strA, strB, CultureInfo.CurrentCulture, compareOptions) == 0


2) You should definitely consider regular expressions. Many parsers
(generated or not) rely at some point on a form of regular expressions
to look at the data being parsed. Regular expressions allow you to
search case-insensitively with ease and also take the whole Culture
stuff into consideration.
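
For example, the IndexOfAny-style scan from your earlier posts collapses
to a single call (a sketch; substitute whatever character class you are
actually scanning for):

using System.Text.RegularExpressions;

// First occurrence of 'a', 'b' or 'c', ignoring case:
Match m = Regex.Match(input, "[abc]",
    RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
int index = m.Success ? m.Index : -1;

// The "first character that is NOT one of these" variants are just a
// negated class: "[^abc]".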

Another note: if parsing HTML is your main goal, stop now and look at
the HTMLAgilityPack on CodePlex, unless you're trying to build this in
order to learn how to build your own parser. If that is your goal, go
to the nearest library or bookstore and get a book on building compilers
and lexers/parsers. Any book will do, as learning how it's generally done
is your first goal. Applying that to the .NET framework should then be easy.

If you want to build just a language parser, and not everything around
it, go download the Visual Studio SDK and try MPLex/MPPG, or find an
open source Parser/Lexer:

http://msdn.microsoft.com/en-us/library/bb165963.aspx
http://en.wikipedia.org/wiki/List_of_C_Sharp_lexer_generators

Kind Regards,
 
Fred Mellender

I did not read your original post carefully, nor the follow-ups, but maybe
this answers your question (?)

char char1 = 'a', char2 = 'B';
if (Char.ToLower(char1) == Char.ToLower(char2))
{
    ; //characters are the same, except for case...
}

If you are writing a parser I suggest you use the web or (better) a book to
study up on the various architectures for doing so. There is a lot of
history and much useful information available. Most designs I am aware of
divide the parsing phase from the "lexical analysis" phase. The latter
transforms the input stream into a list of tokens before the actual parsing
begins. Although it is possible to combine these phases (and for some
problems it is necessary), most of the time you can keep them completely
separate. I don't know any efficient way of writing the lexical phase
without using (or writing from scratch) a regular expression identifier.

Once you have the list of tokens there are at least 2 approaches to the
parsing phase: top down (e.g. "recursive descent"), and bottom up (e.g.
"shift-reduce").

The fragment I included in my post is part of a larger project which you can
download for free:

http://sites.google.com/site/fredm/linguist-parsing-system
 
Peter Duniho

Jonathan said:
Obviously. But that wasn't the question was it?

What's your point?

I made clear in the context of my mention of String.Split() that it was
for illustrative purposes only and not a suggestion for applying to your
own problem. So, why does it matter that "that wasn't the question"?
Can you not tolerate illustration for the point of illustration, without
the illustration being a directly useful implementation detail for your
own problem?

Pete
 
Jonathan Wood

My point was: String.Split() is absolutely useless for tokenizing any sort
of human-created text.

Jonathan
 
Jonathan Wood

Jesse said:
It is very simple to do, but you need to convert them to string first. The
whole problem with characters is that, depending on the character set and
the encoding of the actual data, an actual character could span multiple
bytes. And one character might or might not be the same (regardless of
case) depending on the current Culture and the current encoding. So whether
a char value is or is not the same as another char value regardless of case
cannot be determined without specifying how the character is encoded,
whether or not it consists of multiple chars, and so much more.

This was mentioned before. Obviously, it's less efficient to have to convert
every char to a string object. But it sounds like that's the only way to
address these issues. I appreciate the additional explanation.

However, what I'm doing then becomes a little more complex. For example, one
function needs to duplicate String.IndexOfAny(), which doesn't work with
strings. About all I can think of is scanning for all strings I'm searching
for and then seeing which came first, which isn't too efficient.
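
The best I've come up with so far is a single forward pass that tests
each candidate string at each position, something like this (using
OrdinalIgnoreCase for simplicity; it still doesn't feel very efficient):

// Case-insensitive IndexOfAny for strings: walk forward once, testing
// each candidate in place at every position.
static int IndexOfAnyString(string text, string[] candidates, int startIndex)
{
    for (int i = startIndex; i < text.Length; i++)
    {
        foreach (string candidate in candidates)
        {
            if (string.Compare(text, i, candidate, 0, candidate.Length,
                               StringComparison.OrdinalIgnoreCase) == 0)
                return i;
        }
    }
    return -1;
}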

Coming back to those comparisons:
1) You could use String.Compare with a CompareOptions value, like so:

Yeah, that issue is much easier once I'm working with strings instead of
char.

2) You should definitely consider regular expressions. Many parsers
(generated or not) rely at some point on a form of regular expressions to
look at the data being parsed. Regular expressions allow you to search
case-insensitively with ease and also take the whole Culture stuff into
consideration.

Yeah, definitely something to think about.
Another note: if parsing HTML is your main goal, stop now and look at
the HTMLAgilityPack on CodePlex, unless you're trying to build this in
order to learn how to build your own parser. If that is your goal, go to
the nearest library or bookstore and get a book on building compilers and
lexers/parsers. Any book will do, as learning how it's generally done is
your first goal. Applying that to the .NET framework should then be easy.

Well, I wrote an interpreter a long time ago in C++. But it didn't worry
about Unicode and cultures, etc., etc.

I'll take a look at HTMLAgilityPack. I suspect it will be way more than I
need for my current project but it might be interesting to see how it's
implemented.

Thanks.

Jonathan
 
Jonathan Wood

Fred said:
I did not read your original post carefully, nor the follow-ups, but maybe
this answers your question (?)

char char1 = 'a', char2 = 'B';
if (Char.ToLower(char1) == Char.ToLower(char2))
{
    ; //characters are the same, except for case...
}

Surprisingly for me, I read comments that suggested this may not be
reliable. For example, an exchange on StackOverflow I referenced had a
suggestion to compare the results of ToUpper() but then to also try
comparing the results of ToLower() if the first comparison didn't match.

If you are writing a parser I suggest you use the web or (better) a book
to study up on the various architectures for doing so. There is a lot of
history and much useful information available. Most designs I am aware of
divide the parsing phase from the "lexical analysis" phase. The latter
transforms the input stream into a list of tokens before the actual
parsing begins. Although it is possible to combine these phases (and for
some problems it is necessary), most of the time you can keep them
completely separate. I don't know any efficient way of writing the
lexical phase without using (or writing from scratch) a regular expression
identifier.

Yeah, I went through a few books in the past when I wrote an interpreter in
C++. It's just the character comparison that's an issue for me now.

The fragment I included in my post is part of a larger project which you
can download for free:

http://sites.google.com/site/fredm/linguist-parsing-system

Cool. Thanks.

Jonathan
 
Peter Duniho

Jonathan said:
My point was: String.Split() is absolutely useless for tokenizing any
sort of human-created text.

Your "point" is false. String.Split() is frequently used in simple
tokenizing situations, even those where humans create the input from
scratch. Humans are perfectly capable of creating text that complies
with the constraints of these simple situations, and frequently do.

It's not applicable for things like C or HTML, but to say it's
"absolutely useless" suggests you don't really comprehend the full
application gamut of the process of tokenizing.

Pete
 
Jesse Houwing

* Fred Mellender wrote, On 16-1-2010 16:59:
I did not read your original post carefully, nor the follow-ups, but
maybe this answers your question (?)

char char1 = 'a', char2 = 'B';
if (Char.ToLower(char1) == Char.ToLower(char2))
{
    ; //characters are the same, except for case...
}

If you go this way, make sure you use ToUpper to compare two characters
for equality regardless of case. Some characters don't round-trip to the
same base character when you use ToLower, but all characters work when
you do ToUpper.
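
In other words, something like this (invariant culture used here just to
keep the sketch simple):

static bool CharEqualsIgnoreCase(char a, char b)
{
    return Char.ToUpperInvariant(a) == Char.ToUpperInvariant(b);
}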

There's even a string-comparison rule for that in FxCop; further
explanation can also be found there.

http://msdn.microsoft.com/en-us/library/bb386042(VS.100).aspx
 
