Nick said:
First off, I apologize if my post wasn't clear. I think that we may be
discussing the same thing using slightly different words.
ok
It does matter how long a token is in this case, since the operator ||= can
be scanned in many ways:
1) It can be scanned as three valid tokens t('|') t('|') t('=')
2) It can be scanned as two valid tokens t('||') t('=')
3) It can be scanned as two valid tokens t('|') t('|=')
4) It can be scanned as a single token t('||=')
no, it works with states. An NFA is a state machine (a nondeterministic
one). Scanning a text stream for tokens is just a state machine, albeit a
big one. If the input is a||=b (let's make it complicated, as a ||= b is
easy: whitespace delimits the operands from the operator), the states
could be:
.a||=b
-> identifier path start
a.||=b
-> identifier end, a is identifier or keyword
-> operator path start
a|.|=b
-> operator path continue
a||.=b
-> operator path continue
a||=.b
-> operator path end. ||= is operator
-> identifier path start
it doesn't have to look ahead more than 1 character. The state machine
automatically follows the path of the right operator recognition, based
on the current state and the current input character. If it has seen a
'|' and the current input is again '|', it doesn't recognize '|' twice,
it recognizes a single '||'. As there are two tokens that start with
'||', it doesn't have a unique match yet and thus can't create a token
yet. If the '=' had been omitted, the token would have been '||', but it
wasn't, so the token is '||='.
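In concrete terms (a minimal Java sketch rather than C#, with token names
and classes that are purely illustrative, not the mono compiler's or any
real lexer's code), that state machine for the '|' family boils down to
something like this; it collapses the four possibilities listed above
into the single longest match, deciding at every step from the current
state and the current input character only:

import java.util.ArrayList;
import java.util.List;

// Hypothetical token kinds, for illustration only.
enum Kind { IDENT, PIPE, PIPE_ASSIGN, OR, OR_ASSIGN }

final class Token {
    final Kind kind; final String text;
    Token(Kind kind, String text) { this.kind = kind; this.text = text; }
    @Override public String toString() { return kind + "(" + text + ")"; }
}

final class PipeScanner {
    // Scans identifiers and the '|' operator family using longest match.
    static List<Token> scan(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isLetter(c)) {        // identifier path start
                int start = i;
                while (i < input.length()
                        && Character.isLetterOrDigit(input.charAt(i))) i++;
                tokens.add(new Token(Kind.IDENT, input.substring(start, i)));
            } else if (c == '|') {              // operator path start
                i++;                            // consumed '|'
                if (i < input.length() && input.charAt(i) == '|') {
                    i++;                        // consumed '||'
                    if (i < input.length() && input.charAt(i) == '=') {
                        i++;                    // consumed '||='
                        tokens.add(new Token(Kind.OR_ASSIGN, "||="));
                    } else {
                        tokens.add(new Token(Kind.OR, "||"));
                    }
                } else if (i < input.length() && input.charAt(i) == '=') {
                    i++;                        // consumed '|='
                    tokens.add(new Token(Kind.PIPE_ASSIGN, "|="));
                } else {
                    tokens.add(new Token(Kind.PIPE, "|"));
                }
            } else {
                i++;                            // skip whitespace etc. in this sketch
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [IDENT(a), OR_ASSIGN(||=), IDENT(b)]: possibility 4, the
        // longest match, and nothing else.
        System.out.println(scan("a||=b"));
    }
}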
The logic needed to get to the third scan is more difficult to produce.
This is not a simple (deterministic) finite state automaton. As you
pointed out, it is non-deterministic.
yes, these state machines are not easy to write by hand; that's why
there are tools like Lex.
I have created a lexical analyzer for a production system (not a student
exercise) using symbol tables. I personally found it challenging to debug
any changes to the symbol table for two-symbol lookahead, and three-symbol
lookahead introduced so much concern that I withdrew three-symbol tokens
from the language. The trickiest part is distinguishing between items 2
and 3 above, because most lexical analyzers that aren't hand-coded would
have a terrible time distinguishing between them on a routine basis.
That's the problem you get when you merge a lexical analyzer with a
parser. A parser parses tokens, not text streams. However, if you
integrate lexical-analyzer logic with the parser (this is often done;
don't worry, I'm not criticizing you, just look at the C# mono compiler
for example), you'll get problems, because you want to make decisions
based on text input, which is, IMHO, not correct: in the parser you
should only make decisions based on tokens. That way you can write LL(n)
or LR(n) parsers.
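To make that separation concrete, here is a rough sketch (it reuses the
hypothetical Token and Kind types from the sketch above, and the one-rule
grammar is invented) of a parser whose single-token lookahead works on
token kinds, never on raw characters:

import java.util.List;

// Builds on the Token/Kind types from the scanner sketch above.
final class AssignmentParser {
    private final List<Token> tokens;
    private int pos = 0;

    AssignmentParser(List<Token> tokens) { this.tokens = tokens; }

    private Kind peek() {
        return pos < tokens.size() ? tokens.get(pos).kind : null;
    }

    private Token consume(Kind expected) {
        if (peek() != expected) {
            throw new IllegalStateException("expected " + expected + " at token " + pos);
        }
        return tokens.get(pos++);
    }

    // assignment := IDENT (OR_ASSIGN | PIPE_ASSIGN) IDENT
    // The lookahead here is one *token*, never one character.
    void parseAssignment() {
        consume(Kind.IDENT);
        if (peek() == Kind.OR_ASSIGN) consume(Kind.OR_ASSIGN);
        else consume(Kind.PIPE_ASSIGN);
        consume(Kind.IDENT);
    }
}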
It is as simple as this: the more complex the logic needed, the greater
the likelihood of bugs and the lower the support from tools (like tools
that generate two-symbol lookahead tables). Each bug costs money to find
and fix. Therefore, increasing complexity for the sake of elegance at the
expense of the project is simply poor judgement on the part of the
project manager.
That's great, but it has nothing to do with this. I can use a variable
called bools, which matches the keyword bool right up to its fifth
character, yet the C# compiler is perfectly able to handle it. Why can it
do that, but not handle an operator of 3 characters? (Btw, what about
custom operators defined by the developer?)
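The bools/bool case is handled by the same longest-match idea: the
scanner runs the identifier path to the end of the lexeme and only then
asks a keyword table whether that exact spelling is reserved. A small
Java sketch, with an invented keyword table:

import java.util.Set;

final class KeywordLookup {
    // Tiny illustrative keyword table; a real compiler's is much larger.
    private static final Set<String> KEYWORDS =
            Set.of("bool", "if", "else", "return");

    // Scan an identifier-shaped lexeme starting at 'start', then classify it.
    static String classify(String input, int start) {
        int i = start;
        while (i < input.length()
                && Character.isLetterOrDigit(input.charAt(i))) i++;
        String lexeme = input.substring(start, i);
        return (KEYWORDS.contains(lexeme) ? "KEYWORD(" : "IDENT(") + lexeme + ")";
    }

    public static void main(String[] args) {
        System.out.println(classify("bool", 0));   // KEYWORD(bool)
        System.out.println(classify("bools", 0));  // IDENT(bools): the 's' keeps
                                                   // the identifier path going
    }
}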
I wasn't discussing bit operators. I'm concerned that this discussion
would jump off the rails quickly if I were to judge the language itself,
which I am not doing. I am only discussing the difficulty of debugging
the lexical scanner.
... which I appreciate, but IMHO that was not the point, i.e.: the
lexical analyzer can perfectly well tokenize a 3-character operator like
'||='. Remember, the lexical analyzer doesn't know that '||=' is an
operator; it could also be an identifier or whitespace, it doesn't care
about that. It simply checks its current state and the current input
character and moves on to the next state, either recognizing a token or
just pushing a new state. A lookahead for the lexical analyzer (parsers
also use lookaheads for tokens, but that's a totally different thing!) is
ONLY needed if the lexical analyzer has to decide what the next state is
and it can't do that based on the current state + the current input
character. This only happens if you have text input which looks the same
in ASCII but has different meanings based on the following character.
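For contrast, here is the kind of case where a lexer genuinely does need
one character of lookahead. It is a hypothetical language (think Pascal-
or Rust-style '..' ranges, not C#) where '1.5' is a float but '1..10' is
an int followed by a range operator; after reading the '1', the current
character '.' alone does not determine the next state:

final class RangeOrFloat {
    // Illustrative only. After the digits, seeing '.' is ambiguous: the
    // scanner must peek one more character to choose between the FLOAT
    // path (a digit follows) and stopping at INT (a second '.' follows).
    static String scanNumber(String input) {
        int i = 0;
        while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
        if (i < input.length() && input.charAt(i) == '.') {
            boolean digitFollows = i + 1 < input.length()
                    && Character.isDigit(input.charAt(i + 1));
            if (digitFollows) {                 // continue on the FLOAT path
                i++;
                while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                return "FLOAT(" + input.substring(0, i) + ")";
            }
            // otherwise stop here; '..' is left for the next token
        }
        return "INT(" + input.substring(0, i) + ")";
    }

    public static void main(String[] args) {
        System.out.println(scanNumber("1.5"));   // FLOAT(1.5)
        System.out.println(scanNumber("1..10")); // INT(1); the range token comes next
    }
}

The '|', '|=', '||', '||=' family never runs into this, because every
prefix of the longer operators ('|', '||') is itself a valid token.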
I am also not making a case for how useful this particular three-token
construct may seem to be. I would state, however, that there are many
_useful_ things that currently require more than a single operator. We
could debate endlessly on the "comparative usefulness" of a myriad of
operators that are designed to make the language more useful, in our own
personal opinion. Remember that your opinion about "what is useful" may not
be universally shared, as I'm sure that mine wouldn't either.
Please go back to what is likely the reason for the set of operators
like +=, -=, *=, etc.: easier usage of the language. Why is it that there
isn't an operator &&= but there is an operator &=? And no, not because of
lexical-analyzer issues: if I can write a lex + parser that can do it,
why can't MS do it?

As Jon said, it's not very common, so I then conclude: adding the
operator is not worth the effort based on that. IF it's not very common,
I can live with it. Problem is, why did I run into it in various cases
where I have to use a&=b, although it's theoretically 'dirty' IMHO, even
though the documentation says: "The & operator performs a bitwise AND
operation on integral operands and logical AND on bool operands."
(I.o.w.: for logical operands we don't do bitwise operations, we do
logical ones. Which is IMHO dirty, because as a reader of the code you
have to determine what the operands are to understand the code: a|=b.
Will that do a bitwise a OR b, or will it do a logical a OR b?)
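For what it's worth, Java behaves exactly the way that documentation
quote describes for C#: & and &= are bitwise on integral operands and
(non-short-circuit) logical on boolean operands, and there is no &&=
either. A small illustration of the readability point, in Java rather
than C#:

final class AndAssignDemo {
    public static void main(String[] args) {
        int flags = 0b1100;
        flags &= 0b1010;          // bitwise AND on integral operands
        System.out.println(Integer.toBinaryString(flags));  // prints 1000

        boolean ok = true;
        ok &= (args.length > 0);  // logical (non-short-circuit) AND on booleans
        System.out.println(ok);   // prints false when run with no arguments

        // ok &&= (args.length > 0);  // no such operator in Java (or C#)
    }
}

The two &= lines compile to very different operations, and the reader has
to know the operand types to tell which one they are looking at.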
An interesting question, but I don't see how this is salient to my point. I
hope you don't mind if I don't attempt to respond to it.
oh no problem, it was more of a response to the other reactions in the
thread as well (as I did with a couple of other snippets above). This
example was to illustrate how some constructs are said to be 'uncommon'
while they are just 'unknown' to these people.
Frans
--