How to Parse a String with Embedded Double Quotes


Charles Law

I have a string similar to the following:

" MyString 40 "Hello world" all "

It contains white space that may be spaces or tabs, or a combination, and I
want to produce an array with the following elements

arr(0) = "MyString"
arr(1) = 40
arr(2) = "Hello world"
arr(3) = "all"

Using trim and a regular expression ("\s+"), I can reduce my string to

"MyString 40 "Hello world" all"

and with Split I can get

arr(0) = "MyString"
arr(1) = 40
arr(2) = ""Hello"
arr(3) = "world""
arr(4) = "all"

As you can see, it is not quite what I need. The spaces in "Hello world"
have been reduced to a single space, and Split does not respect the double
quotes, and splits "Hello world" over two elements.

Does anyone have an idea how I could do this? I could process the string
character by character, but I am hoping that there is a straight-forward
technique for doing it, without looping, and using some of the techniques I
already have.

TIA

Charles
 

Cor Ligthert

Charles,

I looked at the problem and hesitated to answer, because my approach is
difficult to describe. Then I saw that the question was yours, so that
should not be a problem.

In this kind of situation I replace the spaces that must not act as
separators with a character that is guaranteed not to occur in the string.

Then do the split.

Then replace that character with a space again.

I assume that this is more than enough explanation for you.

And now that you have read this, you will say: I knew that as well.

:)))

Cor

Charles Law

Hi Cor

You read my mind ;-)

I had thought of using something like #, as it will never occur in my
string. But then I started to look at how I would know which spaces to
replace with #, and which to leave. Of course, to the human eye it is
obvious that I only replace the spaces between " and ", but now I am back to
processing each part of the string character by character so that I match
double quotes correctly, and this is what I was trying to avoid.

Perhaps there is a regex expression that will match double quotes, or a
method that parses a string taking these into account, but sadly I do not
know it yet.

But please, keep the suggestions flowing.

Charles
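
For readers following along, the placeholder idea discussed above can be sketched roughly as follows. This is only an illustration, not code from the thread: the names PlaceholderSplit, SplitQuoted and HideSpaces are made up, ChrW(1) is an arbitrary placeholder that is assumed never to occur in the input, and the double quotes are assumed to be balanced. A regex finds the quoted sections, so no character-by-character loop is needed. Note that any whitespace inside quotes comes back as a plain space.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module PlaceholderSplit

    ' Hypothetical placeholder; assumed never to occur in the input.
    Private Const Placeholder As Char = ChrW(1)

    ' Called once per quoted section: hides its whitespace and drops the quotes.
    Private Function HideSpaces(ByVal m As Match) As String
        Return Regex.Replace(m.Groups(1).Value, "\s", Placeholder.ToString())
    End Function

    Function SplitQuoted(ByVal input As String) As String()
        ' 1. Replace whitespace inside "..." with the placeholder.
        Dim hidden As String = Regex.Replace(input, """([^""]*)""", _
            New MatchEvaluator(AddressOf HideSpaces))
        ' 2. Split the trimmed string on runs of whitespace.
        Dim parts() As String = Regex.Split(hidden.Trim(), "\s+")
        ' 3. Turn each placeholder back into a space.
        For i As Integer = 0 To parts.Length - 1
            parts(i) = parts(i).Replace(Placeholder, " "c)
        Next
        Return parts
    End Function

    Sub Main()
        For Each part As String In SplitQuoted(" MyString 40 ""Hello world"" all ")
            Console.WriteLine("'{0}'", part)
        Next
    End Sub

End Module
```

For the sample string this should print 'MyString', '40', 'Hello world' and 'all', one per line.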
 

Robby

Try

(\s*"([\s\w]*)")|(\s*(\w+))

Then do a Replace on each Match object with

$2$4

This will return either your double-quoted string without the quotes, or the
word token without the leading whitespace, depending on which alternative
the Match object matched.

You just have to love Regular Expressions.

--Robby
 

Charles Law

Hi Robby

Thanks for the reply. I am not sure that I understand the regular expression
(\s*"([\s\w]*)")|(\s*(\w+))

I tried the following, but of course it gives a syntax error because of the
embedded double quotes:

Dim reg As Regex = New Regex("(\s*"([\s\w]*)")|(\s*(\w+))")

So I tried escaping the double quotes, like this

Dim reg As Regex = New Regex("(\s*""([\s\w]*)"")|(\s*(\w+))")

but this cleared my string out to a couple of spaces when I did a replace.

Any chance of a small snippet to get me on the right track, using the Match
object?

Thanks very much.

Charles



Charles Law

Hi Robby - me again

I have it now; I just needed to apply a few of those grey cells I have
knocking about.

Cheers.

Charles



Robby

Create a console application:

#########################

Imports System.Text.RegularExpressions

Module MainModule

    Sub Main()

        ' Group 2 captures the contents of a quoted string;
        ' group 4 captures a bare word.
        Dim rePost As New Regex("(\s*""([\s\w]*)"")|(\s*(\w+))")
        Dim testString As String = "MyString 40 ""Hello world"" all "
        Dim allMatches As MatchCollection = rePost.Matches(testString)

        Dim matchPiece As Match
        Dim I As Integer

        For I = 0 To allMatches.Count - 1
            matchPiece = allMatches(I)
            ' Only one of $2 and $4 is non-empty for any given match.
            Console.WriteLine("Piece {0} -> '{1}'", I, matchPiece.Result("$2$4"))
        Next I

    End Sub

End Module

####################

--Robby


Cor Ligthert

Charles,

When you are done, can you give us an idea of how much time it took to find
the regex and how much time the straightforward technique would take, and
also run a test to see which method is faster?

To get a good idea for the discussions about Regex versus straightforward
code, I looked at it, and I think the straightforward approach would
probably take me less than 30 minutes, so probably less than 15 for you.

:)

Cor


Charles Law

Thanks again Robby. I came up with something similar in the end, using For
.... Each to go through the match collection, and reg.Replace instead of
match.Result, but it came down to the same thing.

Charles



Charles Law

Cor

Do you mean how long does it take to parse the string using RegEx against
parsing it character by character, or how long did it take to come up with
the solution?

I think the RegEx solution is by far the neatest, and most flexible. It is
also fewer lines of code. I do not have a solution parsing
character-by-character, so I cannot measure how long to create or run, but I
think you have it about right.

Charles



Robby

Hummm ... It took me two tries to get this regular expression. I don't
have the opportunity to use Regex a lot, but I quite like them and dive in
if I have a spare moment. I'd say 4 to 6 minutes to solve it. In the old VB6
days I would have done a Find to get my quote indexes, split on the double
quote, and re-split the pieces outside the double quotes. Then trimmed them.
That would take more time to code and check for errors.

Robby
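
The split-and-resplit approach described above might look something like this in VB.NET. It is a sketch under stated assumptions rather than Robby's actual VB6 code: the names QuoteSplit and SplitQuoted are hypothetical, the quotes are assumed to be balanced, and quoted sections are assumed not to contain embedded quotes.

```vbnet
Imports System
Imports System.Collections.Generic

Module QuoteSplit

    Function SplitQuoted(ByVal input As String) As List(Of String)
        Dim tokens As New List(Of String)
        ' A null separator array makes String.Split use white-space characters.
        Dim whitespace() As Char = Nothing
        ' Splitting on the quote character alternates between text outside
        ' quotes (even index) and text inside quotes (odd index).
        Dim pieces() As String = input.Split(""""c)
        For i As Integer = 0 To pieces.Length - 1
            If i Mod 2 = 1 Then
                ' Inside quotes: keep the piece exactly as it is.
                tokens.Add(pieces(i))
            Else
                ' Outside quotes: re-split on whitespace, dropping empty entries.
                For Each word As String In _
                    pieces(i).Split(whitespace, StringSplitOptions.RemoveEmptyEntries)
                    tokens.Add(word)
                Next
            End If
        Next
        Return tokens
    End Function

    Sub Main()
        For Each token As String In SplitQuoted(" MyString 40 ""Hello world"" all ")
            Console.WriteLine("'{0}'", token)
        Next
    End Sub

End Module
```

Unlike the placeholder approach, this keeps tabs and repeated spaces inside the quoted section untouched.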



Cor Ligthert

Charles and Robby,

I meant "thinking, writing and performance".

Because I had said it, I felt I should do it.

This is the first time I have seen a regex run faster than a straightforward
loop.

Imports System
Imports System.Collections
Imports System.Text.RegularExpressions
Public Class Hello
    Public Shared Sub main()
        ' Straightforward loop
        Dim start As Integer = Environment.TickCount
        For y As Integer = 0 To 10000
            Dim mystring As String = _
                " MyString 40 ""Hello world"" all"
            Dim myarrlist As New ArrayList
            Dim endWord As Integer
            For i As Integer = 0 To mystring.Length - 1
                If mystring.Substring(i, 1) <> " " Then
                    If mystring.Substring(i, 1) = """" Then
                        ' Quoted token: take everything up to the closing quote.
                        endWord = mystring.Substring(i + 1).IndexOf("""") + 1
                        myarrlist.Add(mystring.Substring(i + 1, endWord - 1))
                        i = i + endWord
                    Else
                        ' Bare token: take everything up to the next space (or the end).
                        endWord = mystring.Substring(i).IndexOf(" ")
                        If endWord = -1 Then endWord = mystring.Length - i
                        myarrlist.Add(mystring.Substring(i, endWord))
                        i = i + endWord
                    End If
                End If
            Next
        Next
        Console.WriteLine((Environment.TickCount - start).ToString)
        'Regex
        start = Environment.TickCount
        For y As Integer = 0 To 10000
            Dim testString As String = "MyString 40 ""Hello world"" all "
            Dim rePost As New Regex("(\s*""([\s\w]*)"")|(\s*(\w+))")
            ' Note: Matches is lazily evaluated; nothing reads allMatches here,
            ' so this loop may not be timing the actual matching.
            Dim allMatches As MatchCollection = rePost.Matches(testString)
        Next
        Console.WriteLine((Environment.TickCount - start).ToString)
    End Sub
End Class
///
For me the timing ratio is about 3:2 in favour of the regex. I did not
really try to make the loop faster, because as I said it had to be done in a
short time.

So Charles, next time you have to post "Robby, can you help me again?",
because what Robby did with Regex is of course a wonderful thing.

:)

Cor


Charles Law

Robby

I have just come across a valid (in my context) string that is split into
too many matches. The string is

"PartA PartB PartC(plus) PartD"

The regex breaks it into

PartA
PartB
PartC
plus
PartD

Can you see a refinement for the regular expression to keep PartC(plus) as
one element?

Charles


Robby said:
Try

(\s*"([\s\w]*)")|(\s*(\w+))

Then do a Replace on each Match object with

$2$4

This will return either your double-quoted string without the quotes or
the word token without the whitespace characters, depending on which
alternative the Match object holds.

You just have to love Regular Expressions.

--Robby
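
For readers who want to try Robby's suggestion, here is a minimal sketch. It uses Python's re module as a stand-in for the .NET Regex class the thread is about; the function name tokenize_robby and the sample inputs are assumptions made for illustration. Instead of a Replace with $2$4, it reads groups 2 and 4 directly, which should give the same text because only one of the two alternatives takes part in any match.

```python
import re

# Robby's pattern from the post above, unchanged.
ROBBY_PATTERN = re.compile(r'(\s*"([\s\w]*)")|(\s*(\w+))')

def tokenize_robby(text):
    """Return the quoted text (group 2) or the bare word (group 4) of each match."""
    tokens = []
    for m in ROBBY_PATTERN.finditer(text):
        # Stand-in for Replace with "$2$4": the group that did not
        # participate is None here, so it contributes an empty string.
        tokens.append((m.group(2) or "") + (m.group(4) or ""))
    return tokens

print(tokenize_robby(' MyString \t40  "Hello   world"  all '))
# -> ['MyString', '40', 'Hello   world', 'all']

print(tokenize_robby('PartA PartB PartC(plus) PartD'))
# -> ['PartA', 'PartB', 'PartC', 'plus', 'PartD']
```

The second call reproduces the over-splitting Charles reports: \w does not match ( or ), so PartC(plus) is broken at the brackets.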




 
C

Charles Law

Also, despite what I said earlier to Cor, # could appear in my string, and
it also causes a split when I don't want it to. In fact, the only characters
that should cause a split (outside matched double quotes) are

space, tab, CR, LF, FF, or other control characters

Any ideas?

Charles
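
One possible way to express that rule, sketched here as an assumption rather than anything proposed in the thread, is to describe a bare token as a run of characters that are not separators, instead of listing every character that is allowed. The example uses Python's re module as a stand-in for .NET; the name split_outside_quotes and the pattern itself are hypothetical.

```python
import re

# Either a double-quoted run (group 1), or a run of characters that are
# not whitespace, not control characters, and not a double quote (group 2).
TOKEN_PATTERN = re.compile(r'"([^"]*)"|([^\s\x00-\x1f\x7f"]+)')

def split_outside_quotes(text):
    """Split on whitespace and control characters, except inside matched quotes."""
    return [
        m.group(1) if m.group(1) is not None else m.group(2)
        for m in TOKEN_PATTERN.finditer(text)
    ]

print(split_outside_quotes('PartA PartB PartC(plus) PartD'))
# -> ['PartA', 'PartB', 'PartC(plus)', 'PartD']

print(split_outside_quotes(' MyString \t40  "Hello   world"  all '))
# -> ['MyString', '40', 'Hello   world', 'all']
```

With this shape, characters such as # or brackets never need to be listed, because anything that is not a separator stays inside the token. An unmatched double quote is simply skipped in this sketch, which may or may not be the behaviour wanted.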


 
C

Charles Law

I think I have it. I have changed the pattern to

(\s*""([\s\w\x23\x28\x29]*)"")|(\s*([\w\x23\x28\x29]+))

which seems to do the trick, unless anyone can spot a flaw in this.

Charles
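
A quick way to check the revised pattern is sketched below, again with Python's re module standing in for .NET. It assumes the doubled quotes ("") in the pattern above are VB string-literal escaping for a single ", so the sketch uses one quote; \x23, \x28 and \x29 are the characters #, ( and ). The name tokenize_extended and the test strings are made up for illustration.

```python
import re

# Charles's revised pattern, with "" written as a single " on the
# assumption that the doubling is VB string-literal escaping.
EXTENDED_PATTERN = re.compile(
    r'(\s*"([\s\w\x23\x28\x29]*)")|(\s*([\w\x23\x28\x29]+))'
)

def tokenize_extended(text):
    """Return group 2 (quoted text) or group 4 (bare token) for each match."""
    return [
        (m.group(2) or "") + (m.group(4) or "")
        for m in EXTENDED_PATTERN.finditer(text)
    ]

print(tokenize_extended('PartA PartB PartC(plus) PartD'))
# -> ['PartA', 'PartB', 'PartC(plus)', 'PartD']

print(tokenize_extended(' MyString 40 "Hello #1 (world)" all '))
# -> ['MyString', '40', 'Hello #1 (world)', 'all']

# Characters that are not in the class still break a token:
print(tokenize_extended('half-way 3.5'))
# -> ['half', 'way', '3', '5']
```

The last call points at one possible flaw: any character that is not listed in the class, such as - or ., still splits a token and is dropped from the output. Whether that matters depends on what can appear in the real input.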


 
