regular expressions

JFB

Hi all,
What regular expression pattern do I need to get the first paragraph between "<b>" tags in a string?
String = "<b>match sample test<b><b>match2 sample2 test2<b>"
Expected result = "match sample test"
Thanks

JFB
 
JFB

BTW,
With this pattern I get all the text between the first and the last <br>:
<(?<tag>\w*)>(?<text>.*)<(?<tag>\w*)>
How can I match only up to the next <br>?
Thanks
 
Martin Honnen

JFB said:
BTW,
With this pattern I get all the text between the first and the last <br>:
<(?<tag>\w*)>(?<text>.*)<(?<tag>\w*)>
How can I match only up to the next <br>?

Try non-greedy matching with
.*?
instead of
.*
Or use
[^<]*
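
To make the difference concrete, here is a minimal sketch that runs the three variants against the sample string from the first post. The module name GreedyVsLazy and the helper FirstText are made up for illustration, and the pattern is simplified to a literal <b> tag rather than the (?<tag>\w*) group used above.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GreedyVsLazy
    ' Hypothetical helper: returns the "text" group of the first match
    ' (an empty string when the pattern does not match).
    Function FirstText(ByVal source As String, ByVal pattern As String) As String
        Return Regex.Match(source, pattern).Groups("text").Value
    End Function

    Sub Main()
        Dim sample As String = "<b>match sample test<b><b>match2 sample2 test2<b>"

        ' Greedy: .* runs to the end of the string and backtracks to the LAST <b>.
        Console.WriteLine(FirstText(sample, "<b>(?<text>.*)<b>"))

        ' Lazy: .*? stops at the first <b> that lets the match succeed.
        Console.WriteLine(FirstText(sample, "<b>(?<text>.*?)<b>"))

        ' Negated class: [^<]* cannot cross a "<", so it also stops at the next tag.
        Console.WriteLine(FirstText(sample, "<b>(?<text>[^<]*)<b>"))
    End Sub
End Module
```

The greedy call prints "match sample test<b><b>match2 sample2 test2", while the lazy and negated-class calls both print "match sample test".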
 
JFB

Thanks for your reply and help.
Using <(?<tag>\w*)>(?<text>[^<]*)<(?<tag>\w*)>
it skips ahead and returns the last one:

"match2 sample2 test2"
Any other ideas?


 
JFB

Never mind, it looks like the string is different. Now I'm confused.
String = "<br><br>match sample test<br>match2 sample2 test2<br>"
How can I get this result?
Result = "match sample test"
Thanks!




 
JFB

This is getting better :). The string now is
String = "<br /><br />match sample test<br />match2 sample2 test2<br />"
Please help, now I can't get a match at all.
Thanks



 
Martin Honnen

JFB said:
This is getting better :). The string now is
String = "<br /><br />match sample test<br />match2 sample2 test2<br />"
Please help, now I can't get a match at all.

That looks like an XML fragment now, so you could parse it as XML, e.g.

Imports System.IO
Imports System.Xml
Imports System.Xml.XPath

' Inside a Sub or Function:
Dim xml As String = "<br /><br />match sample test<br />match2 sample2 test2<br />"
Dim settings As New XmlReaderSettings()
' Allow several top-level nodes instead of a single root element
settings.ConformanceLevel = ConformanceLevel.Fragment
Dim doc As New XPathDocument(XmlReader.Create(New StringReader(xml), settings))
' First text node that follows a top-level br element
Dim text As XPathNavigator = doc.CreateNavigator().SelectSingleNode("br/following-sibling::text()")
If text IsNot Nothing Then
    Console.WriteLine(text.Value)
End If

would output "match sample test".

Your earlier samples however were not XML fragments or documents so the
above approach would not work with them.
But if you know the input is an XML document or fragment then I wouldn't
bother to try to parse it with regular expressions but instead exploit
the power of XPath.
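
For input that is not well-formed XML, such as the earlier <br> sample, a regular expression remains an option. The following is only a sketch under assumptions: the helper name FirstTextAfterBreaks is made up, and the pattern assumes the wanted text is the first run of non-tag characters that follows one or more <br>, <br/> or <br /> tags.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module FirstTextAfterBr
    ' Hypothetical helper: returns the first run of text that follows one or more
    ' <br>, <br/> or <br /> tags, or Nothing when there is no such text.
    Function FirstTextAfterBreaks(ByVal source As String) As String
        ' (?:<br\s*/?>\s*)+  one or more break tags, each with optional trailing whitespace
        ' (?<text>[^<]+)     everything up to the next "<"
        Dim m As Match = Regex.Match(source, "(?:<br\s*/?>\s*)+(?<text>[^<]+)", RegexOptions.IgnoreCase)
        If m.Success Then
            Return m.Groups("text").Value.Trim()
        End If
        Return Nothing
    End Function

    Sub Main()
        Console.WriteLine(FirstTextAfterBreaks("<br><br>match sample test<br>match2 sample2 test2<br>"))
        Console.WriteLine(FirstTextAfterBreaks("<br /><br />match sample test<br />match2 sample2 test2<br />"))
    End Sub
End Module
```

Both calls print "match sample test". The sketch does not handle text that itself contains a "<" character.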
 
eBob.com

If you don't have it, get Expresso from UltraPico. It's a FREE tool which
makes it very easy to experiment with regular expressions.

Bob
 
JFB

Thanks again for your reply and help.
When I run this code, I get this error:
' ', hexadecimal value 0x0B

It looks like the data from this doc file is not valid, but when I open the Word file in Notepad it looks fine, with HTML formatting.
Maybe the XML parser has a problem reading my text?
The <br \> shows as a square.
Do you have any idea how to solve this?
Regards

J:)hnny
 
Martin Honnen

JFB said:
When I run this code, I get this error:
' ', hexadecimal value 0x0B

It looks like the data from this doc file is not valid, but when I open the Word file in Notepad it looks fine, with HTML formatting.
Maybe the XML parser has a problem reading my text?
The <br \> shows as a square.
Do you have any idea how to solve this?

Exactly which code do you run, and which statement gives that error? What exactly does the input look like? Does it contain characters
that are not allowed in XML, such as control characters?

So far you have shown only variables holding strings of markup.
If you have a file instead, then you need to show how you read the
file contents into a string. With XML, though, you would normally let
the XML parser do all that work: if you have a file file1.xml, you
would simply change the code I posted to

Dim settings As New XmlReaderSettings()
settings.ConformanceLevel = ConformanceLevel.Fragment
Dim doc As New XPathDocument(XmlReader.Create("file1.xml", settings))
Dim text As XPathNavigator = doc.CreateNavigator().SelectSingleNode("br/following-sibling::text()")
If text IsNot Nothing Then
Console.WriteLine(text.Value)
End If

If you still have problems, then you need to provide more details about
where the file comes from and how it is encoded.
 
JFB

Martin Honnen said:
Which code exactly do you run that gives that error for which statement

Error: {"' ', hexadecimal value 0x0B, is an invalid character. Line 1, position 1."}
The error is raised on this line:
Dim doc As New XPath.XPathDocument(XmlReader.Create(New StringReader(tempcontent), settings))

Martin Honnen said:
How does the input exactly look?

I have a Word doc file that I need to read to get the name from the address block.
The paragraph looks like this when I open the file in Notepad:
<br /><br />
SHLOMI HELWA<br />
563 ELTINGVILLE BLVD.<br />
STATEN ISLAND, NY 10312<br />
<br />
<br />
<br />

Martin Honnen said:
Does it contain characters that are not allowed in XML, such as control
characters?
So far you have shown only variables with strings of markup.
If you have a file instead then you will need to show how you read the
file contents into a string respectively in terms of XML you would
normally let the XML parser do all that work meaning if you have a file
file1.xml then you would simply change the code I posted to

Please send me an email at jfb00(at)hotmail.com and I can send you the Word file.
I have many Word files from which I need to collect only the name in the address
block, so I read each file and take the paragraph that contains the address block.
Here is my code:
Try
    ' For Office XP
    wordApp = CreateObject("Word.Application")
    wordDoc = CreateObject("Word.document")
Catch
    ' For Office 2000 and 97
    wordApp = New Word.Application
    wordDoc = New Word.Document
End Try
wordApp.Visible = False
wordDoc = wordApp.Documents.Open(FileName:=docName.ToString)

Dim tempcontent As String = ""
Dim subPara As Word.Paragraph
Dim paraCount As Integer
paraCount = 0
For Each subPara In wordDoc.Paragraphs
    tempcontent = subPara.Range.Text
    paraCount = paraCount + 1
    If paraCount = 5 Then ' Here I get the address block
        Exit For
    End If
Next

Dim settings As New XmlReaderSettings()
settings.ConformanceLevel = ConformanceLevel.Fragment
settings.CheckCharacters = True
Dim doc As New XPath.XPathDocument(XmlReader.Create(New StringReader(tempcontent), settings))
Dim text As XPath.XPathNavigator = doc.CreateNavigator().SelectSingleNode("br/following-sibling::text()")
If text IsNot Nothing Then
    MsgBox(text.Value)
End If



Thanks for your help!
 
JFB

Thanks for your reply, Bob.
I already have it, but it doesn't help in my case because I have some
special characters in my file.
Rgds
 
Martin Honnen

JFB said:
Error: {"' ', hexadecimal value 0x0B, is an invalid character. Line 1, position 1."}
The error is raised on this line:
Dim doc As New XPath.XPathDocument(XmlReader.Create(New StringReader(tempcontent), settings))

I have a Word doc file that I need to read to get the name from the address block.

I am afraid a Word document can contain characters that are not allowed
in XML documents, so using an XML parser on the contents will not work
unless you first strip the characters that are not allowed.
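
One possible way to do that stripping is sketched below. It assumes the only offenders are the C0 control characters that XML 1.0 forbids, which includes the 0x0B (vertical tab) named in the error message. The helper name StripInvalidXmlChars is made up for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module XmlCleanup
    ' Hypothetical helper: removes the control characters that XML 1.0 does not allow,
    ' i.e. everything below 0x20 except tab (0x09), line feed (0x0A)
    ' and carriage return (0x0D).
    Function StripInvalidXmlChars(ByVal source As String) As String
        Return Regex.Replace(source, "[\x00-\x08\x0B\x0C\x0E-\x1F]", String.Empty)
    End Function

    Sub Main()
        ' Sample input with a leading vertical tab, like the one reported at line 1, position 1.
        Dim raw As String = ChrW(&HB) & "<br />SHLOMI HELWA<br />"
        Console.WriteLine(StripInvalidXmlChars(raw))
    End Sub
End Module
```

The cleaned string could then be passed to the StringReader in place of tempcontent. If the removed characters stand for line breaks in the Word text, replacing them with a newline instead of an empty string may be the better choice.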
 
JFB

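I used arrays and it works:
Dim ArrayCadenas() As String

' Split on the tag and drop the empty entries produced by the leading "<br \><br \>"
ArrayCadenas = "<br \><br \>SHLOMI HELWA<br \>563 ELTINGVILLE BLVD.<br \>STATEN ISLAND, NY 10312<br \><br \>".Split(New String() {"<br \>"}, StringSplitOptions.RemoveEmptyEntries)

' The first remaining entry is the name
MsgBox(ArrayCadenas(0))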

Thanks for you reply and help!
 
