Regex Help

Christopher Robin

I'm inserting a SharePoint List into a SQL Database, but some of the text has
oddly formed HTML tags. I want to remove these tags with a regular
expression, but I'm having some difficulty. My code is below.

Imports System
Imports System.Net
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Runtime
Imports System.Xml
Imports SharePointServices
Imports SharePointServices.NorthwindSync
Imports System.Text.RegularExpressions
Imports System.IO

Public Class ScriptMain

Public Sub Main()

Dim DocLoc As String
Dim TextDoc As TextWriter
Dim listService As New Lists()
Dim node As XmlNode
Dim strHtmlString As String
Dim pattern As String = "<[/]?(font|span|div|del|ins|color:\w+)[^>]*?"

DocLoc = "\\MYSERVER\MyFolder\MyFile.xml"

listService.PreAuthenticate = True
listService.Credentials = CredentialCache.DefaultNetworkCredentials

Try

node = ListHelper.GetAllListItems(listService, "My List Name")
strHtmlString = node.InnerXml()
Regex.Replace(strHtmlString, pattern, String.Empty, RegexOptions.IgnoreCase).Trim()

TextDoc = File.CreateText(DocLoc)
TextDoc.WriteLine(strHtmlString)
TextDoc.Flush()
TextDoc.Close()

Catch ex As Exception

'Raise the error again and set the result to failure.
Dts.Events.FireError(1, ex.TargetSite.ToString(), ex.Message, "", 0)
Dts.TaskResult = Dts.Results.Failure

End Try

Dts.TaskResult = Dts.Results.Success

End Sub

End Class

And here are a few samples of what I'm trying to remove with the Regex.

"<div></div>"
"<font size=2 color="#1F497D">"
"</font><br>&nbsp;"

Any help would be greatly appreciated.

Thanks,
Chris
 
Jesse Houwing

Hello Christopher,

What is it that isn't working right now? It looks like you're nearly there.

Your pattern isn't quite how I'd write it. If that's what's currently
bothering you, try the following:

</?(?:font|span|div|more tags here)[^>]*>

And there seems to be a little error in your code: Regex.Replace doesn't
alter the original string (strings are immutable in .NET), but it returns
a new string instead, so the following code needs to be changed:

strHtmlString = node.InnerXml()
strHtmlString = Regex.Replace(strHtmlString, pattern, String.Empty, RegexOptions.IgnoreCase).Trim()
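
To see both fixes together, here is a minimal console sketch, not the SSIS script itself. It assumes a standalone module (the name StripTagsDemo is made up for illustration) and a test string built from the samples in the question, with the word "Hello" added as placeholder text. The pattern is the one suggested above with del and ins filled in from the original tag list. Note that the original pattern ends in a lazy [^>]*? with nothing after it, so that part matches zero characters and the attributes and closing > are left behind; the trailing > in the revised pattern is what makes it consume the whole tag.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module StripTagsDemo

    ' Pattern from the reply, with del and ins added from the original tag list.
    ' The closing ">" forces [^>]* to consume the attributes up to the end of the tag.
    Private Const TagPattern As String = "</?(?:font|span|div|del|ins)[^>]*>"

    Sub Main()
        ' Sample input assembled from the question; "Hello" is placeholder text.
        ' Doubled quotes escape a quote inside a VB string literal.
        Dim input As String = "<div></div><font size=2 color=""#1F497D"">Hello</font><br>&nbsp;"

        ' Regex.Replace returns a new string, so the result has to be assigned.
        Dim cleaned As String = Regex.Replace(input, TagPattern, String.Empty, RegexOptions.IgnoreCase).Trim()

        ' <br> and &nbsp; are not covered by the pattern, so they remain in the output.
        Console.WriteLine(cleaned)
    End Sub

End Module
```

If br tags and entities such as &nbsp; should go as well, they would need to be added to the alternation or handled in a separate replace.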

If that doesn't work, then please explain what it is that isn't working :).

Jesse