HTMLdecode for ACC97

Olivier Delrieu

Dear All,

I am transferring XML data from web pages to an ACC97 table. FYI, I use
the wininet.dll APIs as suggested in:
http://support.microsoft.com/default.aspx?scid=kb;en-us;232194

Here is the kind of XML data I want to transfer and process :
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=XML&list_uids=7442

Of course, the data dumped into the string looks like this:
&lt;Date-std_year&gt;2003&lt;/Date-std_year&gt;
instead of:
<Date-std_year>2003</Date-std_year>

So, which reference (library) provides an HTMLdecode function (or anything
equivalent that could do the trick)? I think it is installed with IIS/ASP,
but here I am client side, not server side.

FYI, if it is unclear, here is the description of this function, installed
in the Longhorn server:
http://msdn.microsoft.com/library/d...mwebhttpserverutilityclasshtmldecodetopic.asp

I have written a procedure, but it is slow and dirty, and every day I
discover a new &something; code ...

Thanks,

Olivier.
 
Mike D Sutton

Rather than parsing it yourself, you can simply use the Microsoft XML library ("Project -> References -> Microsoft XML, v#.#").
Since your XML is embedded within an HTML page, you'll have to download the page and strip the XML data out yourself rather than
getting MSXML to do it for you. Assuming you've written it to a file (without the DOCTYPE tag, which causes problems with the
parser when loading from a local source), you can use this code:

'***
Dim DOMDoc As DOMDocument
Dim IDNode As MSXML2.IXMLDOMNode

Const GeneID As String = "Entrezgene_track-info/Gene-track/Gene-track_geneid"
Const FilePath As String = "X:\SourceXML.xml" ' Set your data path here

Set DOMDoc = New DOMDocument

DOMDoc.async = False
If (DOMDoc.Load(FilePath)) Then
    Set IDNode = DOMDoc.documentElement.selectSingleNode(GeneID)

    If (IDNode Is Nothing) Then
        Call MsgBox("Can't find gene ID!", vbCritical, "Gene ID")
    Else
        Call MsgBox("Gene ID: " & IDNode.Text, vbInformation, "Gene ID")
    End If
Else
    Call MsgBox(DOMDoc.parseError.reason, vbCritical, "XML parse error")
End If

Set DOMDoc = Nothing
'***

This attempts to find the Gene ID from the XML file as an example, but you can traverse the file as you wish. You also needn't
write the data out to disk: use the LoadXML() method of the DOMDocument object rather than Load() to load the data from a
string rather than from disk.
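
As a minimal sketch of the LoadXML() route, assuming the same Microsoft XML reference as the snippet above: the function name GeneIdFromString and the sample string in the usage line are made up for illustration and are not part of the original code.

```vb
' Sketch only: parse XML held in a string with loadXML() instead of Load().
' Assumes a project reference to Microsoft XML (MSXML2).
' GeneIdFromString is a hypothetical helper name.
Public Function GeneIdFromString(ByVal xmlText As String) As String
    Dim doc As MSXML2.DOMDocument
    Dim node As MSXML2.IXMLDOMNode

    Set doc = New MSXML2.DOMDocument
    doc.async = False

    If doc.loadXML(xmlText) Then
        ' Same path as the GeneID constant above, relative to the root element
        Set node = doc.documentElement.selectSingleNode( _
            "Entrezgene_track-info/Gene-track/Gene-track_geneid")
        If Not (node Is Nothing) Then GeneIdFromString = node.Text
    Else
        ' Parsing failed: report why and return an empty string
        Debug.Print doc.parseError.reason
    End If

    Set doc = Nothing
End Function
```

It could be called from the Immediate window with a hand-made test string, for example:
Debug.Print GeneIdFromString("<Entrezgene><Entrezgene_track-info><Gene-track><Gene-track_geneid>7442</Gene-track_geneid></Gene-track></Entrezgene_track-info></Entrezgene>")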
Here's a couple of older posts with more sample code for using the library:
http://groups.google.co.uk/[email protected]
http://groups.google.co.uk/groups?th=9fda2172e8ac1ad1
Hope this helps,

Mike


- Microsoft Visual Basic MVP -
E-Mail: (e-mail address removed)
WWW: Http://www.mvps.org/EDais/
 
Olivier Delrieu

Hi Mike,

Thanks for your answer. Unfortunately this does not help. I am already using
the XML library to parse the data, sorry if I was not clear. I am using the
same kind of code you described :

domdoc.loadXML(sDump)

where sDump is the string coming from the function (I call it dumpURL)
described in :
http://support.microsoft.com/default.aspx?scid=kb;en-us;232194

sDump looks like this :

<html>....
<body> ... blabla...
<pre>
....
&lt;Date-std_year&gt;2003&lt;/Date-std_year&gt;
....
</pre>
....blabla....
</body>
</html>

The XML code is between the <pre> tags. If the "&lt;" and "&gt;" are not
decoded, then domdoc.loadXML(sDump) fails. Everything
works fine if I use something like:

sURL = "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=XML&list_uids=7442"
sDump = dumpUrl(sURL)
sDump = ExtractTheStringBetweenPreTags(sDump)
sDump = HTMLdecode(sDump)
domdoc.loadXML(sDump)
....

If I do not include the home made HTMLdecode() function this does not work
.... I am just in need of an official HTMLdecode function....
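
For readers wondering what a home-made decoder of this kind might look like, here is a rough sketch. It is not the procedure referred to in this post, and the names SimpleHtmlDecode and EntityChar are invented for the example. It deliberately uses only InStr and Mid$, on the assumption that the VBA version shipped with Access 97 has no Replace() function. It handles the five predefined XML entities plus numeric references such as &#233; or &#xE9;, and copies anything else through unchanged, so named HTML entities like &nbsp; would still need to be added by hand.

```vb
' Sketch only: single-pass entity decoder built on InStr/Mid$.
' SimpleHtmlDecode and EntityChar are hypothetical names.
Public Function SimpleHtmlDecode(ByVal s As String) As String
    Dim pos As Long, amp As Long, semi As Long
    Dim ent As String, rep As String, out As String

    pos = 1
    Do
        amp = InStr(pos, s, "&")
        If amp = 0 Then Exit Do
        out = out & Mid$(s, pos, amp - pos)     ' copy text before the "&"
        semi = InStr(amp + 1, s, ";")
        rep = ""
        If semi > 0 And semi - amp <= 10 Then   ' plausible entity length
            ent = Mid$(s, amp + 1, semi - amp - 1)
            rep = EntityChar(ent)
        End If
        If Len(rep) > 0 Then
            out = out & rep                     ' known entity: emit character
            pos = semi + 1
        Else
            out = out & "&"                     ' unknown: keep the "&" as-is
            pos = amp + 1
        End If
    Loop
    SimpleHtmlDecode = out & Mid$(s, pos)       ' copy the remaining tail
End Function

Private Function EntityChar(ByVal ent As String) As String
    Select Case ent
        Case "lt": EntityChar = "<"
        Case "gt": EntityChar = ">"
        Case "amp": EntityChar = "&"
        Case "quot": EntityChar = """"
        Case "apos": EntityChar = "'"
        Case Else
            If Left$(ent, 1) = "#" Then
                On Error Resume Next    ' malformed number: result stays empty
                If LCase$(Mid$(ent, 2, 1)) = "x" Then
                    EntityChar = ChrW$(CLng("&H" & Mid$(ent, 3)))
                Else
                    EntityChar = ChrW$(CLng(Mid$(ent, 2)))
                End If
                On Error GoTo 0
            End If
    End Select
End Function
```

Because the string is scanned once from left to right, "&amp;lt;" decodes to "&lt;" rather than to "<". The repeated string concatenation is not tuned for very large pages.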

Olivier.

 
Mike D Sutton

Here's one method you could use, based on the sample code in the article you cited in your original post. Add these declarations:

'***
Dim XMLStart As Long, XMLEnd As Long
Dim DTStart As Long, DTEnd As Long
Dim FullXML As String

Const XMLTrigger As String = "&lt;?xml version=&quot;1.0&quot;?&gt;"
Const XMLTerminator As String = "&lt;/Entrezgene&gt;"
Const DTTRigger As String = "&lt;!DOCTYPE"
Const DTTerminator As String = "&gt;"
'***

Now change the URL to the one you're trying to download, i.e.:

'***
sUrl = "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=XML&list_uids=7442"
'***

Then replace the file writing with this:

'***
XMLStart = InStr(1, sBuffer, XMLTrigger)
XMLEnd = InStr(XMLStart + 1, sBuffer, XMLTerminator) + Len(XMLTerminator)
DTStart = InStr(XMLStart + 1, sBuffer, DTTRigger)
DTEnd = InStr(DTStart + 1, sBuffer, DTTerminator) + Len(DTTerminator)

If ((XMLEnd > XMLStart) And (DTEnd > DTStart)) Then
    Call DeXML( _
        Mid$(sBuffer, XMLStart, DTStart - XMLStart) & _
        Mid$(sBuffer, DTEnd, XMLEnd - DTEnd))
End If
'***

And finally, the DeXML() function looks like this:

'***
Private Sub DeXML(ByRef inXML As String)
    Dim XMLDom As MSXML2.DOMDocument

    Set XMLDom = New MSXML2.DOMDocument

    ' 1st stage: wrap the encoded text in a dummy element so the parser decodes the entities
    If (XMLDom.loadXML("<root>" & inXML & "</root>")) Then
        ' 2nd stage: the decoded text is the real XML, so parse it again
        If (XMLDom.loadXML(XMLDom.firstChild.Text)) Then
            Debug.Print XMLDom.xml ' Use XML here...
        Else
            Debug.Print "2nd stage parse error"
        End If
    Else
        Debug.Print "1st stage parse error"
    End If

    Set XMLDom = Nothing
End Sub
'***

The parsed XML will be written to the debug window. It's not the prettiest or most robust way of achieving this, but it should
work as long as the document format remains fairly static.
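
The first stage of DeXML() is effectively a decode step, so it can also be pulled out into a small reusable function. The following is a sketch under the same assumptions as the code above (a reference to Microsoft XML); the name XmlEntityDecode and the error number are made up for the example.

```vb
' Sketch only: let MSXML decode the entities by parsing the encoded text
' as the content of a dummy element. XmlEntityDecode is a hypothetical name.
Public Function XmlEntityDecode(ByVal encoded As String) As String
    Dim doc As MSXML2.DOMDocument

    Set doc = New MSXML2.DOMDocument
    doc.async = False

    If doc.loadXML("<root>" & encoded & "</root>") Then
        ' The Text property returns the element content with entities resolved
        XmlEntityDecode = doc.documentElement.Text
    Else
        ' Typically an entity XML does not define (e.g. &nbsp;) or a stray "<"
        Err.Raise vbObjectError + 513, "XmlEntityDecode", doc.parseError.reason
    End If

    Set doc = Nothing
End Function
```

For example, Debug.Print XmlEntityDecode("&lt;a&gt;2003&lt;/a&gt;") prints <a>2003</a>. Only the entities XML itself predefines (&lt; &gt; &amp; &quot; &apos;) and numeric references are resolved this way, and whitespace handling of the Text property depends on the preserveWhiteSpace setting, so the result is best treated as input for a second parse rather than as an exact copy of the page text.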
Hope this helps,

Mike


 
Olivier Delrieu

Hi Mike,

Thanks for your answer. Unfortunately this does not help. I am already using
the XML library to parse the data, sorry if I was not clear. I am using the
same kind of code you described :

domdoc.loadXML(sDump)

where sDump is the string coming from the function (I call it dumpURL)
described in :
http://support.microsoft.com/default.aspx?scid=kb;en-us;232194

sDump looks like this :

<html>....
<body> ... blabla...
<pre>
....
&lt;Date-std_year&gt;2003&lt;/Date-std_year&gt;
....
</pre>
....blabla....
</body>
</html>

As you can see, the XML code is between the <pre> tags. It has been encoded by
the server so that the XML tags are not interpreted as HTML tags.

If the "&lt;" and "&gt;" are not decoded, then
domdoc.loadXML(sDump) fails. Everything works fine if I use
something like:

sURL = "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=XML&list_uids=7442"
sDump = dumpUrl(sURL)
sDump = ExtractTheStringBetweenPreTags(sDump)
sDump = HTMLdecode(sDump)
domdoc.loadXML(sDump)
....

If I do not include the home made HTMLdecode() function this does not work
.... I am just in need of an official HTMLdecode function....

Olivier.

 
AlphaGremlin

Have you tried using the WebBrowser control to open the page, then enumerating
all the <pre> tags in it and simply using their innerText value?

AlphaGremlin
 
Larry Serflaten

Olivier Delrieu said:
Hi Mike,

Thanks for your answer. Unfortunately this does not help. I am already using
the XML library to parse the data, sorry if I was not clear. I am using the
same kind of code you described :
sDump = dumpUrl(sURL)
sDump = ExtractTheStringBetweenPreTags(sDump)
sDump = HTMLdecode(sDump)
domdoc.loadXML(sDump)


If you're already using MS's XML package, then why not go in waist deep
and use their WebBrowser control to load the URL and extract the text?

For a terse example, with a WebBrowser (called WB) and Text1 (with
its MultiLine property set) on a form, try out the following code to see if
it will get you close to where you want to be...

Private Sub Form_Load()
    If Text1.MultiLine Then
        WB.Navigate "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=XML&list_uids=7442"
    Else
        MsgBox "Set the textbox MULTILINE property first!"
    End If
End Sub

Private Sub WB_DocumentComplete(ByVal pDisp As Object, URL As Variant)
    Text1.Text = WB.Document.All.Tags("pre").Item(0).innerText
End Sub


HTH
LFS
 
