Get url for pdf file from AxSHDocVw.AxWebBrowser

  • Thread starter: LN Mike

LN Mike

VB.Net 2003
My "Collect PDF" app collects PDF files from several web sites. Most of the
web sites provide a src tag in the web page before the PDF is displayed in my
web control (AxSHDocVw.AxWebBrowser). Once my app has the URL, it downloads
the file and stores it.

The problem: some web sites don't provide a web page with a src tag that
has the URL of the PDF file. So how do I get the URL of the PDF file if the
web site doesn't give me a web page with the src tag? Also, I run 9
"Collect PDF" apps on each PC, so reading the cache is not a good idea.
 
This is not easy, because a web page is nothing more than a DOM document,
which is what the MSHTML classes are for.

It is not simple and needs a lot of casting (it is easier to start with
Option Strict Off and set it to On afterwards to correct the code).

http://msdn.microsoft.com/en-us/library/aa741317.aspx

You should not add an Imports statement for it, but fully qualify the names
every time, because everything becomes terribly slow in VB.Net 2003 when the
import is used.
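
As a rough illustration of the casting this involves, the sketch below reads the src attribute of every embed element through fully qualified MSHTML interfaces. It assumes a project reference to Microsoft.mshtml and that the PDF is referenced from an embed element; the thread does not say which tag carries the src, and the module and function names are made up for this example.

```vbnet
Imports System
Imports System.Collections

Module SrcScan
    ' Hypothetical helper: returns the src attribute of every <embed>
    ' element, or an empty list when the document is not an HTML document.
    Function FindEmbedSources(ByVal document As Object) As ArrayList
        Dim sources As New ArrayList()

        ' A full-control PDF is not an HTML DOM, so this check fails for it.
        If Not TypeOf document Is mshtml.IHTMLDocument3 Then
            Return sources
        End If

        Dim doc As mshtml.IHTMLDocument3 = CType(document, mshtml.IHTMLDocument3)
        Dim element As mshtml.IHTMLElement
        For Each element In doc.getElementsByTagName("embed")
            ' A missing attribute may come back as Nothing or DBNull.
            Dim src As Object = element.getAttribute("src", 0)
            If Not src Is Nothing AndAlso Not TypeOf src Is DBNull Then
                sources.Add(src.ToString())
            End If
        Next
        Return sources
    End Function
End Module
```

From a DocumentComplete handler it could be called as SrcScan.FindEmbedSources(WebDisplay.Document). An empty list means the document is not HTML or contains no embed element.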

Cor
 
Hello Mike,

When you say "some web sites don't provide the web page with a src tag", do
you mean those "web pages" actually just let the Acrobat reader take over
the entire browser control (the PDF document fills the whole control)? Or do
those web pages use other means to embed a PDF file as a portion of the HTML
page?

Could you give me a specific sample of the problem you ran into so I can
better understand the situation and see if we can come up with a solution?

Thanks,

Jie Wang (e-mail address removed)

Microsoft Online Community Support

Delighting our customers is our #1 priority. We welcome your comments and
suggestions about how we can improve the support we provide to you. Please
feel free to let my manager know what you think of the level of service
provided. You can send feedback directly to my manager at:
(e-mail address removed).

==================================================
Get notification to my posts through email? Please refer to
http://msdn.microsoft.com/en-us/subscriptions/aa948868.aspx#notifications.

Note: MSDN Managed Newsgroup support offering is for non-urgent issues
where an initial response from the community or a Microsoft Support
Engineer within 2 business days is acceptable. Please note that each follow
up response may take approximately 2 business days as the support
professional working with you may need further investigation to reach the
most efficient resolution. The offering is not appropriate for situations
that require urgent, real-time or phone-based interactions. Issues of this
nature are best handled working with a dedicated Microsoft Support Engineer
by contacting Microsoft Customer Support Services (CSS) at
http://msdn.microsoft.com/en-us/subscriptions/aa948874.aspx
==================================================
This posting is provided "AS IS" with no warranties, and confers no rights.
 
Some of the "web pages" actually just let the Acrobat reader take over the
entire browser control. I put a breakpoint in my browser control's
"DocumentComplete" event, so I'm able to see each and every page that comes
through. However, with the web sites where I can't get a src tag, I can't open
the document at all. It seems DocumentComplete sends me a PDF file, not an
HTML file, so I can't see outerhtml.

WebDisplay.Document.All(0).outerhtml gives me the following error message:

Run-time exception thrown : System.MissingMemberException - Public member
'All' on type 'IAcroAXDocShim' not found.

But with the web sites that do work, I'm able to see the html code that has
the src tag.
 
I'm not sure, but maybe when WebDisplay.Document.All(0).outerhtml doesn't
work, the document is the PDF file. So how do I copy the object (which I think
is the PDF file) to a local folder?

 
If the browser control navigates directly to a PDF file, there will not be
a src tag to extract the file name. Instead, we can use the
DocumentComplete event to trap the PDF URL:

Private Sub Form1_Load(ByVal sender As System.Object, _
        ByVal e As System.EventArgs) Handles MyBase.Load
    ' Navigate straight to a PDF so that it fills the whole control.
    AxWebBrowser1.Navigate("http://research.microsoft.com/en-us/um/people/cmbishop/prml/bishop-prml-sample.pdf")
End Sub

Private Sub AxWebBrowser1_DocumentComplete(ByVal sender As System.Object, _
        ByVal e As AxSHDocVw.DWebBrowserEvents2_DocumentCompleteEvent) _
        Handles AxWebBrowser1.DocumentComplete
    ' e.uRL is the URL the control has just finished loading.
    MessageBox.Show(e.uRL.ToString())
End Sub

With the code above, a MessageBox pops up with the URL of the PDF file.

Will this then help you to download the PDF file?

Regards,

Jie Wang (e-mail address removed)

 
Hi Mike,

Any updates on this issue?

If you have any further questions regarding this issue, please kindly let
me know.

Thanks,

Jie Wang (e-mail address removed)

 
No, it didn't work. The DocumentComplete URL is that of the web page, not of
the PDF file. For the web sites that work, I can see that the PDF file has a
URL like "... cgi-bin/show_temp.pl?file=pdf18445142694797&type=application/pdf",
but the web sites that don't work only show web page URLs like
"... doc1/038110817935".

Due to company policy I can't list the full URLs, but I did list the endings
of the URLs, which are where they differ.
 
Hi,

From the form of the URL we can't really see why some work while others do
not.

However, I can think of two ways a server can send a PDF file to the client
so that it is displayed in the entire browser control:

1. Write the PDF stream directly to the response stream, while setting the
content type to application/pdf.

or, it can

2. Send an HTTP redirect code to the client, so the client is responsible
for re-sending the request to the new URL to get the *real* PDF file.

Now I suspect the "not working" scenario could be caused by the second way.

I'll try to set up an environment to test these two scenarios and see what
exactly happens.
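
The two delivery paths above could be told apart with a request that does not follow redirects, as in this sketch. The module and function names are hypothetical, and a site that needs a logged-in session may answer a request made outside the browser control differently, so treat the result as a hint only.

```vbnet
Imports System
Imports System.Net

Module PdfProbe
    ' Turns a status code and two response headers into a short description.
    Function Classify(ByVal statusCode As Integer, ByVal contentType As String, _
            ByVal location As String) As String
        ' Any 3xx status is treated as a redirect here.
        If statusCode >= 300 AndAlso statusCode < 400 Then
            Return "redirect to " & location
        End If
        If Not contentType Is Nothing AndAlso _
                contentType.ToLower().StartsWith("application/pdf") Then
            Return "PDF written directly to the response"
        End If
        Return "other content: " & contentType
    End Function

    ' Requests the URL without following redirects and classifies the answer.
    ' Error statuses (4xx, 5xx) raise a WebException, which is not handled here.
    Function Probe(ByVal url As String) As String
        Dim request As HttpWebRequest = CType(WebRequest.Create(url), HttpWebRequest)
        request.AllowAutoRedirect = False
        Dim response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
        Try
            Return Classify(CInt(response.StatusCode), response.ContentType, _
                response.Headers("Location"))
        Finally
            response.Close()
        End Try
    End Function
End Module
```

Calling PdfProbe.Probe with one working and one non-working URL and comparing the two descriptions would show whether a redirect is involved.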

Could you let me know what happens if you manually navigate to one of the
non-working URLs, like "... doc1/038110817935", in an IE browser? Do you get
the PDF file?

Thanks,

Jie Wang (e-mail address removed)

 
If I manually navigate to that URL in an IE browser, I get the login screen,
because a new session has been created.

If I redirect the web control to the URL "... doc1/038110817935", it
shows the web page, not the PDF file. This occurs for both the web sites that
work and the ones that don't.

To clarify: for the web sites that work, DocumentComplete captures
/doc1/038110817935
and cgi-bin/show_temp.pl?file=pdf18445142694797&type=application/pdf
For the web sites that don't work, DocumentComplete captures only
/doc1/038110817935
 
Hi,

I've found another way to get the URL of the PDF file currently being
displayed "full screen" in the web browser control.

When the PDF document is displayed "full screen" (or shall we say
"full control") in the web browser control, the Document property
returns an IAcroAXDocShim interface instead of an HTML DOM interface. This
matches the error description in your second post.

So all we need to do is check the type of the Document property. If it is
IAcroAXDocShim, we can read the src property on that interface to get
the URL of the PDF file.

The code looks like this (suppose I have a WebBrowser control named
AxWebBrowser1):

If (TypeOf AxWebBrowser1.Document Is AcroPDFLib.IAcroAXDocShim) Then
    ' this is a full screen PDF file
    Dim pdfSrc As String
    pdfSrc = CType(AxWebBrowser1.Document, AcroPDFLib.IAcroAXDocShim).src
Else
    ' this is a normal HTML page, process the page.
End If

To access the IAcroAXDocShim interface, you need to add a COM reference
named "Adobe Acrobat Browser Control Type Library 1.0" to your project.

What I'm not sure about is whether we can download the file in every case
even once we have the URL. For example, if the page request requires an
authenticated session, this approach may still fail. I tried to save the PDF
file via IAcroAXDocShim, but could not find a way to do so. This control is
made by Adobe and I was not able to find documentation on how to use it on
their website.

Anyway, please let me know if the IAcroAXDocShim can help you get the URL
first. Then we'll think of a way to get the file via the URL. You can also
try Adobe's online forum to ask questions related to the IAcroAXDocShim
interface to get more information.
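
One possible way to fetch the file once the URL is known is sketched below. It is only an assumption-laden outline: the names are hypothetical, the cookie string is assumed to be obtained by the caller from the authenticated browser session (how to do that is not covered here), and sending it may still not satisfy every site. Because a login page can come back instead of the document, the sketch also checks that the saved file starts with the "%PDF" signature.

```vbnet
Imports System
Imports System.IO
Imports System.Net
Imports System.Text

Module PdfDownload
    ' True when the file starts with the "%PDF" signature of a PDF document.
    Function LooksLikePdf(ByVal path As String) As Boolean
        Dim stream As FileStream = File.OpenRead(path)
        Try
            Dim header(3) As Byte ' four bytes, indexes 0 to 3
            Dim count As Integer = stream.Read(header, 0, header.Length)
            Return count = 4 AndAlso _
                Encoding.ASCII.GetString(header, 0, count) = "%PDF"
        Finally
            stream.Close()
        End Try
    End Function

    ' Downloads pdfUrl to targetPath, optionally sending a Cookie header,
    ' and reports whether the saved file looks like a PDF.
    Function SavePdf(ByVal pdfUrl As String, ByVal cookieHeader As String, _
            ByVal targetPath As String) As Boolean
        Dim client As New WebClient()
        Try
            ' cookieHeader is an assumption: the caller supplies the session cookies.
            If Not cookieHeader Is Nothing AndAlso cookieHeader.Length > 0 Then
                client.Headers.Add("Cookie", cookieHeader)
            End If
            client.DownloadFile(pdfUrl, targetPath)
        Finally
            client.Dispose()
        End Try
        Return LooksLikePdf(targetPath)
    End Function
End Module
```

A False result from SavePdf suggests the server returned something other than the document, for example a login page, and the saved file should be discarded.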

Best regards,

Jie Wang (e-mail address removed)

 