How to judge whether content type is truly "text/html"?

  • Thread starter: Morgan Cheng

Morgan Cheng

I know that HttpWebRequest.GetResponse() returns an HttpWebResponse.
The response has a ContentType property, but that property simply
reflects the HTTP response header. It is possible that the content is
actually HTML while the ContentType is "image/jpeg".

Is there any effective way to judge whether the response is truly
"text"?
My idea is to read the first several bytes of the response stream
and check whether they are displayable characters. But they could
be in any encoding. Should I try every encoding?
 
The property is not decided by the HTTP response header. It is decided by
the web server and/or the developer who created the web site. The problem
here is that the whole reason for the Content-Type header is to tell the
client what is stored in the stream of bits being sent. Since a stream of
bits is just 1's and 0's, there's no way to tell without it.

However, I have never heard of what you describe happening. If it did,
browsers would not be able to view the content, and whoever created the web
site would know about it very shortly (from the response of the users).

--
HTH,

Kevin Spencer
Microsoft MVP
Software Composer

A watched clock never boils.
 
Kevin Spencer wrote:
KS> The property is not decided by the HTTP response header. It is decided by
KS> the web server and/or the developer who created the web site. The problem
KS> here is that the whole reason for the Content-Type header is to tell the
KS> client what is stored in the stream of bits being sent. Since a stream of
KS> bits is just 1's and 0's, there's no way to tell without it.

Yes, the web server can configure the response MIME type, which ends up in
the HTTP response header. That is my understanding.

KS> However, I have never heard of what you describe happening. If it did,
KS> browsers would not be able to view the content, and whoever created the web
KS> site would know about it very shortly (from the response of the users).

I manually configured one HTML page to be served as "image/jpeg" in IIS6, then
accessed the page from another machine and captured the HTTP traffic with
Fiddler. It shows that the response header is "Content-Type:
image/jpeg". Interestingly, IE still shows the HTML page, while Firefox
cannot display it. It looks like IE does some extra work.
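
The experiment above can be reproduced without IIS. What follows is only a minimal sketch in Python (picked for brevity; the thread itself is about .NET), assuming a throwaway server on localhost. The names `MislabelledHandler` and `fetch_mislabelled` are made up for illustration. It serves an HTML body under a `Content-Type: image/jpeg` header and returns what a client library sees.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

HTML_BODY = b"<html><body><h1>Hello</h1></body></html>"


class MislabelledHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: serves HTML but declares it as image/jpeg."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "image/jpeg")  # deliberately wrong
        self.send_header("Content-Length", str(len(HTML_BODY)))
        self.end_headers()
        self.wfile.write(HTML_BODY)

    def log_message(self, format, *args):
        pass  # suppress per-request logging for the demo


def fetch_mislabelled():
    """Start the server on a free port, fetch once, return (declared type, body)."""
    server = HTTPServer(("127.0.0.1", 0), MislabelledHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_address[1]
        # An empty ProxyHandler keeps any system proxy settings out of the way.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(url, timeout=5) as response:
            return response.headers.get_content_type(), response.read()
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    declared, body = fetch_mislabelled()
    print(declared, body[:20])
```

The client reports the declared type exactly as sent, which matches what Fiddler showed: the header and the body are independent, and any correction has to come from looking at the bytes.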
 
Vadym Stetsyak wrote:
VS> Hello, Morgan!
VS>
VS> MC> I know that HttpWebRequest.GetResponse() generates a HttpWebResponse.
VS> MC> The response has one ContentType property. But the property is just
VS> MC> decided by http response header. It is possible that the content is
VS> MC> actually HTML, while the ContentType is "image/jpeg".
VS>
VS> If you're talking to a "well-behaved" web server, then it gives you content
VS> from the set you've specified in the Accept header.

I agree.
I just happen to have to handle some abnormal situations :p

VS> MC> Is there any effective way to judge whether the response type is truly
VS> MC> "text"?
VS> MC> I have an idea to read the first several bytes of the response stream;
VS> MC> and check whether they are real displayable characters. But, they can
VS> MC> be any kind of Encoding. Should I try all kinds of Encoding?
VS>
VS> IMO there is no good way to verify whether it is "text".
VS> As a workaround you can check the response content for the subset of printable
VS> characters...

The problem is the encoding.
However, HTML markup itself is plain English, which falls in the printable
ASCII range (33-126) in most encodings.
Perhaps trying to parse a few HTML tags would work.
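
The two ideas in this post (count printable characters, then look for an HTML tag) can be sketched roughly as below. This is an illustration in Python, not anything from the .NET framework. The helper names, the marker list, and the 0.95 threshold are all assumptions chosen for the example. It handles a byte-order mark if one is present and otherwise treats the prefix as single-byte data, which is enough to find ASCII markup in most ASCII-compatible encodings.

```python
import codecs

# Byte-order marks and the encodings they imply (assumed sufficient for a sketch).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Hypothetical list of markers that commonly open an HTML document.
_HTML_MARKERS = ("<!doctype html", "<html", "<head", "<body", "<title", "<!--")


def decode_prefix(prefix: bytes) -> str:
    """Decode via the BOM if there is one, else map each byte to one character."""
    for bom, encoding in _BOMS:
        if prefix.startswith(bom):
            return prefix[len(bom):].decode(encoding, errors="replace")
    return prefix.decode("latin-1")  # never fails; ASCII bytes stay ASCII


def looks_like_text(prefix: bytes, threshold: float = 0.95) -> bool:
    """True if almost no characters are control characters."""
    text = decode_prefix(prefix)
    if not text:
        return False
    # Tab, CR, LF, printable ASCII, and anything >= 0x80 count as acceptable,
    # so non-English pages in multi-byte encodings are not rejected.
    ok = sum(1 for ch in text if ch in "\t\r\n" or " " <= ch <= "~" or ch >= "\x80")
    return ok / len(text) >= threshold


def looks_like_html(prefix: bytes) -> bool:
    """True if the prefix, after leading whitespace, starts with a known marker."""
    text = decode_prefix(prefix).lstrip().lower()
    return text.startswith(_HTML_MARKERS)
```

A prefix of a few hundred bytes from the response stream is usually enough input. The sketch will miss UTF-16 content that has no byte-order mark, and a page that opens with some other tag, so it is a heuristic rather than a guarantee.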
 
Hi Morgan,

Your experience underscores my point. While it is possible to manually (or
perhaps unintentionally) change the Content-Type header, any web site that
did would find out about it very quickly, because there are many different
browsers in use out there, and the site owners would hear about the problem
and fix it.

It isn't productive to imagine the most remote of possibilities and handle
them gracefully. If one did, one would never finish much of anything.
Sometimes the most graceful thing to do is to handle the error as an error
and move on. My guess is that you would never run into the issue at all.

--
HTH,

Kevin Spencer
Microsoft MVP
Software Composer

A watched clock never boils.


 
Kevin said:
Hi Morgan,

Your experience underscores my point. While it is possible to manually (or
perhaps unintentionally) change the Content-Type header, any web site that
did would find out about it very quickly, because there are many different
browsers in use out there, and the site owners would hear about the problem
and fix it.

It isn't productive to imagine the most remote of possibilities and handle
them gracefully. If one did, one would never finish much of anything.
Sometimes the most graceful thing to do is to handle the error as an error
and move on. My guess is that you would never run into the issue at all.

I happened to come across a function, FindMimeFromData, in urlmon.dll. It
determines the MIME type from the content of a buffer, and it works.

http://msdn.microsoft.com/workshop/networking/moniker/overview/appendix_a.asp
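
For anyone who wants to try FindMimeFromData, here is a rough sketch of calling it through Python's `ctypes` (a .NET program would use P/Invoke instead). The wrapper name `find_mime_from_data` is made up, the 256-byte prefix is an assumption about how much data is worth passing, and releasing the result with `CoTaskMemFree` follows common interop examples rather than anything stated in this thread. It only does real work on Windows and returns `None` elsewhere.

```python
import ctypes
import sys
from typing import Optional

SNIFF_BYTES = 256  # assumed prefix size for sniffing


def find_mime_from_data(data: bytes, proposed: Optional[str] = None) -> Optional[str]:
    """Ask urlmon.dll for a MIME type; None off Windows or if the call fails."""
    if sys.platform != "win32":
        return None

    urlmon = ctypes.windll.urlmon
    ole32 = ctypes.windll.ole32

    # HRESULT FindMimeFromData(LPBC, LPCWSTR url, LPVOID buf, DWORD size,
    #                          LPCWSTR proposed, DWORD flags,
    #                          LPWSTR *mimeOut, DWORD reserved)
    urlmon.FindMimeFromData.argtypes = [
        ctypes.c_void_p, ctypes.c_wchar_p, ctypes.c_char_p, ctypes.c_uint32,
        ctypes.c_wchar_p, ctypes.c_uint32,
        ctypes.POINTER(ctypes.c_void_p), ctypes.c_uint32,
    ]
    urlmon.FindMimeFromData.restype = ctypes.c_long
    ole32.CoTaskMemFree.argtypes = [ctypes.c_void_p]
    ole32.CoTaskMemFree.restype = None

    prefix = data[:SNIFF_BYTES]
    mime_out = ctypes.c_void_p()
    hresult = urlmon.FindMimeFromData(
        None, None, prefix, len(prefix), proposed, 0, ctypes.byref(mime_out), 0
    )
    if hresult != 0 or not mime_out.value:  # 0 is S_OK
        return None
    try:
        return ctypes.wstring_at(mime_out.value)
    finally:
        ole32.CoTaskMemFree(mime_out)  # assumption: buffer is COM-allocated
```

Passing the server's declared type as `proposed` (for example `"image/jpeg"`) lets the function weigh the header against the bytes, which is presumably close to what IE does when it still renders the mislabelled page.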
 
