HttpWebResponse.GetResponseStream returns incomplete stream

ThePants · Aug 22, 2006

Hi, given the following code, I've been successful in grabbing pages
for parsing, but for a certain page template (containing a particular
piece of code) the stream always ends right after that code. If you try
this with just about any type of URL (including URLs from the same site
without that piece of code) it works fine, but with URLs containing the
piece of code, the stream is returned only up to that point.

Dim sURL As String
' Works (along with 1000's of other sites/templates/servers):
sURL = "http://www.msnbc.msn.com/id/14191819"
' Doesn't work:
sURL = "http://www.time.com/time/business/article/0,8599,1226309,00.html"

Dim oSR As StreamReader = getPageContent(sURL)
' If you do oSR.ReadToEnd here, you'll see the page broken at the wrong place

Private Function getPageContent(ByVal URL As String) As StreamReader
    Dim oResponse As HttpWebResponse = Nothing
    Dim oSR As StreamReader = Nothing
    Dim oRequest As HttpWebRequest
    Try
        oRequest = WebRequest.Create(URL)
        oResponse = CType(oRequest.GetResponse, HttpWebResponse)
        oSR = New StreamReader(oResponse.GetResponseStream())
    Catch ex As Exception

    End Try
    Return oSR
End Function

The stream for the time.com pages ends *every time* right after:
<strong>SUBSCRIBE TO TIME MAGAZINE FOR JUST $1.99</strong></a>

.... and the number of characters varies depending on the story, but
each time the "Subscribe" link is there, the response stream dies right
after it. If you view the source of those pages, you'll see a single
blank character, and then an html comment ( ).

So I'm stuck. Is it possible that the single character between the </a>
and the comment is breaking the stream? Could it be the server thinking
(correctly) that I'm parsing it and choosing that as the location each
time to cut me off? I've played with several properties of
HttpWebRequest, including spoofing a UserAgent, setting KeepAlive to
true, SendChunked, and ProtocolVersion, but nothing I do keeps this
from happening.

Any help would be appreciated.
Thanks!
STA

Joerg Jooss · Aug 24, 2006

That's a nasty one. At the point where the text is being truncated, there
is a NUL (0x00) character in the page. It's actually the Encoding object
that breaks here, not the response stream. Unfortunately, specifying a
DecoderFallback doesn't work; that seems to be a bug. As a workaround,
buffer the entire response in a MemoryStream, remove all NUL characters,
and decode the buffer with an Encoding instance.
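
One way to check this diagnosis is to look at the raw bytes rather than the decoded text. A minimal sketch, assuming the response body has already been read into a byte array; the names `NulFinder` and `FirstNulOffset` are made up for illustration and are not part of any library:

```csharp
using System;

static class NulFinder
{
    // Returns the offset of the first NUL (0x00) byte in "body",
    // or -1 if the array contains none.
    public static int FirstNulOffset(byte[] body)
    {
        return Array.IndexOf(body, (byte)0);
    }
}
```

If the offset it reports lines up with the place where the decoded text stops, the NUL byte is the likely culprit.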

Cheers,

ThePants · Aug 25, 2006

Thanks very much for the reply, Joerg. This did the trick! Thank you
very very much for your help.

jake.oh · Aug 28, 2006

hi...

could you show me how you did this?

best regards

Joerg Jooss · Aug 28, 2006

OK, assuming you have a byte array "bytes" containing the entire response,
all you need to do is:

using (MemoryStream buffer = new MemoryStream(bytes.Length)) {
    foreach (byte b in bytes) {
        if (b > 0x0) {
            buffer.WriteByte(b);
        }
    }
    bytes = buffer.ToArray();
}

// Assuming UTF-8 encoding here...
string response = Encoding.UTF8.GetString(bytes);

Cheers,

jake.oh · Aug 29, 2006

Hi,
I can't tell you how much I appreciate your help.
But I couldn't figure out how to capture the stream (from WebRequest)
in a byte array.
Could you help me out with this too?

best regards ^^

Joerg Jooss · Aug 29, 2006

That's System.IO 101 ;-)

Here's a method that sends a HttpWebRequest and copies its response to an
arbitrary Stream object. If you pass a MemoryStream as "outStream", you'll
get what you want.

private void SendRequest(HttpWebRequest request, Stream outStream) {
    Debug.Assert(outStream.CanWrite);

    using (HttpWebResponse response = (HttpWebResponse) request.GetResponse())
    using (Stream responseStream = response.GetResponseStream()) {
        byte[] buffer = new byte[0x1000];
        int bytes;
        while ((bytes = responseStream.Read(buffer, 0, buffer.Length)) > 0) {
            outStream.Write(buffer, 0, bytes);
        }
    }
}
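
For anyone who wants the copy loop and the NUL filter from earlier in this thread in one place, here is a rough sketch of how they might be combined. The class and method names (`PageFetcher`, `ReadAllWithoutNuls`, `FetchPage`) are hypothetical, and UTF-8 is only an assumption; use whatever charset the server actually declares.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class PageFetcher
{
    // Reads "input" to the end and returns its bytes, dropping NUL (0x00) bytes.
    public static byte[] ReadAllWithoutNuls(Stream input)
    {
        using (MemoryStream buffer = new MemoryStream())
        {
            byte[] chunk = new byte[0x1000];
            int read;
            while ((read = input.Read(chunk, 0, chunk.Length)) > 0)
            {
                for (int i = 0; i < read; i++)
                {
                    if (chunk[i] != 0)
                    {
                        buffer.WriteByte(chunk[i]);
                    }
                }
            }
            return buffer.ToArray();
        }
    }

    // Assumption: the page is UTF-8 encoded.
    public static string FetchPage(string url)
    {
        HttpWebRequest request = (HttpWebRequest) WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse) request.GetResponse())
        using (Stream responseStream = response.GetResponseStream())
        {
            return Encoding.UTF8.GetString(ReadAllWithoutNuls(responseStream));
        }
    }
}
```

Dropping 0x00 bytes like this is only safe for single-byte encodings and UTF-8. In UTF-16 or UTF-32, zero bytes are part of ordinary characters, so there the NUL characters would have to be removed after decoding instead.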

Cheers,

ThePants · Aug 29, 2006

Here's my Function in vb.net. Probably not terribly efficient, but I
needed to copy the string back to a memorystream as output. Thanks
again to Joerg for the suggestion.

Private Function getPageContent(ByVal URL As String) As MemoryStream
    Dim oResponse As HttpWebResponse = Nothing
    Dim oSB As New StringBuilder
    Dim oRequest As HttpWebRequest
    Try
        oRequest = WebRequest.Create(URL)
        oRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 2.0.50727; .NET CLR 1.1.4322)"
        oResponse = CType(oRequest.GetResponse, HttpWebResponse)
        Dim oStreamResponse As Stream = oResponse.GetResponseStream()
        Dim oStreamRead As New StreamReader(oStreamResponse)
        Dim readBuff(256) As [Char]
        Dim nCount As Integer = oStreamRead.Read(readBuff, 0, 256)
        While nCount > 0
            Dim outputData As New [String](readBuff, 0, nCount)
            outputData = Replace(outputData, vbNullChar, "")
            oSB.Append(outputData)
            nCount = oStreamRead.Read(readBuff, 0, 256)
        End While
        oStreamResponse.Close()
        oStreamRead.Close()
    Catch ex As Exception

    End Try
    Dim oWorkStream As New MemoryStream
    Dim oEnc As Encoding = Encoding.GetEncoding(1252)
    Dim oSW1 As New StreamWriter(oWorkStream, oEnc)
    oSW1.Write(oSB.ToString)
    oSW1.Flush()
    oWorkStream.Position = 0
    Return oWorkStream
End Function
