HttpWebResponse.GetResponseStream returns incomplete stream

ThePants

Hi, given the following code, I've been successful in grabbing pages
for parsing, but for a certain page template (containing a particular
piece of code) the stream always ends right after that code. If you try
this with just about any type of URL (including URLs from the same site
without that piece of code) it works fine, but with URLs containing the
piece of code, the stream is returned only up to that point.

Dim sURL As String
' Works (along with 1000's of other sites/templates/servers):
sURL = "http://www.msnbc.msn.com/id/14191819"
' Doesn't work:
sURL = "http://www.time.com/time/business/article/0,8599,1226309,00.html"

Dim oSR As StreamReader = getPageContent(sURL)
' If you do oSR.ReadToEnd here, you'll see the page broken at the wrong place

Private Function getPageContent(ByVal URL As String) As StreamReader
    Dim oResponse As HttpWebResponse = Nothing
    Dim oSR As StreamReader = Nothing
    Dim oRequest As HttpWebRequest
    Try
        oRequest = CType(WebRequest.Create(URL), HttpWebRequest)
        oResponse = CType(oRequest.GetResponse, HttpWebResponse)
        oSR = New StreamReader(oResponse.GetResponseStream())
    Catch ex As Exception
        ' Exceptions are swallowed here; oSR stays Nothing on failure
    End Try
    Return oSR
End Function

The stream for the time.com pages ends *every time* right after:
<strong>SUBSCRIBE TO TIME MAGAZINE FOR JUST $1.99</strong></a>

.... and the number of characters varies depending on the story, but
each time the "Subscribe" link is there, the response stream dies right
after it. If you view the source of those pages, you'll see a single
blank character, and then an html comment ( <!--cm_searchtext end-->).

So I'm stuck. Is it possible that the single character between the </a>
and the comment is breaking the stream? Could it be the server thinking
(correctly) that I'm parsing it and choosing that as the location each
time to cut me off? (Changing the UserAgent property of the
HttpWebRequest doesn't affect the outcome at all.) I've played with
several properties of HttpWebRequest, including spoofing a UserAgent,
setting KeepAlive to true, SendChunked, and ProtocolVersion... but
nothing I do seems to keep this from happening.

Any help would be appreciated.
Thanks!
STA
 
Joerg Jooss

That's a nasty one. At the point where the text is being truncated, there
is a NULL (0x00) character in the page. It's actually the Encoding object
that breaks here, not the response stream. Unfortunately, specifying a DecoderFallback
doesn't work, which seems to be a bug. As a workaround, buffer the entire response
in a MemoryStream, remove all NULL characters, and decode the buffer with an
Encoding instance.

Cheers,
 
ThePants

Thanks very much for the reply, Joerg. This did the trick! Thank you
very very much for your help.
 
Joerg Jooss

Thus wrote (e-mail address removed),
hi...

could you show me how you did this?

OK, assuming you have a byte array "bytes" containing the entire response,
all you need to do is:

using(MemoryStream buffer = new MemoryStream(bytes.Length)) {
    foreach(byte b in bytes) {
        if(b > 0x0) {
            buffer.WriteByte(b);
        }
    }
    bytes = buffer.ToArray();
}

// Assuming UTF-8 encoding here...
string response = Encoding.UTF8.GetString(bytes);
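For anyone who wants to try the filtering step in isolation, here is a minimal sketch of the same idea as a small console program. The class name, the helper name StripNulBytes and the sample markup are made up for illustration and are not from the thread. Note that dropping 0x00 bytes is only safe for single-byte encodings and UTF-8; in UTF-16 or UTF-32 text, zero bytes are part of ordinary characters.

```csharp
using System;
using System.Text;

static class NulStripDemo
{
    // Hypothetical helper: returns a copy of the input with every 0x00 byte removed.
    public static byte[] StripNulBytes(byte[] bytes)
    {
        return Array.FindAll(bytes, b => b != 0);
    }

    static void Main()
    {
        // Made-up sample: a fragment of markup with a stray NUL byte in the middle.
        byte[] raw = Encoding.UTF8.GetBytes("</a>\0<!--cm_searchtext end-->");
        byte[] clean = StripNulBytes(raw);

        // Assuming UTF-8 here, as in the snippet above.
        Console.WriteLine(Encoding.UTF8.GetString(clean));
        // prints: </a><!--cm_searchtext end-->
    }
}
```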

Cheers,
 
jake.oh

hi..
i have no words to show you how much i am appreciating your help.
but, i couldn't figure out how to capture the stream (from webrequest)
in byte arrays
could you help me out with this too?

best regards ^^
 
Joerg Jooss

That's System.IO 101 ;-)

Here's a method that sends an HttpWebRequest and copies its response to an
arbitrary Stream object. If you pass a MemoryStream as "outStream", you'll
get what you want.

// Needs: using System.Diagnostics; using System.IO; using System.Net;
private void SendRequest(HttpWebRequest request, Stream outStream) {
    Debug.Assert(outStream.CanWrite);

    using(HttpWebResponse response = (HttpWebResponse) request.GetResponse())
    using(Stream responseStream = response.GetResponseStream()) {
        byte[] buffer = new byte[0x1000];
        int bytes;
        while((bytes = responseStream.Read(buffer, 0, buffer.Length)) > 0) {
            outStream.Write(buffer, 0, bytes);
        }
    }
}

Cheers,
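
Putting the pieces of this thread together, a rough end-to-end sketch might look like the following. It uses the same read loop as SendRequest above, but takes any readable Stream so it can be run without a network connection. A MemoryStream with made-up content stands in for response.GetResponseStream(). The class and method names are hypothetical, and UTF-8 is an assumption; a real page may use a different encoding.

```csharp
using System;
using System.IO;
using System.Text;

static class BufferAndDecodeDemo
{
    // Copies everything from the input stream into a byte array,
    // reading in 4 KB chunks until Read returns 0 (end of stream).
    public static byte[] ReadAllBytes(Stream input)
    {
        using (MemoryStream buffer = new MemoryStream())
        {
            byte[] chunk = new byte[0x1000];
            int read;
            while ((read = input.Read(chunk, 0, chunk.Length)) > 0)
            {
                buffer.Write(chunk, 0, read);
            }
            return buffer.ToArray();
        }
    }

    // Drops NUL (0x00) bytes, then decodes what is left with the given encoding.
    public static string DecodeWithoutNuls(byte[] bytes, Encoding encoding)
    {
        byte[] clean = Array.FindAll(bytes, b => b != 0);
        return encoding.GetString(clean);
    }

    static void Main()
    {
        // Stand-in for response.GetResponseStream(); the content is invented.
        byte[] fake = Encoding.UTF8.GetBytes("before\0after");
        using (Stream responseStream = new MemoryStream(fake))
        {
            string text = DecodeWithoutNuls(ReadAllBytes(responseStream), Encoding.UTF8);
            Console.WriteLine(text); // prints: beforeafter
        }
    }
}
```

With a real request, the bytes would come from passing a MemoryStream to SendRequest and calling ToArray() on it afterwards.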
 
ThePants

Here's my Function in vb.net. Probably not terribly efficient, but I
needed to copy the string back to a memorystream as output. Thanks
again to Joerg for the suggestion.

Private Function getPageContent(ByVal URL As String) As MemoryStream
    Dim oResponse As HttpWebResponse = Nothing
    Dim oSB As New StringBuilder
    Dim oRequest As HttpWebRequest
    Try
        oRequest = CType(WebRequest.Create(URL), HttpWebRequest)
        oRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 2.0.50727; .NET CLR 1.1.4322)"
        oResponse = CType(oRequest.GetResponse, HttpWebResponse)
        Dim oStreamResponse As Stream = oResponse.GetResponseStream()
        Dim oStreamRead As New StreamReader(oStreamResponse)
        Dim readBuff(256) As [Char]
        Dim nCount As Integer = oStreamRead.Read(readBuff, 0, 256)
        While nCount > 0
            ' Strip NUL characters from each chunk before appending it
            Dim outputData As New [String](readBuff, 0, nCount)
            outputData = Replace(outputData, vbNullChar, "")
            oSB.Append(outputData)
            nCount = oStreamRead.Read(readBuff, 0, 256)
        End While
        oStreamResponse.Close()
        oStreamRead.Close()
    Catch ex As Exception
        ' Exceptions are swallowed; whatever was read so far is returned
    End Try
    ' Write the cleaned text back out as a Windows-1252 encoded stream
    Dim oWorkStream As New MemoryStream
    Dim oEnc As Encoding = Encoding.GetEncoding(1252)
    Dim oSW1 As New StreamWriter(oWorkStream, oEnc)
    oSW1.Write(oSB.ToString)
    oSW1.Flush()
    oWorkStream.Position = 0
    Return oWorkStream
End Function
 
