HttpWebRequest not working with Wikipedia

Mugunth

I'm trying to use the HttpWebRequest class to retrieve a web page.
Here is the code:

string google = "http://www.google.com.sg/search?hl=en&btnI=I'm+feeling+Lucky&q=";
string wikipedia = "http://en.wikipedia.org/wiki/Special:Search?fulltext=Search&search=";

string website = wikipedia; // wikipedia does not work, google works

string query = textBoxUserQuery.Text;

// prepare the web page we will be asking for
HttpWebRequest request =
(HttpWebRequest)WebRequest.Create(website + query);

// execute the request
HttpWebResponse response = (HttpWebResponse)
request.GetResponse();

// we will read data via the response stream
Stream resStream = response.GetResponseStream();

Somehow, when I use Google I get a response, whereas when I use
Wikipedia I get an HTTP error stating:
The remote server returned an error: (403) Forbidden.

The exception's Status is "System.Net.WebExceptionStatus.ProtocolError".

However, I am able to retrieve a page like http://en.wikipedia.org/wiki/Main_Page;
it is only the search page that I cannot access.

Am I missing something? Please help.

Mugunth
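
A note on reading the error described above: WebExceptionStatus.ProtocolError means the server did answer, just with an error status, so the 403 and its description can be read from the exception's Response property. The following is only a minimal sketch of that idea; the class name StatusProbe and the helper Describe are made up for illustration, and the default URL is simply the search URL from the post.

```csharp
using System;
using System.Net;

class StatusProbe
{
    // Summarize a WebException: the HTTP status if the server replied,
    // otherwise the transport-level failure (timeout, DNS failure, ...).
    public static string Describe(WebException ex)
    {
        HttpWebResponse error = ex.Response as HttpWebResponse;
        if (ex.Status == WebExceptionStatus.ProtocolError && error != null)
        {
            return "HTTP " + (int)error.StatusCode + " " + error.StatusDescription;
        }
        return "No HTTP response: " + ex.Status;
    }

    static void Main(string[] args)
    {
        // Default URL taken from the post; pass another URL as the first argument.
        string url = args.Length > 0
            ? args[0]
            : "http://en.wikipedia.org/wiki/Special:Search?fulltext=Search&search=Microsoft";
        try
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                Console.WriteLine("HTTP " + (int)response.StatusCode);
            }
        }
        catch (WebException ex)
        {
            // GetResponse() throws for 4xx/5xx answers instead of returning them.
            Console.WriteLine(Describe(ex));
        }
    }
}
```

Keeping Describe separate from the request makes it easy to log the same summary from any place that catches a WebException.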
 
Mugunth

I've posted the complete code in my prev post.
It's a console app.

string google = "http://www.google.com.sg/search?hl=en&btnI=I'm+feeling+Lucky&q=";
string wikipedia = "http://en.wikipedia.org/wiki/Special:Search?fulltext=Search&search=";

string website = wikipedia; // wikipedia does not work, google works

string query = "Microsoft";

// prepare the web page we will be asking for
HttpWebRequest request =
(HttpWebRequest)WebRequest.Create(website + query);

// execute the request
HttpWebResponse response = (HttpWebResponse)
request.GetResponse();

// we will read data via the response stream
Stream resStream = response.GetResponseStream();

The request.GetResponse() call throws an exception when I use the
Wikipedia search URL, but runs fine and returns an HTML page when I use
Google.


Any help is appreciated,
Mugunth
 
Jon Skeet [C# MVP]

Mugunth said:
I've posted the complete code in my prev post.
It's a console app.

Your previous post contained a reference to "textBoxUserQuery.Text"
which doesn't sound like a console app.

See http://pobox.com/~skeet/csharp/incomplete.html

If it doesn't start with using directives and a class declaration, it's
unlikely to be complete.

Try cutting and pasting what you've posted into a brand new text file
and compile it. It won't work.
 
Marc Gravell

I can reproduce the 403... at the end of the day, if they want to
prevent this type of access that is their prerogative?

You could probably go to town trying to spoof a standard request, but
I suspect you might be in violation of their policies (I haven't
checked).

Alternatively, host in a WebBrowser (which is shdocvw), or search for
a *supported* search API / web-service

Marc
 
Nicholas Paldino [.NET/C# MVP]

This action is disallowed by Wikipedia. If you check the robots.txt
file:

http://en.wikipedia.org/robots.txt

You will see this in it:

User-agent: *
Disallow: /wiki/Special:Search

So your response of 403 - Forbidden is expected. They don't want you
doing this.
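
To make the robots.txt rule quoted above concrete, here is a rough sketch of how a client could check a path against the "User-agent: *" group before requesting it. It is deliberately simplified and is not a full robots.txt parser: it ignores Allow lines, wildcards, inline comments and agent-specific groups. The class name RobotsCheck and the method IsDisallowed are invented for this example.

```csharp
using System;

class RobotsCheck
{
    // Returns true if `path` starts with a Disallow prefix listed in the
    // "User-agent: *" group. Simplified on purpose; see the note above.
    public static bool IsDisallowed(string robotsTxt, string path)
    {
        bool inStarGroup = false;
        foreach (string raw in robotsTxt.Split('\n'))
        {
            string line = raw.Trim();
            int colon = line.IndexOf(':');
            if (line.StartsWith("#") || colon < 0) continue; // comment or blank line

            string field = line.Substring(0, colon).Trim().ToLowerInvariant();
            string value = line.Substring(colon + 1).Trim();

            if (field == "user-agent")
            {
                // Track whether the rules that follow apply to every agent.
                inStarGroup = (value == "*");
            }
            else if (field == "disallow" && inStarGroup && value.Length > 0
                     && path.StartsWith(value, StringComparison.Ordinal))
            {
                return true;
            }
        }
        return false;
    }

    static void Main()
    {
        // The two lines quoted from Wikipedia's robots.txt in the post above.
        string rules = "User-agent: *\nDisallow: /wiki/Special:Search\n";
        Console.WriteLine(IsDisallowed(rules, "/wiki/Special:Search?search=Microsoft"));
        Console.WriteLine(IsDisallowed(rules, "/wiki/Main_Page"));
    }
}
```

With the quoted rules, the search path matches the Disallow prefix and Main_Page does not, which lines up with the behaviour the original poster reported.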
 
Peter Bromberg [C# MVP]

Wikipedia exposes its content for search via several APIs, including a few
that have been written and are managed by third parties. There is an XML
version that returns the MediaWiki markup for a result page inside an XML
element. You would still have to convert the wiki markup to formatted HTML, a
process which is not trivial. As Nicholas indicated, Wikipedia doesn't want
people "faking" their search box and redisplaying the scraped content.
-- Peter
Site: http://www.eggheadcafe.com
UnBlog: http://petesbloggerama.blogspot.com
Short Urls & more: http://ittyurl.net
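
As an illustration of the supported-API route suggested in the replies above, the sketch below queries the MediaWiki web API (api.php with action=query and list=search) and asks for XML. Treat the details as assumptions to check against the current MediaWiki documentation: the endpoint and parameters are as I understand them today, not something stated in this thread, and the User-Agent string is a hypothetical placeholder. Wikimedia asks clients to identify themselves, and HttpWebRequest sends no User-Agent header unless you set one.

```csharp
using System;
using System.IO;
using System.Net;

class WikiSearch
{
    // Build a MediaWiki API search URL. The query is escaped so that spaces,
    // '&' and '#' cannot break the query string.
    public static string BuildSearchUrl(string query)
    {
        return "https://en.wikipedia.org/w/api.php?action=query&list=search&format=xml&srsearch="
               + Uri.EscapeDataString(query);
    }

    static void Main(string[] args)
    {
        string query = args.Length > 0 ? args[0] : "Microsoft";
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(BuildSearchUrl(query));

        // Hypothetical identifier: replace with your own tool name and contact details.
        request.UserAgent = "ExampleSearchTool/0.1 (contact@example.com)";

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            Console.WriteLine(reader.ReadToEnd()); // raw XML search results
        }
    }
}
```

The response is a list of matching page titles and snippets rather than rendered HTML, so it avoids scraping the search page but still leaves any wiki-markup conversion to the caller.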
 
