C SHARP - Parsing URL for Variable

  • Thread starter Jozef Jarosciak

Jozef Jarosciak

Hi everyone,

I am building a web crawler, and one of the features I need is the
ability to exclude a specified 'variable + value' pair from a URL.

For example, suppose the user wants to remove the variable "s".

Take this URL:
"http://www.goldenretrieverforum.com/search.php?s=5817617a59fb630a7f40846e4a29efc1&do=getdaily"

It has a variable 's' with its value, plus some other variables.

I need code that shortens the URL to this:
"http://www.goldenretrieverforum.com/search.php?do=getdaily"
removing variable 's' completely.



It also needs to be smart enough that when variable 's' is the
last (or only) variable in the link, like this:

"http://www.goldenretrieverforum.com/search.php?s=5817617a59fb630a7f40846e4a29efc1"

it correctly reduces it to:
"http://www.goldenretrieverforum.com/search.php"


Can someone help me write a regex, or point me to a site that already
has such a regex?

Or is there any other way to do this?

Thanks a lot for your time and help.

Joe
 
gs

Try a regex.
I am not good at this, but here is a hint and an example. I saw a
sample pattern that splits a URL into five match groups: scheme,
authority, path, query, and fragment:
"(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
(([^:/?#]+):)?   // scheme
(//([^/?#]*))?   // authority
([^?#]*)         // path
(\?([^#]*))?     // query
(#(.*))?         // fragment
Use match.Groups to read the pieces.

For detailed instructions, search for regex on MSDN.

The pattern for your 's' variable and its value could be something like
"s=[a-z0-9]*&"

You could then use Regex.Replace to remove whatever matches "s=[a-z0-9]*&"
from the URL. Note that this pattern only matches when another variable
follows 's', so the case where 's' comes last needs separate handling.
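
One way to cover both cases with regular expressions, as a rough sketch rather than a tested solution. The class name QueryRegexSketch and the helper RemoveParameter are made up for illustration, and the code assumes a well-formed URL in which '?' and '&' do not appear inside the fragment.

```csharp
using System;
using System.Text.RegularExpressions;

static class QueryRegexSketch
{
    // Hypothetical helper: removes every "name=value" pair for the given name.
    public static string RemoveParameter(string url, string name)
    {
        // Match the pair only when it directly follows '?' or '&',
        // and swallow one trailing '&' if there is one.
        string pattern = @"(?<=[?&])" + Regex.Escape(name) + @"=[^&#]*&?";
        string result = Regex.Replace(url, pattern, "");

        // Drop a '?' or '&' left dangling at the end of the query.
        return Regex.Replace(result, @"[?&](?=#|$)", "");
    }
}
```

With the two URLs from the question and the name "s", this should give ".../search.php?do=getdaily" and ".../search.php". The lookbehind keeps a variable such as "cats" from being mistaken for "s".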
 
Jozef Jarosciak

Hi, thanks. When it comes to regex I am completely lost.
Is there anyone who could write this (supposedly) simple regex for
removing variable 's' from the URL?
Joe
 
Dave

Given:

http://nowhere.com/folder/file.txt?v1=hello%20world

the expression could be:

(?<variable> (?<name> .*?) = (?<value> .*?) (?: & | $ ) )

Use a Uri object to grab only the query string, and to put it back later if required.

The expression above should parse all name=value pairs from a query string.
(You must use RegexOptions.IgnorePatternWhitespace and RegexOptions.Singleline.)

You can use Groups["name"].Value, etc. to obtain the strings to compare.
You can use group indexes to manipulate the query string.
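
The "grab the query string, change it, and reset it" idea can also be sketched without a regex by letting UriBuilder do the splitting. This is only an illustration under assumptions: the class QueryUriSketch and the helper RemoveParameter are hypothetical names, the input is assumed to be an absolute URL, and names are compared as written (no URL decoding).

```csharp
using System;
using System.Collections.Generic;

static class QueryUriSketch
{
    // Hypothetical helper: drops the named variable from the query string.
    public static string RemoveParameter(string url, string name)
    {
        UriBuilder builder = new UriBuilder(url);

        // builder.Query includes the leading '?' when a query is present.
        string query = builder.Query.TrimStart('?');

        List<string> kept = new List<string>();
        foreach (string pair in query.Split(new[] { '&' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // The name is everything before the first '='.
            int eq = pair.IndexOf('=');
            string key = eq < 0 ? pair : pair.Substring(0, eq);
            if (key != name)
                kept.Add(pair);
        }

        // Assign the query without a leading '?'; an empty string clears it.
        builder.Query = string.Join("&", kept.ToArray());
        return builder.Uri.AbsoluteUri;
    }
}
```

Going through Uri may normalize the URL (for example, a bare host such as "http://example.com" comes back with a trailing slash), which may or may not be what a crawler wants.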
 
Andrei Pociu

Hello,

I wrote this tutorial a while ago on parsing using the URI class:
http://www.geekpedia.com/tutorial68_Using-the-URI-Class.html
Hopefully it will get you on the right path, at least.

Andrei
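
For readers who have not used the Uri class, here is a small sketch of the pieces it exposes. It is not taken from the tutorial; UriPartsSketch, Describe, and StripQuery are hypothetical names chosen for this example.

```csharp
using System;

static class UriPartsSketch
{
    // Hypothetical helper: lists the main components of an absolute URL.
    public static string Describe(string url)
    {
        Uri uri = new Uri(url);

        // Query keeps its leading '?', and is empty when there is no query.
        return uri.Scheme + " | " + uri.Host + " | " + uri.AbsolutePath + " | " + uri.Query;
    }

    // Hypothetical helper: returns everything up to and including the path,
    // which drops the query string and any fragment.
    public static string StripQuery(string url)
    {
        return new Uri(url).GetLeftPart(UriPartial.Path);
    }
}
```

StripQuery removes the whole query string, so it only covers the case where 's' is the sole variable; removing one variable among several still needs the query to be split and rebuilt.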

 
Greg Bacon

: I am building a web crawler and one of the features which I need to
: include is exclusion of specified 'variable + value' from the url.
:
: Example, user wanted to extract variable "s":
: [...]

The following simple example should provide a start:

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class Program
{
    // Rebuilds the URI matched by the RFC 2396 pattern, leaving out the "s" parameter.
    static string EraseEssParameter(Match uri)
    {
        // Each match is one name=value pair from the query string.
        Regex parmpat = new Regex(@"\b(?<name>[^=&]+)=[^#&]*");
        Match parm = parmpat.Match(uri.Groups["query"].Value);

        List<string> parms = new List<string>();
        while (parm.Success)
        {
            if (parm.Groups["name"].Value != "s")
                parms.Add(parm.Value);

            parm = parm.NextMatch();
        }

        // Put the "?" back only if some parameters remain.
        string query = String.Join("&", parms.ToArray());
        if (query != "")
            query = "?" + query;

        return uri.Groups["before"].Value + query + uri.Groups["after"].Value;
    }

    static void Main(string[] args)
    {
        string[] inputs =
        {
            "http://www.goldenretrieverforum.com/search.php?s=ZZZ&do=getdaily",
            "http://www.goldenretrieverforum.com/search.php?s=ZZZ",
        };

        // from Appendix B of RFC 2396
        Regex rfc2396 = new Regex(
            @"^" +
            @"(?<before>" +
            @"(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)" +
            @")" +
            @"(\?(?<query>[^#]*))?" +
            @"(?<after>(#(.*))?)",
            RegexOptions.ExplicitCapture);

        MatchEvaluator sremove = new MatchEvaluator(EraseEssParameter);
        foreach (string uri in inputs)
        {
            Console.WriteLine("BEFORE: " + uri);
            Console.WriteLine("AFTER: " + rfc2396.Replace(uri, sremove));
        }
    }
}

Hope this helps,
Greg
 
