Text Processing Headache - Please Help.

garyusenet

Hi All,

I have been working on the following programme over the last day or so
and have made a good deal of progress. It is a very simple programme,
but is proving very useful as a learning aid, and will eventually be
useful to me in its own right.

Its function is to open a text file, and remove HTTP addresses from the
file. The file is always in a certain format, and the HTTP address is
always preceded by a key phrase.

So far I have got as far as opening the file and removing all the junk
before the first site is listed. What I'm trying to do now is split the
remaining string into an array. I want to use a phrase as the
delimiter, and reading this forum I found a previous poster had the
same problem and someone suggested that he use the Regex function.

I've tried that but am getting a result I don't understand. When I
check the size of the array the Regex function has created for me it is
far too small.

The code I'm talking about is here:

string[] sitearray = Regex.Split(shortenedstring, "num=");
MessageBox.Show(sitearray.Length.ToString());

Now when I open the file in Notepad and count the number of 'num='
occurrences, there are between 7 and 10 in each file I test. But I'm
getting arrays of sizes sometimes as low as 3.

The full code to my programme is here (it's quite a simple programme!)

http://rafb.net/paste/results/b1p5uU35.html

I look forward to your feedback,

Thank you,

Gary.
 
Greg Bacon

Help us answer your question by giving sample input that demonstrates
the problem!

Greg
 
rossum

I put the guts of your program into a Console test and it seemed to
work fine, except that Regex.Split returns an initial empty string
(the trimmed input starts with the delimiter).

Here is my version of your code:

    using System;
    using System.Text.RegularExpressions;

    class ConsoleScratch {

        // Returns the start position of SubString within FullString,
        // or -1 if it is not present.
        static int LocateStartOfSubString(string FullString, string SubString) {
            int FirstChr = FullString.IndexOf(SubString);
            return FirstChr;
        }

        static void Main() {

            // Dummy file contents, easier for testing.
            string filebuffer = "xxxxx sites num=first, num=second, num=third, " +
                                "num=fourth, num=fifth, num=sixth, num=seventh, " +
                                "num=eighth, num=ninth, num=tenth";

            // Cut off everything before the key phrase.
            string substring = "sites";
            int mainindexofinterest = LocateStartOfSubString(filebuffer, substring);
            string strippedstring = filebuffer.Remove(0, mainindexofinterest);
            // Remove the key phrase itself ("sites" is 5 characters long).
            string shortenedstring = strippedstring.Remove(0, 5);
            //MessageBox.Show(shortenedstring);
            Console.WriteLine("Shortened string: >{0}<\n", shortenedstring);

            //int spaceafteraddressindex = LocateStartOfSubString(shortenedstring, " ");
            //string firstwebaddress = shortenedstring.Remove(spaceafteraddressindex);

            // You can use Trim to remove leading and trailing spaces:
            shortenedstring = shortenedstring.Trim();
            Console.WriteLine("Trimmed string: >{0}<\n", shortenedstring);

            // Split on the delimiter phrase and show every element,
            // bracketed with > and < so stray whitespace is visible.
            string[] sitearray = Regex.Split(shortenedstring, "num=");
            //MessageBox.Show(sitearray.Length.ToString());
            Console.WriteLine("sitearray.Length = {0}", sitearray.Length);
            foreach (string s in sitearray) {
                Console.WriteLine(" >{0}<", s);
            }

            Console.Write("Press [Enter] to continue... ");
            Console.ReadLine();
        } // end Main()
    }

This gave the results:

Shortened string: > num=first, num=second, num=third, num=fourth,
num=fifth, num=sixth, num=seventh, num=eighth, num=ninth, num=tenth<

Trimmed string: >num=first, num=second, num=third, num=fourth,
num=fifth, num=sixth, num=seventh, num=eighth, num=ninth, num=tenth<

sitearray.Length = 11
 ><
 >first, <
 >second, <
 >third, <
 >fourth, <
 >fifth, <
 >sixth, <
 >seventh, <
 >eighth, <
 >ninth, <
 >tenth<
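
The empty first element appears because the trimmed string begins with the delimiter. If that element is unwanted, one option is String.Split with StringSplitOptions.RemoveEmptyEntries, which treats the phrase as literal text rather than a regex pattern. This is only a minimal sketch; the names SplitSketch and SplitOnPhrase are made up for illustration and are not part of the original programme.

```csharp
using System;

static class SplitSketch
{
    // Splits text on a literal phrase and discards empty pieces, such as
    // the one produced when the text begins with the delimiter.
    // (Hypothetical helper, not from the original code.)
    public static string[] SplitOnPhrase(string text, string phrase)
    {
        return text.Split(new string[] { phrase }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

Calling SplitSketch.SplitOnPhrase("num=first, num=second", "num=") should give two elements, "first, " and "second", with no empty entry at the front.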

Since your code seems to work as expected, I would think that the
problem might lie somewhere in your input file. Try changing the
input file in various ways to see if that has any effect. For
example, do you have "num =" instead of "num=" anywhere?
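
If variants such as "num =" or "NUM=" do turn up in the file, a more forgiving pattern might be worth trying. The sketch below assumes the only variations are letter case and optional whitespace before the equals sign; TolerantSplitSketch is a made-up name for illustration.

```csharp
using System.Text.RegularExpressions;

static class TolerantSplitSketch
{
    // Splits on "num=", "num =", "NUM  =" and similar variants.
    // \s* allows any amount of whitespace before the '='.
    // (Hypothetical helper; the real file format is an assumption.)
    public static string[] Split(string text)
    {
        return Regex.Split(text, @"num\s*=", RegexOptions.IgnoreCase);
    }
}
```

If this version produces more elements than the plain "num=" split on the same input, the file contains delimiter variants that the exact pattern was missing.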

While testing it is also worth showing the full contents of sitearray,
which your original code did not do, so you can see what your program
is actually doing. That could well help with diagnosing the problem.
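
As a rough sketch of that kind of diagnostic, the helpers below count how often the delimiter occurs in the string actually being split and print every element with its index. The names SplitDiagnostics, CountOccurrences and Dump are invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class SplitDiagnostics
{
    // Counts non-overlapping literal occurrences of phrase in text.
    // Regex.Escape makes sure the phrase is not read as a pattern.
    public static int CountOccurrences(string text, string phrase)
    {
        return Regex.Matches(text, Regex.Escape(phrase)).Count;
    }

    // Prints each element with its index, bracketed with > and <
    // so that leading or trailing whitespace is visible.
    public static void Dump(string[] items)
    {
        for (int i = 0; i < items.Length; i++)
        {
            Console.WriteLine("[{0}] >{1}<", i, items[i]);
        }
    }
}
```

For a pattern without capture groups, Regex.Split returns one more element than the number of matches. So if CountOccurrences(shortenedstring, "num=") is smaller than the count seen in Notepad, the string was probably cut short before it ever reached the split.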

As an aside, full HTTP addresses will always start with "http://" or
"HTTP://", and HTTPS addresses with "https://" or "HTTPS://", which
may help you.
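
One way to make use of that is to match the addresses directly instead of splitting on a key phrase. The sketch below assumes each address runs up to the next whitespace character, which may not hold for your file (it will also pick up trailing punctuation such as a comma). UrlSketch and FindAddresses are made-up names.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class UrlSketch
{
    // Returns every run of non-whitespace text that starts with
    // http:// or https://, in any letter case.
    // (Hypothetical helper; whitespace-delimited addresses are an assumption.)
    public static List<string> FindAddresses(string text)
    {
        List<string> found = new List<string>();
        foreach (Match m in Regex.Matches(text, @"https?://\S+", RegexOptions.IgnoreCase))
        {
            found.Add(m.Value);
        }
        return found;
    }
}
```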

rossum
 
