How to split a string using a fixed-length, variable-text delimiter

garyusenet

I'm working on a data file and can't find any common delimiters in the
file to indicate the end of one row of data and the start of the next.
Rows are not on individual lines but run across multiple lines.

It would appear, though, that every distinct set of data starts with a
'code' that is always 25 characters long. The text of the code varies,
however.

Assuming I've read the contents of the file into the string myfile, how
do I split my file into an array, using this variable-text, fixed
25-character-long delimiter?

Thank you!

Gary-
 
Oliver Sturm

Hello,


You should probably be able to use Regex.Split(...), with a good regular
expression of course. I can give you help on writing that regular
expression, but I'll have to know a lot more about the delimiter string.


Oliver Sturm
 
Peter Bradley

How do *you* know it's a delimiter and not data?

In other words, if *I* were to look at the file, knowing nothing about it,
how could I tell what was a delimiter and what was data? How would you
explain to me what to look for?

When you can answer that, you can start thinking about how to pass that
information to a machine.

HTH


Peter
 
garyusenet

Thank you for your replies. OK, I have had another look and I think the
task has just got harder. The length isn't always 25 characters. But I
have found a pattern; hopefully this will help.


I am using this 'code' as a delimiter because it always precedes the
name of an item, and this file is essentially a database of items.
Following the name of an item, a number of characteristics specific
to that item are listed. Eventually the item's characteristics are
completely listed and the next 'code' is encountered, which precedes the
next item in the database.

There do seem to be some identifiable traits of this code.
It appears to be always at least 20 characters long.

- The code is continuous; there are no spaces present.
- It is always composed of letters A-Z or digits 0-9.
- The first two characters of this code are always letters in the range
A-Z.
- These two letters are repeated at least two other times during the
code.

e.g.

DODE86DODE86SZDO010144

So I guess what I am trying to do now is split the string every time a
string is encountered that is at least 20 characters long, is
alphanumeric, and has its first two letters repeated in itself at least
two other times.

I think this is going to be tough?

Any ideas?

Thank you-
 
garyusenet

In case it wasn't obvious I would also like to add that the code has at
least one space at the start of it and the end of it.
 
Chris Dunaway

Who created this file? Is there no documentation that describes its
format? Can you post a sample of the data that shows at least 2
complete "records" or items? Is there anything in the file, perhaps a
header of some sort, that can shed any light on the format?

Chris
 
Oliver Sturm

Hello,

Yes, well... you could try using a regular expression such as this:

[ ]([A-Z][A-Z])([A-Z0-9]+)(\1)([A-Z0-9]+)(\1)([A-Z0-9]+)[ ]

(This could also be simplified a bit.)
This does check for the double repetition of the initial two characters,
but it doesn't check the minimum length of the string at the same time. If
you were just searching the text in question for occurrences of the
expression, you could easily write an additional check in code to find
out whether any given string has the required minimum length. But if you're
using this expression in a Split() call, you can't do that...

Personally I would probably still use an expression such as this to
search for the delimiters, and do the splitting myself. If you can't do the
splitting fully automatically, you'll have to do it yourself in any case -
and using a regular expression to do the delimiter searching seems a
better option to me than coding up the search in C#.
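
A rough sketch of that "search with a regex, split by hand" idea might look like the following. It is only an illustration under assumptions that are not in the thread: the class and method names (RecordSplitter, SplitRecords) are made up, the literal spaces around the expression are replaced by lookarounds so that a code at the very start of the file or next to a line break is still found, and the sample text is invented.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RecordSplitter
{
    // Same idea as the expression above: two letters that recur twice more
    // inside one unbroken alphanumeric run. The lookarounds (an assumption)
    // stand in for the literal spaces and do not consume any text.
    private static readonly Regex CodePattern = new Regex(
        @"(?<![A-Z0-9])([A-Z][A-Z])[A-Z0-9]+\1[A-Z0-9]+\1[A-Z0-9]+(?![A-Z0-9])");

    // Returns (code, text following the code) pairs. Matches shorter than
    // minCodeLength are treated as ordinary data and stay in the record body.
    public static List<KeyValuePair<string, string>> SplitRecords(string text, int minCodeLength)
    {
        var records = new List<KeyValuePair<string, string>>();
        string currentCode = null;
        int bodyStart = 0;

        foreach (Match m in CodePattern.Matches(text))
        {
            if (m.Length < minCodeLength)
            {
                continue; // the extra length check done in code
            }
            if (currentCode != null)
            {
                string body = text.Substring(bodyStart, m.Index - bodyStart).Trim();
                records.Add(new KeyValuePair<string, string>(currentCode, body));
            }
            currentCode = m.Value;
            bodyStart = m.Index + m.Length;
        }

        if (currentCode != null)
        {
            records.Add(new KeyValuePair<string, string>(currentCode, text.Substring(bodyStart).Trim()));
        }
        return records;
    }

    public static void Main()
    {
        // Invented sample data: two real codes and one short look-alike.
        string sample = " DODE86DODE86SZDO010144 Widget\ncolour red\n"
                      + " ABXX12ABXX34ABXX5678901 Gadget\nsize 4 DOX1DO2DO3 spare";

        foreach (KeyValuePair<string, string> record in SplitRecords(sample, 20))
        {
            Console.WriteLine(record.Key + " => " + record.Value);
        }
    }
}
```

Anything before the first code is dropped here; whether that is right depends on whether the real file has a header, which the thread does not say.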



Oliver Sturm
 
DeveloperX

I agree, that's wonky-sounding data :) How about something like this?
It's not quite what you want, but might give you a start. Currently it
just finds the codes as you've described them and returns them, but
I've got to do some real work so...
I'm not great with regular expressions so I've only used one to check
that the first two characters occur three times. Oh, and it's very scrappy.

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class CodeFinder
{
    static void Main()
    {
        fooTest2();
    }

    private static void fooTest2()
    {
        foreach (string s in foo2(",,,12tt12ttt12ttttttttt,,ab111ab11111111111ab,"))
        {
            Console.WriteLine(s);
        }
    }

    // Returns every unbroken run of letters and digits in pFoo that is at
    // least 20 characters long and contains its own first two characters
    // at least three times (the leading pair plus two repeats).
    // It does not yet require the first two characters to be letters.
    private static List<string> foo2(string pFoo)
    {
        List<string> found = new List<string>();
        int i = 0;
        while (i < pFoo.Length)
        {
            if (!IsAN(pFoo[i]))
            {
                i++;
                continue;
            }

            // Walk to the end of the current alphanumeric run.
            int start = i;
            while (i < pFoo.Length && IsAN(pFoo[i]))
            {
                i++;
            }
            string run = pFoo.Substring(start, i - start);

            if (run.Length >= 20)
            {
                // The only regex here: count occurrences of the first two characters.
                Regex r = new Regex(Regex.Escape(run.Substring(0, 2)));
                if (r.Matches(run).Count >= 3)
                {
                    found.Add(run);
                }
            }
        }
        return found;
    }

    // True for A-Z, a-z and 0-9.
    private static bool IsAN(char pC)
    {
        return ('A' <= pC && pC <= 'Z')
            || ('a' <= pC && pC <= 'z')
            || ('0' <= pC && pC <= '9');
    }
}
 
garyusenet

Hi, it's used in a custom-written, DOS-based programme where I work.
The developers have long since disappeared.

I'd really like to know how to achieve this in code if possible.

Thanks,

Gary-
 
garyusenet

This group is amazing. Thank you both very much; I'm going to explore
both suggestions now.

 
garyusenet

I am trying to use the regular expression that Oliver kindly provided
as a starting point.
filecontents is a string that contains my file contents, but I can't get
this to work. I added the @ as I was getting an error that it didn't
recognise the escape sequence, but it still isn't working. How can I
fix this? Thank you.

I'm getting an error at Regex.Split(...)

Regex r = new Regex(@"[ ]([A-Z][A-Z])([A-Z0-9]+)(\1)([A-Z0-9]+)(\1)([A-Z0-9]+)[ ]");
Regex.Split(filecontents, r);

MessageBox.Show(filecontents.Length.ToString());

Thank you
 
DeveloperX

Try

string[] matches = r.Split(filecontents);

assuming filecontents is the text we're searching.
 
garyusenet

Thank you DeveloperX, I'm not getting the desired result. Then I
realised I shouldn't be using @ as that will just negate the escape
characters.

The regex doesn't like \1. Any suggestions what this should be changed
to?

Thanks,

Gary-
 
DeveloperX

Bizarre, I pasted the regex into my little test app, with the @ of
course, and it fires:

foo3(",,,12tt12ttt12ttttttttt,, ABCCCABCCCCCCCCCABCC ,");

private void foo3(string pFoo)
{
    System.Text.RegularExpressions.Regex r = new System.Text.RegularExpressions.Regex(
        @"[ ]([A-Z][A-Z])([A-Z0-9]+)(\1)([A-Z0-9]+)(\1)([A-Z0-9]+)[ ]");
    // Number of delimiter codes found in the input.
    Console.WriteLine(r.Matches(pFoo).Count.ToString());
    string[] s = r.Split(pFoo);
}

The \1 refers to the first group, ([A-Z][A-Z]), so what this regex is
saying is: match a space, then XX, then any combination of X or N, then the
XX found earlier, then more X or N, then our original XX again, followed
by more X or N, then a space, iirc. (X = A-Z, N = 0-9.)

You might also wish to look at the
System.Text.RegularExpressions.RegexOptions enum, which sets things like
case sensitivity, multiline support and so forth. As you can see above,
I didn't set anything and just took the defaults.

What is the actual error you got?


 
Oliver Sturm

Hello,
garyusenet wrote:
"Then I realised I shouldn't be using @ as that will just negate the
escape characters."

No, using the @ should be just fine, I usually do that myself.

garyusenet wrote:
"The regex doesn't like \1. Any suggestions what this should be changed
to?"

That's if you don't use the @, right?

I'm not really sure what the problem might be - of course my expression is
working with a lot of assumptions that you and I have been making in this
discussion, so accordingly there may be a lot of reasons why you're not
"getting the desired results" :)

I checked that my expression worked with the delimiter string you
previously posted, but nothing else of course. If you can post further
examples of the delimiter string, maybe that would help... otherwise, feel
free to send me a sample program or a sample data file by email (I think
attachments can't be posted to this group?) or something and I'll have a
look.


Oliver Sturm
 
garyusenet

Thank you DeveloperX. I don't get an error with the @, only when I remove
the @.

I removed the @ because when I include it the result isn't what I
expected.

If I run this with the @ and then check the length of the arraylist, it's
3055.

Now, the sample file I'm running it on has three of these 'codes' and
three rows of data.

e.g.

HUa82ab8HU272ajHUeje <lots of other text here running over multiple
lines> UNa8723oansjaUNasUNa <more text here running over many lines>
IN8aatjresINiys9aINsa <more text here>

Now I thought I would get all the text between <...> into individual
arraylist elements by running this, but I'm not... What am I doing
wrong?

Thank you
Gary-

I was expecting each part of my arraylist
e.g. [0], [1], ...
to contain everything between a set of codes.
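
One documented behaviour that may matter for that expectation: when the expression passed to Regex.Split contains capturing parentheses, the captured text is included in the result array alongside the pieces between matches. The small sketch below shows this with the expression from this thread and DeveloperX's test code; the class and method names are made up for illustration, and it is not claimed to account for the 3055 figure.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitCaptureDemo
{
    // The expression discussed in this thread, with its six capturing groups.
    private const string Pattern =
        @"[ ]([A-Z][A-Z])([A-Z0-9]+)(\1)([A-Z0-9]+)(\1)([A-Z0-9]+)[ ]";

    public static string[] SplitWithGroups(string text)
    {
        return Regex.Split(text, Pattern);
    }

    public static void Main()
    {
        // One delimiter between "head" and "tail": the array holds the two
        // outer pieces plus one entry for each of the six groups.
        string[] parts = SplitWithGroups("head ABCCCABCCCCCCCCCABCC tail");
        Console.WriteLine(parts.Length);
        foreach (string part in parts)
        {
            Console.WriteLine("[" + part + "]");
        }
    }
}
```

Because \1 needs the first group to be capturing, the extra entries cannot simply be switched off for this expression; locating the codes with Matches and cutting the text by match position avoids the issue.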




 
DeveloperX

Here's foo3 again with some extra code. The interesting bit is the for
loop at the end. If you print out what it's matching and the position
in the source data, we can see what's going on. Is it feasible to post
the test data?
On the @ thing: C# uses escape characters in strings, so \ followed by a
character has a special meaning; \t is tab (iirc). What the @ does
before the string is tell the compiler that everything in the quotes is
a literal string, and that it shouldn't get fancy and try to replace \1
with what it thinks \1 should mean (or fail to compile when it doesn't
know what it is :)).

private void foo3(string pFoo)
{
    System.Text.RegularExpressions.Regex r = new System.Text.RegularExpressions.Regex(
        @"[ ]([A-Z][A-Z])([A-Z0-9]+)(\1)([A-Z0-9]+)(\1)([A-Z0-9]+)[ ]");

    // How many delimiter codes were found.
    Console.WriteLine(r.Matches(pFoo).Count.ToString());

    string[] s = r.Split(pFoo);
    //Console.WriteLine(s[0]);
    //Console.WriteLine(s[1]);

    // Walk the matches one at a time, printing each match's position and text.
    System.Text.RegularExpressions.Match a;
    for (a = r.Match(pFoo); a.Success; a = a.NextMatch())
    {
        Console.WriteLine(a.Index.ToString() + " " + a.Value);
    }
}
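
On the @ point above, a tiny sketch may make it concrete: a verbatim string and a regular string with doubled backslashes hold exactly the same characters, so the regex engine sees the same \1 either way. The class name and the shortened pattern are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VerbatimDemo
{
    // Verbatim literal: the backslash is kept as typed.
    public static readonly string Verbatim = @"([A-Z][A-Z])[0-9]+\1";

    // Regular literal: each backslash has to be escaped as \\.
    public static readonly string Escaped = "([A-Z][A-Z])[0-9]+\\1";

    public static void Main()
    {
        Console.WriteLine(Verbatim == Escaped);                 // same characters
        Console.WriteLine(Regex.IsMatch("DO86DO", Verbatim));   // pair repeats
        Console.WriteLine(Regex.IsMatch("DO86DE", Verbatim));   // pair does not repeat
    }
}
```

Writing "\1" in a regular literal without doubling the backslash is what produces the unrecognised escape sequence error.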

 
Oliver Sturm

Hello,

An obvious thing could be to use RegexOptions.IgnoreCase in your call to
Split() - your original delimiter didn't have any lower case characters,
but those you're posting now do.
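
To illustrate the case-sensitivity point, here is a small sketch that counts matches with and without RegexOptions.IgnoreCase. The three codes are the ones posted above; the text between them, the class name and the method name are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IgnoreCaseDemo
{
    // The expression from earlier in the thread, unchanged.
    private const string Pattern =
        @"[ ]([A-Z][A-Z])([A-Z0-9]+)(\1)([A-Z0-9]+)(\1)([A-Z0-9]+)[ ]";

    // Counts how many delimiter codes the expression finds with the given options.
    public static int CountCodes(string text, RegexOptions options)
    {
        return new Regex(Pattern, options).Matches(text).Count;
    }

    public static void Main()
    {
        // Posted codes with made-up filler text between them.
        string sample = " HUa82ab8HU272ajHUeje first record text "
                      + " UNa8723oansjaUNasUNa second record text "
                      + " IN8aatjresINiys9aINsa third record text ";

        Console.WriteLine(CountCodes(sample, RegexOptions.None));
        Console.WriteLine(CountCodes(sample, RegexOptions.IgnoreCase));
    }
}
```

With IgnoreCase the character classes also accept lower-case letters, so ordinary words in the data become more likely to match by accident; pairing it with the minimum-length check is probably wise.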

Apart from that - either describe in much more detail how your code works
now and what result you're actually getting, or post or mail something
that lets us reproduce the problem ourselves.


Oliver Sturm
 
garyusenet

Emailed a sample, thanks very much.

 
Chris Dunaway

garyusenet wrote:
"Hi it's used in a custom written programme where I work which is DOS
based. The developers have long since disappeared. I'd really like to
know how to achieve this in code if possible."

That's why I suggested posting a few "records" from this file so we
could see it and maybe help determine its format. Does the file start
with this data immediately or is there any header data in the beginning
of the file?

 
