Splitting a large string variable into lines <= 70 chars

Daren

Hi,

I need to be able to split large string variables into an array of
lines, each line can be no longer than 70 chars.

The string variables are text, so I would additionally like the lines
to end at the end of a word, if you catch my drift.

For example, I have a large string variable containing the text:
"I've seen things you people wouldn't believe. Attack ships on fire
off the Shoulder of Orion. I watched C-beams glistening in the
moonlight at the Tannhauser Gate. All these moments will be lost, like
tears in rain. Time to die."

Now, with a limit of 70 chars per line (and lines *must* end with a
completed word), I want this to appear like this:
"I've seen things you people wouldn't believe. Attack ships on fire"
"off the Shoulder of Orion. I watched C-beams glistening in the"
"moonlight at the Tannhauser Gate. All these moments will be lost,"
"like tears in rain. Time to die."

i.e., split into an array of x lines.

Thanks,

Daren
 
cj

Just some thoughts off hand and I don't know the commands only that they
exist. Locate the first space from the right starting at 70 in the
input string. You then can take everything before that as the first
string. Everything after that becomes the new input string. Repeat
until the input string is less than 70.

Or count one char at a time through the input string until you get to 70
saving the position of the last space you find.

Or you might split the input string with the split command, giving it the
space as the delimiter. Then stack the words into new strings, checking
that they don't add up to more than 70 in each string.
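
For what it's worth, the second idea (one pass through the string, remembering the last space seen) might look roughly like the sketch below. WrapByScanning and its parameter names are made up for illustration. It assumes words are separated by single spaces and that maxLength is at least 1, and it simply cuts a "word" that is longer than the limit.

```vbnet
Imports System
Imports System.Collections.Generic

Module ScanSketch
    ' Hypothetical helper, not from the thread: one forward pass that
    ' remembers the most recent space on the current line.
    Function WrapByScanning(ByVal text As String, ByVal maxLength As Integer) As List(Of String)
        Dim lines As New List(Of String)()
        Dim lineStart As Integer = 0   ' index where the current line begins
        Dim lastSpace As Integer = -1  ' index of the last space seen on this line
        For i As Integer = 0 To text.Length - 1
            If text(i) = " "c Then lastSpace = i
            If i - lineStart + 1 > maxLength Then
                If lastSpace > lineStart Then
                    ' Break at the remembered space and skip over it.
                    lines.Add(text.Substring(lineStart, lastSpace - lineStart))
                    lineStart = lastSpace + 1
                Else
                    ' No usable space: the "word" is too long, so cut it here.
                    lines.Add(text.Substring(lineStart, maxLength))
                    lineStart = i
                End If
                lastSpace = -1
            End If
        Next
        ' Whatever is left over becomes the last line.
        If lineStart < text.Length Then lines.Add(text.Substring(lineStart))
        Return lines
    End Function

    Sub Main()
        For Each line As String In WrapByScanning("All these moments will be lost, like tears in rain.", 20)
            Console.WriteLine(line)
        Next
    End Sub
End Module
```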
 
CMM

You're looking for a line wrapping algorithm. I don't know if there is a
built-in function to do it for you, but this is fun and a good exercise to
try and come up with on your own without help. It's not hard.

Basically you split your text into an array of words. Fill a string with the
words until you determine that adding the next word would surpass the length
limit, make a typewriter DING sound in your head, add the string to your
lines array, move on to the next line.
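
A minimal sketch of that word-filling idea, in case it helps. The names WrapByWords and maxLength are invented here, not taken from any library. It assumes the text is split on single spaces only, and a word longer than the limit is left whole on a line of its own rather than broken.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module WordFillSketch
    ' Hypothetical helper: fill a line word by word until the next word
    ' would not fit, then start a new line.
    Function WrapByWords(ByVal text As String, ByVal maxLength As Integer) As List(Of String)
        Dim lines As New List(Of String)()
        Dim current As New StringBuilder()
        Dim words() As String = text.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries)
        For Each word As String In words
            ' Would a space plus this word pass the limit? Then "DING".
            If current.Length > 0 AndAlso current.Length + 1 + word.Length > maxLength Then
                lines.Add(current.ToString())
                current.Length = 0
            End If
            If current.Length > 0 Then current.Append(" "c)
            current.Append(word)
        Next
        ' Do not forget the line that was still being filled.
        If current.Length > 0 Then lines.Add(current.ToString())
        Return lines
    End Function

    Sub Main()
        Dim sample As String = "All these moments will be lost, like tears in rain. Time to die."
        For Each line As String In WrapByWords(sample, 30)
            Console.WriteLine(line)
        Next
    End Sub
End Module
```

Because the words are re-joined with a single space, no trimming is needed afterwards.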
 
Cerebrus

Hi Daren,

I tried out one of CJ's excellent suggestions (the first one, actually)
and it works for me. This seems to be the most efficient method to me.
As CMM said, it's an interesting exercise to try out yourself.

If you're still stuck, here's the code.

(I'm using a preferable Line Length of 65, since we need to search for
a space after this length. This means that each line is about 65-75
chars in length, on average, depending on how big the last word is.)

============================================

'Requires "Imports System.Text" at the top of the file for StringBuilder.
Dim LineLength As Integer = 65
Dim currPos As Integer
Dim theText As String = "My Large String goes here"
Dim thisLine As String
Dim allLines As New StringBuilder()
While theText.Length > LineLength
    'Locate the first space at or after the specified no. of chars (LineLength).
    currPos = theText.IndexOf(" ", LineLength)
    If currPos > -1 Then
        'Get all the text from start of string to currPos (including the space)
        thisLine = theText.Substring(0, currPos + 1)
        'Remove this extracted part from the original string too.
        theText = theText.Remove(0, currPos + 1)
        'Append this line and a CrLf to the StringBuilder
        allLines.Append(thisLine)
        allLines.Append(vbCrLf)
    End If
End While
'Append the remaining part of the text (last line) to the StringBuilder
allLines.Append(theText)
'Display the Text in a Multiline Textbox
TextBox1.Text = allLines.ToString()

============================================

HTH,

Regards,

Cerebrus.
 
cj

Cerebrus,

Thanks for the kind words about my ideas. I knew there was some reason
I'm still employed. And I doubt it's for my knowledge of VB.net :)

Actually I was thinking in suggestion #1 of using the InStrRev function
to locate the last space before the 70th char. (I had to go look it
up--to write this reply) It might be an older command from VB6 era but
it is in the .net help.

I've done a lot of string manipulation in my career. Unfortunately not
much in VB.
 
Cerebrus

Lol, you're welcome, CJ.
> I knew there was some reason I'm still employed. And I doubt it's for my knowledge of VB.net :)

Well, I'm not yet employed. Still looking for a job ! ;-(

The InStrRev function seems perfect for the job in this case. I tried
to find a .NET equivalent, but nothing else will do the job in this
situation. (Since we're breaking the string *after* finding the space.)

Just a reminder for anyone planning to use similar code: the InStrRev
function returns a 1-based index, while the Substring method is
0-based, so the index needs adjusting by 1 when you combine them. In my
code, I used "currPos + 1" to include the trailing space in the
substring. (Forgot to trim it later !)
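
To make the index arithmetic concrete, here is a small sketch that pulls just the first line off a string with InStrRev. FirstLine is a hypothetical helper name. It assumes that the limit means "at most maxLength characters" and that a string with no space in range is simply cut at the limit.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module InStrRevSketch
    ' Hypothetical helper: returns the first wrapped line of text.
    Function FirstLine(ByVal text As String, ByVal maxLength As Integer) As String
        If text.Length <= maxLength Then Return text
        ' Search backwards from 1-based position maxLength + 1, so that a
        ' space directly after the last allowed character still counts.
        Dim pos As Integer = InStrRev(text, " ", maxLength + 1)
        ' 0 means no space was found; cut at the limit in that case.
        If pos = 0 Then Return text.Substring(0, maxLength)
        ' pos is 1-based, so pos - 1 is the number of characters before the space.
        Return text.Substring(0, pos - 1)
    End Function

    Sub Main()
        Console.WriteLine(FirstLine("tears in rain", 8))   ' prints "tears in"
    End Sub
End Module
```

With the 0-based IndexOf or LastIndexOf the same cut would be Substring(0, index), which is where the off-by-one tends to creep in.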

Regards,

Cerebrus.
 
Peter Macej

> The InStrRev function seems perfect for the job in this case. I tried
> to find a .NET equivalent, but nothing else will do the job in this

There is String.LastIndexOf method which does the same as InStrRev.
 
Cerebrus

Hi Peter,

I did consider the String.LastIndexOf() method, but it didn't seem
suited for the job, since in this case (if you analyse the original
question), we need to *start* searching backwards from the 70th
character for the *first* space. While, the LastIndexOf function will
search *forward* for the *last* space.

Since we break the string, only after searching for the space,
LastIndexOf didn't seem appropriate. If the String.IndexOf() function
had a "direction" parameter, it could have been used.

Please let me know if you can think of a way to do it using .NET
functions.

Regards,

Cerebrus.
 
Peter Macej

> While, the LastIndexOf function will
> search *forward* for the *last* space.

Sorry, but that's wrong. LastIndexOf searches BACKWARDS. There is also
overloaded method with starting index, in your case 70:
String.LastIndexOf Method (String, Int32)
see
http://msdn.microsoft.com/library/d...l/frlrfsystemstringclasslastindexoftopic4.asp

From the documentation:
"The search begins at the startIndex character position of this instance
and proceeds backwards towards the beginning until either value is found
or the first character position has been examined."
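
As a rough sketch of how that overload could drive the whole split: the code below uses String.LastIndexOf(Char, Int32) with the line limit as the start index. WrapWithLastIndexOf is a made-up name. It assumes single spaces between words and cuts any "word" longer than the limit.

```vbnet
Imports System
Imports System.Collections.Generic

Module LastIndexOfSketch
    ' Hypothetical helper: split text into lines of at most maxLength chars.
    Function WrapWithLastIndexOf(ByVal text As String, ByVal maxLength As Integer) As List(Of String)
        Dim lines As New List(Of String)()
        Dim remaining As String = text
        While remaining.Length > maxLength
            ' Search backwards from 0-based index maxLength; a space found
            ' exactly there leaves a line of maxLength characters.
            Dim breakAt As Integer = remaining.LastIndexOf(" "c, maxLength)
            If breakAt <= 0 Then
                ' No usable space: cut the over-long word at the limit.
                lines.Add(remaining.Substring(0, maxLength))
                remaining = remaining.Substring(maxLength)
            Else
                ' Take everything before the space and drop the space itself.
                lines.Add(remaining.Substring(0, breakAt))
                remaining = remaining.Substring(breakAt + 1)
            End If
        End While
        If remaining.Length > 0 Then lines.Add(remaining)
        Return lines
    End Function

    Sub Main()
        Dim sample As String = "I've seen things you people wouldn't believe. " & _
            "Attack ships on fire off the Shoulder of Orion."
        For Each line As String In WrapWithLastIndexOf(sample, 70)
            Console.WriteLine(line)
        Next
    End Sub
End Module
```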
 
Cerebrus

Oops ! It seems I missed that part. Thank you so much for that
correction, Peter.

I stand corrected. :)

Regards,

Cerebrus.
 
cj

Sorry to hear about your employment situation. I've been there too.
The IT job market is still tough in a lot of areas. You know .net well
and that will help.
 
cj

Yep, Peter located the .net replacement. I knew someone would. Now to
remember that for when I need to use it.
 
Joergen Bech

> Sorry, but that's wrong. LastIndexOf searches BACKWARDS. There is also
> overloaded method with starting index, in your case 70:
> String.LastIndexOf Method (String, Int32)
> see
> http://msdn.microsoft.com/library/d...l/frlrfsystemstringclasslastindexoftopic4.asp
>
> From the documentation:
> "The search begins at the startIndex character position of this instance
> and proceeds backwards towards the beginning until either value is found
> or the first character position has been examined."

LastIndexOf goes through quite a lot of validation before finally
making it to an InternalCall to the CLR.

Since you are only looking for a single space and you are likely
to find it (on average) within 5-6 iterations(?), I think the fastest
approach is to search for it "manually", looping back from pos
70 of each line down to the first space.

Another comment about your source sample: In your sample,
you remove the lines you found from the original string.
This is easy to read, but is likely to be costly in terms of
performance.
Instead, do not modify the original string at all during the loop,
but just keep track of where your next line begins, i.e. last
line end found becomes the next line start position.
If you are just testing it with a single paragraph or page,
you are unlikely to see any effect of this optimization, but
if you are writing an eBook converter or high-volume data
import function, it could be noticeable.
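
Something along these lines, as a sketch only: the original string is never modified, a start index is carried from line to line, and the space is found by looping back by hand. WrapByIndex is a hypothetical name. Single spaces between words are assumed, and a stretch with no space at all is cut at the limit.

```vbnet
Imports System
Imports System.Collections.Generic

Module IndexTrackingSketch
    ' Hypothetical helper: wraps text without ever shortening the source string.
    Function WrapByIndex(ByVal text As String, ByVal maxLength As Integer) As List(Of String)
        Dim lines As New List(Of String)()
        Dim lineStart As Integer = 0
        While text.Length - lineStart > maxLength
            ' Loop back "manually" from the limit to the nearest space.
            Dim breakAt As Integer = lineStart + maxLength
            While breakAt > lineStart AndAlso text(breakAt) <> " "c
                breakAt -= 1
            End While
            If breakAt = lineStart Then
                ' No space on this stretch: cut at the limit instead.
                lines.Add(text.Substring(lineStart, maxLength))
                lineStart += maxLength
            Else
                lines.Add(text.Substring(lineStart, breakAt - lineStart))
                ' The line end just found gives the next line start.
                lineStart = breakAt + 1
            End If
        End While
        If lineStart < text.Length Then lines.Add(text.Substring(lineStart))
        Return lines
    End Function

    Sub Main()
        For Each line As String In WrapByIndex("All these moments will be lost, like tears in rain.", 20)
            Console.WriteLine(line)
        Next
    End Sub
End Module
```

Whether this is measurably faster than the Remove version would have to be timed on real data.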

/JB
 
Joergen Bech

> Another comment about your source sample: In your sample,
> you remove the lines you found from the original string.
> This is easy to read, but is likely to be costly in terms of
> performance.
> Instead, do not modify the original string at all during the loop,
> but just keep track of where your next line begins, i.e. last
> line end found becomes the next line start position.
> If you are just testing it with a single paragraph or page,
> you are unlikely to see any effect of this optimization, but
> if you are writing an eBook converter or high-volume data
> import function, it could be noticeable.

Two more comments:

1) After you find your lines, make sure you trim them.

2) Make sure your algorithm handles lines with
"words" that are longer than the line length specified.
I haven't checked, but I am fairly sure the posted
sample would enter an infinite loop if such a beast
was encountered.
Yes, this could happen. Or have you never seen something
like

klajsdflkajsdflkjasdklfjaslkjdflkjasdlkjfkasdfjklasdjklflkjadskljfklasdflkjlkjasdflkjlkajsdfljkasjkldfjklalsdkjflkjasdflkjalkjsdflkjalskjfdlkjasdlkjfjkladflkjalkjsdflkjasdfljkalsdkjfjlafdljkakljfd

in a text file?
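
A hedged sketch of both points applied to the posted approach (break at the first space at or after the line length): the loop exits when no further space exists, and each line is trimmed. WrapAfter is a made-up function name. Note that an over-long "word" still ends up on one long line here; it just no longer hangs the loop.

```vbnet
Imports System
Imports System.Text

Module LongWordGuardSketch
    ' Hypothetical helper based on the posted sample, with a guard added.
    Function WrapAfter(ByVal theText As String, ByVal lineLength As Integer) As String
        Dim allLines As New StringBuilder()
        While theText.Length > lineLength
            Dim currPos As Integer = theText.IndexOf(" "c, lineLength)
            ' Without this check the loop would never end once the rest of
            ' the text holds no more spaces.
            If currPos = -1 Then Exit While
            ' Trim so no leading or trailing space is carried into the line.
            allLines.Append(theText.Substring(0, currPos).Trim())
            allLines.Append(Environment.NewLine)
            theText = theText.Remove(0, currPos + 1)
        End While
        ' Append the remaining part of the text (last line).
        allLines.Append(theText.Trim())
        Return allLines.ToString()
    End Function

    Sub Main()
        ' A 200-character "word" with no spaces: the call returns instead of looping forever.
        Console.WriteLine(WrapAfter(New String("x"c, 200), 70).Length)
    End Sub
End Module
```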

/JB
 
cj

I kinda agree with you that perhaps the string search methods and
functions like InStrRev and LastIndexOf will not be the fastest way.

Also perhaps not the fastest way but I'm impressed with the, new to me
at least, split command and can see this as parsing the whole thing out
into words then adding up words.

Still, only Daren knows how fast it needs to be. Many times the
difference isn't that much. Many times for me it comes down to what I
understand best. For me if it works usually everyone is happy.
Something like this if I had the time might intrigue me to test it all 3
ways on a huge chunk of data. I'd have the program time itself.

It goes without saying you are correct of course on the need for error
detection. Interesting you should point out "words" larger than 70.
That's an error a lot of folks could overlook but it could occur.

This conversation on the fastest way makes me think of something I've
noticed over the years. Please note, I don't condone this and I have
NOT done this, on purpose, before. Throwing together a slow app that
gets the job done wins you praise for getting the program written
quickly. Wait till they grumble it's slow and then throw in a faster
routine and you're a hero again! I heard of a programmer who took this to
the extreme. He built wait loops into his code to purposely slow it
down. Months later when given the project to try to speed up the
processing he said he'd try. Weeks later he was praised for making it so
much faster. All he'd done was reduce the number of iterations his code
spent in the wait loops. Makes you sick doesn't it? Of course this
only works if you're the only one that sees the code! I think that's how
he got caught.

What have I learned from these observations and this fellow? People
want the job done NOW. It's what I'm paid for. I do the best I can
making sure it's done within the time allotted. Everyone's happy.
(Except me, I'm rarely happy with my code but the realization that
getting it done even if not the best way IS doing my job helps me cope.)
I then continue to work on the code as I have time until I get it right
and put in the changes. Of course if you follow my lead on this, make
darn sure you are improving things with your changes. You don't want to
introduce bugs into something that's working.
 
Joergen Bech

---snip---
This conversation on the fastest way makes me think of something I've
noticed over the years. Please note, I don't condone this and I have
NOT done this, on purpose, before. Throwing together a slow app that
gets the job done wins you praise for getting the program written
quickly. Wait till they grumble it's slow and then throw in a faster
---snip---

First, write for clarity. Second, measure performance. Third, optimize
if necessary.

As for removing each line from the original string, I was merely
pointing this out because this *is* a common "error" - just as bad
as creating one large string by concatenating many small strings
rather than using the StringBuilder class.

The Split approach would avoid the ">70-characters line" problem.
I am sure the final code would be cleaner, though no shorter than
keeping track of start/end positions and extracting substrings, and
I would guess that performance would be worse.

/JB
 
