How to trim a string to the last "."

V

VCKW

I have 11,000 long text strings in Excel... I had used LEFT operation to
limit the strings to 500 charaters, now I need to trim the strings to the
last available "." so that I can delete those uncompleted sentence after the
last "."

I tried to use "Text to Column", but it does not give me good result...

Please HELP!!

Thank you!
 
R

Ron Coderre

With the original text in A1

This formula (shown in segments for readability) shortens that string to 500
characters, then truncates it after the last period:
B1: =LEFT(LEFT(A1,500),FIND(CHAR(7),
SUBSTITUTE(LEFT(A1,500),".",CHAR(7),LEN(LEFT(A1,500))-
LEN(SUBSTITUTE(LEFT(A1,500),".",""))))+1)

Copy that formula down as far as you need.

Is that something you can work with?
--------------------------

Regards,

Ron
Microsoft MVP (Excel)
(XL2003, Win XP)
 
R

Rick Rothstein \(MVP - VB\)

I realize that, using a maximum of 500 characters, you probably will never
run into the situation where there are no periods in the (initially)
truncated string of text; however, if you should ever want to modify this
formula for use in a situation where that possibility exists (that is, if
you ever need to change the 500 to something much, much less), then the
following function will return the string of text, truncated to 500
characters, if there is no periods within the first 500 characters; also, if
A1 is empty, the formula returns the empty string.

=LEFT(A1,FIND(CHAR(7),SUBSTITUTE(LEFT(A1,500)&".",".",CHAR(7),NOT(ISNUMBER(FIND(".",LEFT(A1,500))))+LEN(LEFT(A1,500))-LEN(SUBSTITUTE(LEFT(A1,500),".","")))))

Although the formula is similar to the one Ron posted, it is built slightly
differently to account for the functionality I outlined above; however, I
did use the same CHAR(7) substitution character that Ron used.

Rick
 
R

Ron Rosenfeld

This formula (shown in segments for readability) shortens that string to 500
characters, then truncates it after the last period:
B1: =LEFT(LEFT(A1,500),FIND(CHAR(7),
SUBSTITUTE(LEFT(A1,500),".",CHAR(7),LEN(LEFT(A1,500))-
LEN(SUBSTITUTE(LEFT(A1,500),".",""))))+1)

Your formula also returns the character following the ".". Suggest you omit
the last "+1"


--ron
 
R

Ron Coderre

Thanks, Ron. Good catch.

--------------------------

Best Regards,

Ron
Microsoft MVP (Excel)
(XL2003, Win XP)
 
R

Rick Rothstein \(MVP - VB\)

You already got some solutions, but just for the fun of it:
=regexpreplace(A1,"^(.*\.)?.*$","$1")

This UDF you can get here:
http://www.sulprobil.com/html/regexp.html

Well, if you are looking for a UDF type solution, this one-liner would do
what your code does, as it applies to the OP's request (assuming you added
the part to initially truncate the text at 500 characters before looking for
the last period)...

Function TruncateAtPeriod(ByVal Source As String, _
Optional TrimAt As Long = 30000) As String
TruncateAtPeriod = Left(Source, InStrRev(Left(Source, TrimAt), "."))
End Function

where the OP would call it like this...

=TruncateAtPeriod(A1,500)

Both of our code returns the empty string if, in the unlikely event, no
periods exist within the text (in my case, prior to the 500th character), so
additional code would be needed in order to return the whole string of text
(assuming that is what the OP would want to happen when no periods are found
within the text).

Rick
 
B

Bernd P

Hello Rick,

....
additional code would be needed in order to return the whole string of text
(assuming that is what the OP would want to happen when no periods are found
within the text).
....

Less code in my example: just omit the "?" :)

Regards,
Bernd
 
R

Rick Rothstein \(MVP - VB\)

additional code would be needed in order to return the whole string of
...

Less code in my example: just omit the "?" :)

Well, I wasn't talking about a *lot* of additional code. Still a
one-liner....

Function TruncateAtPeriod(ByVal Source As String, _
Optional TrimAt As Long = 30000) As String
TruncateAtPeriod = Left(Source, InStrRev(Left(Source, TrimAt), ".") - _
Len(Source) * (Not Source Like "*.*"))
End Function

Besides, you still have to include *more* code somewhere to perform the
initial truncation (to 500 characters).<g>

Rick
 
R

Ron Rosenfeld

Hello Rick,

...
...

Less code in my example: just omit the "?" :)

Regards,
Bernd

Bernd,

A couple of points.

Your "code" is not really shorter if you include the VBA code required for the
UDF.

But, more importantly, I don't believe your code will work with a multiline
text string because of the limitations of the VBScript flavor of regex.

In particular, there is no provision, in VBScript to enable <dot matches
newline>.

OK, I just downloaded the UDF and tested it and, indeed, it does not work on
multiline text strings.

If you want to use regular expressions for this exercise, I would suggest the
following UDF with a regex constructed according to the maximum length of the
string (500 in this case, but there is an optional argument to change that:

=========================================
Option Explicit
Function reTruncateAtDot(str As String, _
Optional NumChars As Long = 500) As String
Dim re As Object, mc As Object
Dim sPat As String

sPat = "^[\s\S]{1," & NumChars & "}\."
Set re = CreateObject("vbscript.regexp")
re.Pattern = sPat
Set mc = re.Execute(str)
reTruncateAtDot = mc(0)
End Function
==============================

The OP could use either:

=reTruncateAtDot(A1) which would truncate at the last dot prior to the 501st
character, or use an optional NumChars to change that.

The regex with the 500 characters "hard-coded" would look like:

"^[\s\S]{1,500}\."

If you wanted to use your formula, and have it work with multiline text
strings, then try:

=regexpreplace(A1,"^([\s\S]*\.)?[\s\S]*$","$1")

But you'd still need the second step to trim to 500 characters.

Oh, and in my contribution, since the OP did not specify what he wants to
happen if there are no dots in the first 500 characters, a VALUE error will be
returned. That, obviously, could be changed depending on the OP's wishes.
--ron
 
R

Rick Rothstein \(MVP - VB\)

additional code would be needed in order to return the whole string of
A couple of points.

Your "code" is not really shorter if you include the VBA code required for
the
UDF.

That was going roughly going to be my answer when I first read Bernd's
response; but I think, given the trimmed portion of the previous responses
in his posting and leaving aside the 500 character maximum truncation
requested by the OP, he was simply just addressing my comment about needing
*additional* code for my function to be able to return the entire text
string whereas to do the same with his code only required removing the
question mark. Yes, there is the question of the needed UDF code to make his
formula line work as well as the 500 character truncation issue; but I'm
pretty sure Bernd's posting was meant to focus in on this much narrower
issue without addressing anything else.

Rick

But, more importantly, I don't believe your code will work with a
multiline
text string because of the limitations of the VBScript flavor of regex.

In particular, there is no provision, in VBScript to enable <dot matches
newline>.

OK, I just downloaded the UDF and tested it and, indeed, it does not work
on
multiline text strings.

If you want to use regular expressions for this exercise, I would suggest
the
following UDF with a regex constructed according to the maximum length of
the
string (500 in this case, but there is an optional argument to change
that:

=========================================
Option Explicit
Function reTruncateAtDot(str As String, _
Optional NumChars As Long = 500) As String
Dim re As Object, mc As Object
Dim sPat As String

sPat = "^[\s\S]{1," & NumChars & "}\."
Set re = CreateObject("vbscript.regexp")
re.Pattern = sPat
Set mc = re.Execute(str)
reTruncateAtDot = mc(0)
End Function
==============================

The OP could use either:

=reTruncateAtDot(A1) which would truncate at the last dot prior to the
501st
character, or use an optional NumChars to change that.

The regex with the 500 characters "hard-coded" would look like:

"^[\s\S]{1,500}\."

If you wanted to use your formula, and have it work with multiline text
strings, then try:

=regexpreplace(A1,"^([\s\S]*\.)?[\s\S]*$","$1")

But you'd still need the second step to trim to 500 characters.

Oh, and in my contribution, since the OP did not specify what he wants to
happen if there are no dots in the first 500 characters, a VALUE error
will be
returned. That, obviously, could be changed depending on the OP's wishes.
--ron
 
R

Ron Rosenfeld

That was going roughly going to be my answer when I first read Bernd's
response; but I think, given the trimmed portion of the previous responses
in his posting and leaving aside the 500 character maximum truncation
requested by the OP, he was simply just addressing my comment about needing
*additional* code for my function to be able to return the entire text
string whereas to do the same with his code only required removing the
question mark. Yes, there is the question of the needed UDF code to make his
formula line work as well as the 500 character truncation issue; but I'm
pretty sure Bernd's posting was meant to focus in on this much narrower
issue without addressing anything else.

Oh, I see.

Of course, I don't believe his regex will work for multiline text strings, for
the reason I pointed out in my response. And it seems likely to me that a 500
character text string in Excel would likely have at least one linefeed. So he
should substitute "[\s\S]" for the "." in his regex.


--ron
 
R

Rick Rothstein \(MVP - VB\)

That was going roughly going to be my answer when I first read Bernd's
response; but I think, given the trimmed portion of the previous responses
in his posting and leaving aside the 500 character maximum truncation
requested by the OP, he was simply just addressing my comment about
needing
*additional* code for my function to be able to return the entire text
string whereas to do the same with his code only required removing the
question mark. Yes, there is the question of the needed UDF code to make
his
formula line work as well as the 500 character truncation issue; but I'm
pretty sure Bernd's posting was meant to focus in on this much narrower
issue without addressing anything else.

Oh, I see.

Of course, I don't believe his regex will work for multiline text strings,
for
the reason I pointed out in my response. And it seems likely to me that a
500
character text string in Excel would likely have at least one linefeed.
So he
should substitute "[\s\S]" for the "." in his regex.

I'll take your word for the regex stuff... the last time I did anything with
regular expressions was back in the mid-1980s on a UNIX based
mini-computer... I'm afraid I remember next to nothing from back then
(except that it was quite easy to write complex regular expressions whose
operation was perfectly obvious during their construction but that one could
not understand how they worked some 10 minutes or so later on<g>).

Rick
 
B

Bernd P

Hi Ron, hi Rick,

let us agree that regex are quite powerful. Since they are line-
orientated (or we are facing line restrictions) we can wait and see
whether my suggesion will work for the OP.

And now let us wait for the next two dozen questions regarding string
manipulation and let's count the overall sum of UDF rows plus calls to
them (it)...

Regards,
Bernd
 
R

Ron Rosenfeld

(except that it was quite easy to write complex regular expressions whose
operation was perfectly obvious during their construction but that one could
not understand how they worked some 10 minutes or so later on<g>).

That is still the case!
--ron
 
R

Ron Rosenfeld

Hi Ron, hi Rick,

let us agree that regex are quite powerful. Since they are line-
orientated (or we are facing line restrictions) we can wait and see
whether my suggesion will work for the OP.

And now let us wait for the next two dozen questions regarding string
manipulation and let's count the overall sum of UDF rows plus calls to
them (it)...

Regards,
Bernd

Why do you write they are "line-oriented"?

At least in the few flavors with which I am familiar, multiline implementations
are common. The issue here is that with VBA, setting multiline to TRUE (which
is allowed in the UDF you recommended) only changes the behavior of "^" and
"$", and not the behavior of "." as is true for other flavors. Simple enough
to work around, though.
--ron
 
R

Ron Rosenfeld

Why do you write they are "line-oriented"?

At least in the few flavors with which I am familiar, multiline implementations
are common. The issue here is that with VBA, setting multiline to TRUE (which
is allowed in the UDF you recommended) only changes the behavior of "^" and
"$", and not the behavior of "." as is true for other flavors. Simple enough
to work around, though.
--ron

I really should have included the qualification that the original Unix editors,
and regex engines were, indeed, line oriented. However, when I first worked
with Unix (actually I think it was Xenix, if I recall correctly) back in the
early '80's, vi (a screen editor) had already been developed and was pretty
widely distributed.


--ron
 
R

Ron Rosenfeld

Why do you write they are "line-oriented"?

At least in the few flavors with which I am familiar, multiline implementations
are common. The issue here is that with VBA, setting multiline to TRUE (which
is allowed in the UDF you recommended) only changes the behavior of "^" and
"$", and not the behavior of "." as is true for other flavors. Simple enough
to work around, though.
--ron

Just for kicks, if the OP does not want to return an error if there are no dots
in the string, here is a regex that will

1. Return a maximum of 500 characters and, in order of priority
a. Return up to the last "." if there is one
b. If no ".", then split at the last <space>
c. If no "." or <space>, return 500 characters.

(note that it does not return the ending <space> if that is what it is
splitting on. It should work with multiline text strings in VBA:

"^([\s\S]{1,500}\.|[\s\S]{1,500}(?= )|[\s\S]{1,500})"


--ron
 
R

Rick Rothstein \(MVP - VB\)

Just for kicks, if the OP does not want to return an error if there are no
dots
in the string, here is a regex that will

1. Return a maximum of 500 characters and, in order of priority
a. Return up to the last "." if there is one
b. If no ".", then split at the last <space>
c. If no "." or <space>, return 500 characters.

(note that it does not return the ending <space> if that is what it is
splitting on. It should work with multiline text strings in VBA:

"^([\s\S]{1,500}\.|[\s\S]{1,500}(?= )|[\s\S]{1,500})"

I like the "split at last space" condition 'challenge'. A little more
"wordy" than your regex solution, but still a one-liner (my own self-imposed
restriction) VBA function...

Function Truncate(ByVal Source As String, _
Optional TrimAt As Long = 30000) As String
Truncate = RTrim$(Left(Source, InStrRev(Left(Source, TrimAt), ".") - _
(InStrRev(Left(Source, TrimAt), " ")) * _
(Not Left(Source, TrimAt) Like "*.*") - _
(Len(Left(Source, TrimAt))) * _
(Not Left(Source, TrimAt) Like "*[. ]*")))
End Function

Note that I still left the optional TrimAt argument so the OP can set it to
500 (or any other value desired) in the calling formula.

Rick
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top