extract text from string, leaving numbers and operators

C

CEG

I have a column containing text and numbers with operators, such as:
"4 weeks * 40 hours per week." or
"4 wks * 40 hrs per week = 160 hours." or
"2 wks * 40 hrs per week + 2 wks * 20 hrs per week."

I need to strip out the text and end up with a formula, such as "4*40",
resulting in 160.

Need to recognize +, -, *, /. And need to ignore anthing after =, if there
is one.

The objective is to take what is sometimes a long text string and check the
math in the string.

I'm guessing this will take some fancy code. Any help greatly appreciated.
 
J

Joel

Try this UDF

Function GetResults(Target As String)

Results = ""
For i = 1 To Len(Target)
Select Case Mid(Target, i, 1)
Case "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "*", "+", "/",
"-"
Results = Results & Mid(Target, i, 1)
End Select
Next i

GetResults = Evaluate(Results)
End Function
 
C

CEG

This works well except when there is a result entered (i.e. 4 * 40 = 160).
If some goober enters 4 * 40 = 140, I want to be able to spot it without my
calculator.

Thanks for your help so far!!!
 
C

CEG

One more glitch. What if there are decimals in the numbers? As the formula
is now, .5 becomes 5...not good.
 
R

Ron Rosenfeld

I have a column containing text and numbers with operators, such as:
"4 weeks * 40 hours per week." or
"4 wks * 40 hrs per week = 160 hours." or
"2 wks * 40 hrs per week + 2 wks * 20 hrs per week."

I need to strip out the text and end up with a formula, such as "4*40",
resulting in 160.

Need to recognize +, -, *, /. And need to ignore anthing after =, if there
is one.

The objective is to take what is sometimes a long text string and check the
math in the string.

I'm guessing this will take some fancy code. Any help greatly appreciated.


==================
Option Explicit
Function foo(sStr)
Dim i As Long
Dim sTemp As String
Dim re As Object
Set re = CreateObject("vbscript.regexp")
re.Global = True
re.Pattern = "[^-\d+*/^=]"
sTemp = re.Replace(sStr, "")
re.Pattern = "=.*"
foo = Evaluate(re.Replace(sTemp, ""))
End Function
==========================
--ron
 
C

CEG

Thanks Ron. This does ignore what comes after the =, which is great, but I
still have the problem of numbers with decimals. See my last post before
this one.
--
CG


Ron Rosenfeld said:
I have a column containing text and numbers with operators, such as:
"4 weeks * 40 hours per week." or
"4 wks * 40 hrs per week = 160 hours." or
"2 wks * 40 hrs per week + 2 wks * 20 hrs per week."

I need to strip out the text and end up with a formula, such as "4*40",
resulting in 160.

Need to recognize +, -, *, /. And need to ignore anthing after =, if there
is one.

The objective is to take what is sometimes a long text string and check the
math in the string.

I'm guessing this will take some fancy code. Any help greatly appreciated.


==================
Option Explicit
Function foo(sStr)
Dim i As Long
Dim sTemp As String
Dim re As Object
Set re = CreateObject("vbscript.regexp")
re.Global = True
re.Pattern = "[^-\d+*/^=]"
sTemp = re.Replace(sStr, "")
re.Pattern = "=.*"
foo = Evaluate(re.Replace(sTemp, ""))
End Function
==========================
--ron
 
C

CEG

I figured it out...added a decimal in the Pattern list. Thanks!!!!!
--
CG


CEG said:
Thanks Ron. This does ignore what comes after the =, which is great, but I
still have the problem of numbers with decimals. See my last post before
this one.
--
CG


Ron Rosenfeld said:
I have a column containing text and numbers with operators, such as:
"4 weeks * 40 hours per week." or
"4 wks * 40 hrs per week = 160 hours." or
"2 wks * 40 hrs per week + 2 wks * 20 hrs per week."

I need to strip out the text and end up with a formula, such as "4*40",
resulting in 160.

Need to recognize +, -, *, /. And need to ignore anthing after =, if there
is one.

The objective is to take what is sometimes a long text string and check the
math in the string.

I'm guessing this will take some fancy code. Any help greatly appreciated.


==================
Option Explicit
Function foo(sStr)
Dim i As Long
Dim sTemp As String
Dim re As Object
Set re = CreateObject("vbscript.regexp")
re.Global = True
re.Pattern = "[^-\d+*/^=]"
sTemp = re.Replace(sStr, "")
re.Pattern = "=.*"
foo = Evaluate(re.Replace(sTemp, ""))
End Function
==========================
--ron
 
R

Ron Rosenfeld

I figured it out...added a decimal in the Pattern list. Thanks!!!!!

Glad to help. Thanks for the feedback.

Adding a decimal to the pattern list will work so long as there are no decimals
that are part of the text to be eliminated.

If that might be the case, post some examples and I could rework the pattern.

Also, you should be aware that using the decimal within the pattern list (i.e.
[.] works; but if the decimal were to be used outside of the pattern list, it
has a different meaning (it would match any character in that circumstance).
--ron
 
C

CEG

I figured a workaround for the decimal vs period problem. However, I do have
one other issue…is there a way to extract any 7-digit numbers? Sometimes
there is a reference number used that does not belong in the calculation.
--
CG


Ron Rosenfeld said:
I figured it out...added a decimal in the Pattern list. Thanks!!!!!

Glad to help. Thanks for the feedback.

Adding a decimal to the pattern list will work so long as there are no decimals
that are part of the text to be eliminated.

If that might be the case, post some examples and I could rework the pattern.

Also, you should be aware that using the decimal within the pattern list (i.e.
[.] works; but if the decimal were to be used outside of the pattern list, it
has a different meaning (it would match any character in that circumstance).
--ron
 
R

Ron Rosenfeld

I figured a workaround for the decimal vs period problem. However, I do have
one other issue…is there a way to extract any 7-digit numbers? Sometimes
there is a reference number used that does not belong in the calculation.

Yes there is.

Some examples would be useful, but, in general, you should be able to just add
to Pattern and argument representing seven digits in a row.

Is the reference number separated from the rest of the string by a <space>?

You could use something like the pattern below:

"(=.*)|[^-\d+*/^.]|(\b\d{7}\b)"

Please note that I also removed the "=" sign from the list of things to retain,
and chose to delete the "=" and everything that follows it. This may or may
not make a difference, but did in one of your examples where you had an "="
sign.
--ron
 
C

CEG

The 7-digit fix worked great. Deleting "=" and everything after is a good
change too...thanks.

Can I extract the "/" character also? Some people use it in
abbreviations…such as "w/out" for "without." It will never be used for
division in my data.

It would be great if you could explain what each step of this formula is
doing. I hate to keep asking you for help in tweaking it.

Here are some examples, although the variety of actual data is too much to
list:

0.045 Fabrication hours per system * 100 Long Machines + 0.045 * 2 Short
Machines

6 pages (one table & one Illustration) * 3.9hrs per page = 24 hrs

The 6 hours represent actual labor expenditure for I/O 5024549. LOE for Jan
09 is 3 hrs total effort. (this one creates a problem with the "09"; I can
replace the "." with a "+" and get 9 hrs)

I don't expect to be able to resolve everything, but if I can automatically
check calculations on a majority of the data, it can cut hours & hours of
work per project. What you've given me so far is getting about 75%
results...which is great!
--
CG


Ron Rosenfeld said:
I figured a workaround for the decimal vs period problem. However, I do have
one other issue…is there a way to extract any 7-digit numbers? Sometimes
there is a reference number used that does not belong in the calculation.

Yes there is.

Some examples would be useful, but, in general, you should be able to just add
to Pattern and argument representing seven digits in a row.

Is the reference number separated from the rest of the string by a <space>?

You could use something like the pattern below:

"(=.*)|[^-\d+*/^.]|(\b\d{7}\b)"

Please note that I also removed the "=" sign from the list of things to retain,
and chose to delete the "=" and everything that follows it. This may or may
not make a difference, but did in one of your examples where you had an "="
sign.
--ron
 
R

Ron Rosenfeld

The 7-digit fix worked great. Deleting "=" and everything after is a good
change too...thanks.

Can I extract the "/" character also? Some people use it in
abbreviations…such as "w/out" for "without." It will never be used for
division in my data.

You just remove it from this part of the pattern where it is listed in the
"character class": [[^-\d+*/^.] --> [^-\d+*^.]
It would be great if you could explain what each step of this formula is
doing. I hate to keep asking you for help in tweaking it.

I don't mind helping you along, but hopefully the references below will help
you get a grasp on this.


The function is based on Regular Expressions.

For information on how to implement this in VBA, see:

http://support.microsoft.com/default.aspx?scid=kb;en-us;818802&Product=vbb
http://msdn2.microsoft.com/en-us/library/6wzad2b2.aspx

For basic Regular Expression syntax, see:

http://www.regular-expressions.info/reference.html

Here are some examples, although the variety of actual data is too much to
list:

0.045 Fabrication hours per system * 100 Long Machines + 0.045 * 2 Short
Machines

6 pages (one table & one Illustration) * 3.9hrs per page = 24 hrs

The 6 hours represent actual labor expenditure for I/O 5024549. LOE for Jan
09 is 3 hrs total effort. (this one creates a problem with the "09"; I can
replace the "." with a "+" and get 9 hrs)

I don't expect to be able to resolve everything, but if I can automatically
check calculations on a majority of the data, it can cut hours & hours of
work per project. What you've given me so far is getting about 75%
results...which is great!

Well, looking at all the examples you have provided, this Pattern seems to do
the job. I have split it onto several lines so that when you copy it into the
module, you will not have problems with extra spaces or line breaks, but it
could be written all on one line as a single expression:

re.Pattern = "=[\s\S]*|\d{7}|^[^-+*]*$|\.(?!\d)|" & _
"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)" & _
"\D*\d{1,2}|[^-+*\d.]"


In brief, this expression looks for, and deletes (in order):

1. Any "=" sign and everything following it in the cell
2. Any series of exactly seven digits
3. Any complete line which does not contain any arithmetic operators.
4. Any "." that is not immediately followed by a digit.
5. Any sequence that starts with a month name and ends with one or two digits.
(note that if you could have a sequence such as Jan 2009, you might
want to change the sequence \d{1,2} to \d{1,4} )
6. Anything else that is not a digit or a minus, or a plus, or an asterisk, or
a decimal (".")

The different groupings in the pattern are pipe-delimited.

Here is my current version of this UDF, with some annotations.
===============================
Option Explicit
Function foo(sStr) As Double
Dim sTemp As String

'set up the regular expression object and
'specify the Global and Pattern properties
Dim re As Object
Set re = CreateObject("vbscript.regexp")
re.Global = True
re.Pattern = "=[\s\S]*|\d{7}|^[^-+*]*$|\.(?!\d)|" & _
"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)" & _
"\D*\d{1,2}|[^-+*\d.]"

'Make sTemp equal to the original function argument with the
'substitutions carried out
sTemp = re.Replace(sStr, "")

'Use the Evaluate method to make foo equal to the desired result
foo = Evaluate(re.Replace(sTemp, ""))
End Function
================================

Without the annotations, it's pretty short, for all it does:

=====================================
Option Explicit
Function foo(sStr) As Double
Dim sTemp As String
Dim re As Object
Set re = CreateObject("vbscript.regexp")
re.Global = True
re.Pattern = "=[\s\S]*|\d{7}|^[^-+*]*$|\.(?!\d)|" & _
"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)" & _
"\D*\d{1,2}|[^-+*\d.]"
sTemp = re.Replace(sStr, "")
foo = Evaluate(re.Replace(sTemp, ""))
End Function
==================================

Here is a formal explanation of the Regular Expression. It probably won't make
much sense until you've had a chance to review the above references:

Extract

=[\s\S]*|\d{7}|^[^-+*]*$|\.(?!\d)|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\D*\d{1,2}|[^-+*\d.]

Match either the regular expression below (attempting the next alternative only
if this one fails) «=[\s\S]*»
Match the character “=” literally «=»
Match a single character present in the list below «[\s\S]*»
Between zero and unlimited times, as many times as possible, giving back
as needed (greedy) «*»
A whitespace character (spaces, tabs, line breaks, etc.) «\s»
Any character that is NOT a whitespace character «\S»
Or match regular expression number 2 below (attempting the next alternative
only if this one fails) «\d{7}»
Match a single digit 0..9 «\d{7}»
Exactly 7 times «{7}»
Or match regular expression number 3 below (attempting the next alternative
only if this one fails) «^[^-+*]*$»
Assert position at the beginning of the string «^»
Match a single character NOT present in the list “-+*” «[^-+*]*»
Between zero and unlimited times, as many times as possible, giving back
as needed (greedy) «*»
Assert position at the end of the string (or before the line break at the
end of the string, if any) «$»
Or match regular expression number 4 below (attempting the next alternative
only if this one fails) «\.(?!\d)»
Match the character “.” literally «\.»
Assert that it is impossible to match the regex below starting at this
position (negative lookahead) «(?!\d)»
Match a single digit 0..9 «\d»
Or match regular expression number 5 below (attempting the next alternative
only if this one fails)
«(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\D*\d{1,2}»
Match the regular expression below and capture its match into backreference
number 1 «(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)»
Match either the regular expression below (attempting the next
alternative only if this one fails) «Jan»
Match the characters “Jan” literally «Jan»
Or match regular expression number 2 below (attempting the next
alternative only if this one fails) «Feb»
Match the characters “Feb” literally «Feb»
Or match regular expression number 3 below (attempting the next
alternative only if this one fails) «Mar»
Match the characters “Mar” literally «Mar»
Or match regular expression number 4 below (attempting the next
alternative only if this one fails) «Apr»
Match the characters “Apr” literally «Apr»
Or match regular expression number 5 below (attempting the next
alternative only if this one fails) «May»
Match the characters “May” literally «May»
Or match regular expression number 6 below (attempting the next
alternative only if this one fails) «Jun»
Match the characters “Jun” literally «Jun»
Or match regular expression number 7 below (attempting the next
alternative only if this one fails) «Jul»
Match the characters “Jul” literally «Jul»
Or match regular expression number 8 below (attempting the next
alternative only if this one fails) «Aug»
Match the characters “Aug” literally «Aug»
Or match regular expression number 9 below (attempting the next
alternative only if this one fails) «Sep»
Match the characters “Sep” literally «Sep»
Or match regular expression number 10 below (attempting the next
alternative only if this one fails) «Oct»
Match the characters “Oct” literally «Oct»
Or match regular expression number 11 below (attempting the next
alternative only if this one fails) «Nov»
Match the characters “Nov” literally «Nov»
Or match regular expression number 12 below (the entire group fails if
this one fails to match) «Dec»
Match the characters “Dec” literally «Dec»
Match a single character that is not a digit 0..9 «\D*»
Between zero and unlimited times, as many times as possible, giving back
as needed (greedy) «*»
Match a single digit 0..9 «\d{1,2}»
Between one and 2 times, as many times as possible, giving back as needed
(greedy) «{1,2}»
Or match regular expression number 6 below (the entire match attempt fails if
this one fails to match) «[^-+*\d.]»
Match a single character NOT present in the list below «[^-+*\d.]»
One of the characters “-+*” «-+*»
A single digit 0..9 «\d»
The character “.” «.»


Created with RegexBuddy
--ron
 
C

CEG

Finally have time to return to this. Thanks for the step-by-step and the
links...I now understand a little bit more about Regular Expressions...could
maybe even write a simple one!
--
CG


Ron Rosenfeld said:
The 7-digit fix worked great. Deleting "=" and everything after is a good
change too...thanks.

Can I extract the "/" character also? Some people use it in
abbreviations…such as "w/out" for "without." It will never be used for
division in my data.

You just remove it from this part of the pattern where it is listed in the
"character class": [[^-\d+*/^.] --> [^-\d+*^.]
It would be great if you could explain what each step of this formula is
doing. I hate to keep asking you for help in tweaking it.

I don't mind helping you along, but hopefully the references below will help
you get a grasp on this.


The function is based on Regular Expressions.

For information on how to implement this in VBA, see:

http://support.microsoft.com/default.aspx?scid=kb;en-us;818802&Product=vbb
http://msdn2.microsoft.com/en-us/library/6wzad2b2.aspx

For basic Regular Expression syntax, see:

http://www.regular-expressions.info/reference.html

Here are some examples, although the variety of actual data is too much to
list:

0.045 Fabrication hours per system * 100 Long Machines + 0.045 * 2 Short
Machines

6 pages (one table & one Illustration) * 3.9hrs per page = 24 hrs

The 6 hours represent actual labor expenditure for I/O 5024549. LOE for Jan
09 is 3 hrs total effort. (this one creates a problem with the "09"; I can
replace the "." with a "+" and get 9 hrs)

I don't expect to be able to resolve everything, but if I can automatically
check calculations on a majority of the data, it can cut hours & hours of
work per project. What you've given me so far is getting about 75%
results...which is great!

Well, looking at all the examples you have provided, this Pattern seems to do
the job. I have split it onto several lines so that when you copy it into the
module, you will not have problems with extra spaces or line breaks, but it
could be written all on one line as a single expression:

re.Pattern = "=[\s\S]*|\d{7}|^[^-+*]*$|\.(?!\d)|" & _
"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)" & _
"\D*\d{1,2}|[^-+*\d.]"


In brief, this expression looks for, and deletes (in order):

1. Any "=" sign and everything following it in the cell
2. Any series of exactly seven digits
3. Any complete line which does not contain any arithmetic operators.
4. Any "." that is not immediately followed by a digit.
5. Any sequence that starts with a month name and ends with one or two digits.
(note that if you could have a sequence such as Jan 2009, you might
want to change the sequence \d{1,2} to \d{1,4} )
6. Anything else that is not a digit or a minus, or a plus, or an asterisk, or
a decimal (".")

The different groupings in the pattern are pipe-delimited.

Here is my current version of this UDF, with some annotations.
===============================
Option Explicit
Function foo(sStr) As Double
Dim sTemp As String

'set up the regular expression object and
'specify the Global and Pattern properties
Dim re As Object
Set re = CreateObject("vbscript.regexp")
re.Global = True
re.Pattern = "=[\s\S]*|\d{7}|^[^-+*]*$|\.(?!\d)|" & _
"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)" & _
"\D*\d{1,2}|[^-+*\d.]"

'Make sTemp equal to the original function argument with the
'substitutions carried out
sTemp = re.Replace(sStr, "")

'Use the Evaluate method to make foo equal to the desired result
foo = Evaluate(re.Replace(sTemp, ""))
End Function
================================

Without the annotations, it's pretty short, for all it does:

=====================================
Option Explicit
Function foo(sStr) As Double
Dim sTemp As String
Dim re As Object
Set re = CreateObject("vbscript.regexp")
re.Global = True
re.Pattern = "=[\s\S]*|\d{7}|^[^-+*]*$|\.(?!\d)|" & _
"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)" & _
"\D*\d{1,2}|[^-+*\d.]"
sTemp = re.Replace(sStr, "")
foo = Evaluate(re.Replace(sTemp, ""))
End Function
==================================

Here is a formal explanation of the Regular Expression. It probably won't make
much sense until you've had a chance to review the above references:

Extract

=[\s\S]*|\d{7}|^[^-+*]*$|\.(?!\d)|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\D*\d{1,2}|[^-+*\d.]

Match either the regular expression below (attempting the next alternative only
if this one fails) «=[\s\S]*»
Match the character “=†literally «=»
Match a single character present in the list below «[\s\S]*»
Between zero and unlimited times, as many times as possible, giving back
as needed (greedy) «*»
A whitespace character (spaces, tabs, line breaks, etc.) «\s»
Any character that is NOT a whitespace character «\S»
Or match regular expression number 2 below (attempting the next alternative
only if this one fails) «\d{7}»
Match a single digit 0..9 «\d{7}»
Exactly 7 times «{7}»
Or match regular expression number 3 below (attempting the next alternative
only if this one fails) «^[^-+*]*$»
Assert position at the beginning of the string «^»
Match a single character NOT present in the list “-+*†«[^-+*]*»
Between zero and unlimited times, as many times as possible, giving back
as needed (greedy) «*»
Assert position at the end of the string (or before the line break at the
end of the string, if any) «$»
Or match regular expression number 4 below (attempting the next alternative
only if this one fails) «\.(?!\d)»
Match the character “.†literally «\.»
Assert that it is impossible to match the regex below starting at this
position (negative lookahead) «(?!\d)»
Match a single digit 0..9 «\d»
Or match regular expression number 5 below (attempting the next alternative
only if this one fails)
«(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\D*\d{1,2}»
Match the regular expression below and capture its match into backreference
number 1 «(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)»
Match either the regular expression below (attempting the next
alternative only if this one fails) «Jan»
Match the characters “Jan†literally «Jan»
Or match regular expression number 2 below (attempting the next
alternative only if this one fails) «Feb»
Match the characters “Feb†literally «Feb»
Or match regular expression number 3 below (attempting the next
alternative only if this one fails) «Mar»
Match the characters “Mar†literally «Mar»
Or match regular expression number 4 below (attempting the next
alternative only if this one fails) «Apr»
Match the characters “Apr†literally «Apr»
Or match regular expression number 5 below (attempting the next
alternative only if this one fails) «May»
Match the characters “May†literally «May»
Or match regular expression number 6 below (attempting the next
alternative only if this one fails) «Jun»
Match the characters “Jun†literally «Jun»
Or match regular expression number 7 below (attempting the next
alternative only if this one fails) «Jul»
Match the characters “Jul†literally «Jul»
Or match regular expression number 8 below (attempting the next
alternative only if this one fails) «Aug»
Match the characters “Aug†literally «Aug»
Or match regular expression number 9 below (attempting the next
alternative only if this one fails) «Sep»
Match the characters “Sep†literally «Sep»
Or match regular expression number 10 below (attempting the next
alternative only if this one fails) «Oct»
Match the characters “Oct†literally «Oct»
Or match regular expression number 11 below (attempting the next
alternative only if this one fails) «Nov»
Match the characters “Nov†literally «Nov»
Or match regular expression number 12 below (the entire group fails if
this one fails to match) «Dec»
Match the characters “Dec†literally «Dec»
Match a single character that is not a digit 0..9 «\D*»
Between zero and unlimited times, as many times as possible, giving back
as needed (greedy) «*»
Match a single digit 0..9 «\d{1,2}»
Between one and 2 times, as many times as possible, giving back as needed
(greedy) «{1,2}»
Or match regular expression number 6 below (the entire match attempt fails if
this one fails to match) «[^-+*\d.]»
Match a single character NOT present in the list below «[^-+*\d.]»
One of the characters “-+*†«-+*»
A single digit 0..9 «\d»
The character “.†«.»


Created with RegexBuddy
--ron
 
R

Ron Rosenfeld

Finally have time to return to this. Thanks for the step-by-step and the
links...I now understand a little bit more about Regular Expressions...could
maybe even write a simple one!

Once you gain some fluency, many operations can be developed more quickly using
them.
--ron
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top