Need formula to extract a numeric value from a free-format text

E

Eric_NY

Thanks. I'll need to take some time to decipher this. But first, can you
explain:

- What's the function of the double minus sign in the formula?
- Why does the date separator come into play here at all? The only thing I
need to do is to look for a 7-digit number in a text field where we know
nothing at all about the rest of the contents. Why should the operation of a
formula be dependent upon extraneous information, such as how a date is
formatted? The rest of the text string needs to be considered as a fully
arbitrary series of characters; in other words, the function needs to work
entirely irrespective of the remainder of the field.

Thanks for your help.
 
R

Rick Rothstein

First off, before answering your questions, here is a revised formula (the
previous one could not detect an 8 or more digit number in front of the 7
digit number you wanted to find...

=MID(A1,MIN(IF(ISNUMBER(--MID(SUBSTITUTE(A1," ","x"),ROW(1:99),7))*
ISERR(SEARCH("e",MID(A1,ROW(1:99),7)))*ISERR(FIND("/",MID(A1,ROW
(1:99),7)))*ISERROR(--MID(A1,ROW(1:99),8))*IF(ROW(1:99)>1,ISERROR
(--MID(A1,ROW(1:99),ROW(1:99)-1))),ROW(1:99))),7)

I also changed the limit to text strings up to 99 characters long. If you
could have longer text strings for this formula to process, then change
*all* the 99's in my formula to a number that is equal to the largest total
number of characters that your text could be. I also changed the formulas
cell reference from the arbitrary F5 I used in my previous submission to the
more standard A1 cell reference that is used when the actual cell reference
is unknown.

Okay, first off, the double minus sign is a means of converting the text
representation of a number into an actual numeric value. Excel will does
this conversion automatically whenever the text representation of a numeric
value is used in a numerical calculation. Putting the double minus sign in
front force Excel to attempt to multiply the text by minus one twice
(--TextNumber is the same as doing this...

-1*-1*TextNumber

If TextNumber is the actual representation of a numerical value, then that
number will be returned (because minus one times minus one is equivalent to
plus 1); if it is not the actual representation of a numerical value,
attempting to multiply it by the first minus sign (which is equivalent to
minus 1) will generate an error. So my code uses the double minus sign to
attempt to change the text at each point of the array's iteration to see if
ISNUMBER and ISERROR is true or not in order to calculate the values
necessary to retrieve the 7 digit number you are after.

And the above explanation is the key to understanding the second part of
your question. There are two problems with converting the text
representation of a number to an actual value... numbers that are powers of
10 (Excel uses and E, for exponent I'm guessing, to indicate a power of 10;
for example 1.23E4 is the same thing as 1234) and dates (which Excel works
with as offsets for January 1, 1900; so the human readable date of 7/20/2009
is really 40014). So, as the array formula iterates down the text, it will
would see text constructions like these 7 character long pieces of text...
1.23E45 and 07/20/09... as numbers when the double minus sign is applied to
them, so I needed to filter these "false positives" out if they occurred
before your actual 7 digit number. I handled this by looking for an embedded
"e" (either upper or lower case), for the E-Notation problem, or a date
separator symbol which is the slash in my system (but which can vary
depending on regional settings).
 
L

Lori

Thanks for the clarification so the formula i posted above should be robust.
Did you any luck with it?
 
E

Eric_NY

Very clever!

I've seen that double minus before and never understood the point. Thanks
for explaining.

Where in your formula is the logic which checks for the number starting
first in character position 1, then position 2, then 3, etc.? The only way I
could think of was something unwieldy, such as:
IF(ISNUMBER(MID(A1,1,7)),MID(A1,1,7),IF(ISNUMBER(MID(A1,2,7)),MID(A1,2,7)....
)))))))). Obviously that's not a great solution.

Yours is obviously more compact and better - if I can figure out how it works!

Thanks again for your help.
 
R

Ron Rosenfeld

I used the regex solution that Ron Rosenfeld suggested, and adjusted the
regular expression by removing the "\b" before and after the "\d{7}".


If you remove the "\b"'s, then the regex solution will return, for example, 7
digit portions of 8 digit numbers; or if you have an entry like NH123456789, it
will return 1234567. Is that what you want?
--ron
 
E

Eric_NY

I see what you mean.

The problem I was getting was that the "\b" was excluding some cases that I
found in my data (such as "SR1234567" and "1234567remedy"). there were other
cases with a non-space immediately adjacent to the 7-digit sequence. So I
just took out the "\b" part of the pattern.

(I was also embarrassed when I presented the results and realized that my
7-digit numbers actually began at 987262 - i.e., a 6-digit number, so not all
of them were in fact 7 digits. So my logic was wrong and I missed one that I
should have found.)

But for my current purposes this is good enough. I'm doing a one-time
analysis of several thousand records, and don't need to develop a permanent,
perfect solution. I revised the regular expression to be "good enough"
considering the data that I saw in front of me.

Many thanks for your help.
 
R

Rick Rothstein

I needed to make one more modification to my formula to correct an area
where it incorrectly missed the 7 digit number. I have tested this version
and I now believe it to work correctly in all situation. It is still an
array-entered** formula...

=MID(A1,MIN(IF(ISNUMBER(--MID(SUBSTITUTE(A1," ","x"),ROW(1:99),7))*
ISERR(SEARCH("e",MID(A1,ROW(1:99),7)))*ISERR(FIND("/",MID(A1,
ROW(1:99),7)))*ISERROR(--MID(A1&"x",ROW(1:99),8))*ISERROR(--MID(
"x"&A1,ROW(2:100)-1,1)),ROW(1:99))),7)

**Commit this formula using Ctrl+Shift+Enter, not just Enter by itself
 
R

Ron Rosenfeld

I see what you mean.

The problem I was getting was that the "\b" was excluding some cases that I
found in my data (such as "SR1234567" and "1234567remedy"). there were other
cases with a non-space immediately adjacent to the 7-digit sequence. So I
just took out the "\b" part of the pattern.

Aha. I see the problem. It's possible (but not necessary in view of what you
wrote below) to account for that. For example, one could look for the 7 digits
to be bounded by either a non-digit or the beginning or end of the line.
(I was also embarrassed when I presented the results and realized that my
7-digit numbers actually began at 987262 - i.e., a 6-digit number, so not all
of them were in fact 7 digits. So my logic was wrong and I missed one that I
should have found.)

Again, if you required a permanent solution, that could be adjusted for.
But for my current purposes this is good enough.

They do say that "perfect is the enemy of good enough" :)
I'm doing a one-time
analysis of several thousand records, and don't need to develop a permanent,
perfect solution. I revised the regular expression to be "good enough"
considering the data that I saw in front of me.

Many thanks for your help.

Glad to. I learn also. Thanks for the feedback.
--ron
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top