Challenging Data Cleanup Function

T

Tom Bock

I need to do some data cleanup... this appears to be a "tricky" process
though.

I have 2 columns (A and B). Any cell in column A might contain a long (or
short) string (see sample data below between ****s)


****** SAMPLE DATA *****************

Cell A1:
1.1.1. Provide and facilitate the conduct of Training 2.1.5. Enhance
consultation command and control 3.1. Enhance Alliance Interoperability
including through Standardization 4.1. Enhance the interoperability

Cell A2:
1.2 Conduct Training

Cell A3
2 Do this 2.1.5. Do that

***********************


The text value in column B is mostly short (e.g. "Coordinating all
Functional Area Specialty Education and Training.") and is of no particular
importance.


Here's what I need to achieve (column A):
- "Read" the cells in column A
- "Recognize" that the string in cell A1 contains 4 "numeric values" (1.1.1,
2.1.5, 3.1, & 4.1 -- there might be odd spacing or even tabs between these)
- Because of the 4 numeric values
-- insert 3 rows (below row 1) and copy/paste row 1 three times (so I end
up with 4 identical entries), then
- in cell A1, overwrite the long string (as listed above the ***s) with
"1.1.1"
- in cell A2, overwrite the long string (as listed above the ***s) with
"2.1.5"
- in cell A3, overwrite the long string (as listed above the ***s) with
"3.1"
- in cell A4, overwrite the long string (as listed above the ***s) with
"4.1"

However, no action is required with A5 (used to be cell reference A2 --
between the ***s) since it only contains
one numeric value ("1.2 Conduct Training").

Lastly, row 6 (old cell ref A3 -- "2 Do this 2.1.5. Do that" -- refer to
example between ***s) must be treated just like the other example, except
that only 2 numeric values should be recognized and therefore only 1
identical row must be inserted, copied/pasted, and overwritten with "2" &
"2.1.5" in A6 & A7, respectively.

Does anyone know how to tackle this problem???

Thanks, I appreciate any feedback!!!


Tom
 
H

Harlan Grove

...
...
****** SAMPLE DATA *****************

Cell A1:
1.1.1. Provide and facilitate the conduct of Training 2.1.5. Enhance
consultation command and control 3.1. Enhance Alliance Interoperability
including through Standardization 4.1. Enhance the interoperability

Cell A2:
1.2 Conduct Training

Cell A3
2 Do this 2.1.5. Do that

*********************** ...
Here's what I need to achieve (column A):
- "Read" the cells in column A
- "Recognize" that the string in cell A1 contains 4 "numeric values" (1.1.1,
2.1.5, 3.1, & 4.1 -- there might be odd spacing or even tabs between these)
- Because of the 4 numeric values
-- insert 3 rows (below row 1) and copy/paste row 1 three times (so I end
up with 4 identical entries), then
- in cell A1, overwrite the long string (as listed above the ***s) with
"1.1.1"
- in cell A2, overwrite the long string (as listed above the ***s) with
"2.1.5"
- in cell A3, overwrite the long string (as listed above the ***s) with
"3.1"
- in cell A4, overwrite the long string (as listed above the ***s) with
"4.1"

However, no action is required with A5 (used to be cell reference A2 --
between the ***s) since it only contains one numeric value ("1.2 Conduct
Training").

Lastly, row 6 (old cell ref A3 -- "2 Do this 2.1.5. Do that" -- refer to
example between ***s) must be treated just like the other example, except
that only 2 numeric values should be recognized and therefore only 1
identical row must be inserted, copied/pasted, and overwritten with "2" &
"2.1.5" in A6 & A7, respectively.

Does anyone know how to tackle this problem???

To put it mildly, this problem is not well suited to spreadsheets as you'll see
from the length and complexity of the formulas needed to do this. For my own
convenience, I'll name your original range of cells OrigEntries. I'll also
define the name Seq referring to =ROW(INDIRECT("1:1024")). Then in another
worksheet enter the following formulas.

A1 [array formula]:
=MID(T(OrigEntries),MATCH(TRUE,ISNUMBER(-MID(T(OrigEntries),Seq,1)),0),1024)

B1 [array formula]:
=MATCH(TRUE,ISNUMBER(-MID(A1,Seq,1)),0)

C1 [array formula]:
=MATCH(FALSE,ISNUMBER(FIND(MID(A1,Seq+B1-1,1),".0123456789")),0)+B1-1

D1:
=MID(A1,B1,C1-B1)

A2 [array formula]:
=IF(OR(ISNUMBER(-MID(A1,Seq+C1,1))),MID(A1,C1+1,1024),
MID(INDEX(OrigEntries,COUNTIF(B$1:B1,1)+1),MATCH(TRUE,
ISNUMBER(-MID(INDEX(OrigEntries,COUNTIF(B$1:B1,1)+1),
Seq,1)),0),1024))

Select B1:D1 and fill into B2:D2. Then select A2:D2 and fill down as far as
needed. The formulas will return #REF! and #N/A error values when the data in
OrigEntries has been exhausted. Select the col D nonerror values, copy, and
paste special as values on top of OrigEntries, then delete the worksheet that
contains the formulas above.

Note that this could be done using a command line Perl script with a single Perl
statement, specifically,

perl -pe "s/([^.0-9])((\d+\.?)+)/$1\n$2/g" original.data > cleansed.data

Use the right tool for the task. Excel is *NEVER* the right tool for *ANY* type
of text processing that breaks apart or condenses its input. It can be done in
Excel, but that proves nothing about suitability to task.
 
T

Tom Bock

Harlan:

Thank you so much for your feedback. Yes, I agree, Excel is probably not
the proper tool for this.

Anyhow, I tried to follow your procedures, but I'm not following you all the
way.

- I replaced all of your strings "OrigEntries" with "A:A".
- Since you recommended to use another worksheet, I then tried "Test!A:A"
(while "Test" is the name of the worksheet that contains the data to be
cleanup up.
- I also was not certain about the "Seq"... so far I put
"=ROW(INDIRECT("1:1024"))" instead of the "Seq"... that naturally threw an
error because of the "=" in it. I took that out but I still receive errors
in the functions.
- Also, when do I need to use the "A1(array formula]" vs. the "A2 [array
formula]"? My sample data was arbitrary. I have hundreds of rows....
some of them may contain multiple "numeric values" and other don't...
however, this could change on a case-by-case basis.

Well, even though you've put lots of work into your response, I am afraid
that I'm unable to interpret your info. Do you have any additional info
you could provide me in this matter?

Thanks,
Tom





Harlan Grove said:
...
..
****** SAMPLE DATA *****************

Cell A1:
1.1.1. Provide and facilitate the conduct of Training 2.1.5. Enhance
consultation command and control 3.1. Enhance Alliance Interoperability
including through Standardization 4.1. Enhance the interoperability

Cell A2:
1.2 Conduct Training

Cell A3
2 Do this 2.1.5. Do that

*********************** ..
Here's what I need to achieve (column A):
- "Read" the cells in column A
- "Recognize" that the string in cell A1 contains 4 "numeric values" (1.1.1,
2.1.5, 3.1, & 4.1 -- there might be odd spacing or even tabs between these)
- Because of the 4 numeric values
-- insert 3 rows (below row 1) and copy/paste row 1 three times (so I end
up with 4 identical entries), then
- in cell A1, overwrite the long string (as listed above the ***s) with
"1.1.1"
- in cell A2, overwrite the long string (as listed above the ***s) with
"2.1.5"
- in cell A3, overwrite the long string (as listed above the ***s) with
"3.1"
- in cell A4, overwrite the long string (as listed above the ***s) with
"4.1"

However, no action is required with A5 (used to be cell reference A2 --
between the ***s) since it only contains one numeric value ("1.2 Conduct
Training").

Lastly, row 6 (old cell ref A3 -- "2 Do this 2.1.5. Do that" -- refer to
example between ***s) must be treated just like the other example, except
that only 2 numeric values should be recognized and therefore only 1
identical row must be inserted, copied/pasted, and overwritten with "2" &
"2.1.5" in A6 & A7, respectively.

Does anyone know how to tackle this problem???

To put it mildly, this problem is not well suited to spreadsheets as you'll see
from the length and complexity of the formulas needed to do this. For my own
convenience, I'll name your original range of cells OrigEntries. I'll also
define the name Seq referring to =ROW(INDIRECT("1:1024")). Then in another
worksheet enter the following formulas.

A1 [array formula]:
=MID(T(OrigEntries),MATCH(TRUE,ISNUMBER(-MID(T(OrigEntries),Seq,1)),0),1024)

B1 [array formula]:
=MATCH(TRUE,ISNUMBER(-MID(A1,Seq,1)),0)

C1 [array formula]:
=MATCH(FALSE,ISNUMBER(FIND(MID(A1,Seq+B1-1,1),".0123456789")),0)+B1-1

D1:
=MID(A1,B1,C1-B1)

A2 [array formula]:
=IF(OR(ISNUMBER(-MID(A1,Seq+C1,1))),MID(A1,C1+1,1024),
MID(INDEX(OrigEntries,COUNTIF(B$1:B1,1)+1),MATCH(TRUE,
ISNUMBER(-MID(INDEX(OrigEntries,COUNTIF(B$1:B1,1)+1),
Seq,1)),0),1024))

Select B1:D1 and fill into B2:D2. Then select A2:D2 and fill down as far as
needed. The formulas will return #REF! and #N/A error values when the data in
OrigEntries has been exhausted. Select the col D nonerror values, copy, and
paste special as values on top of OrigEntries, then delete the worksheet that
contains the formulas above.

Note that this could be done using a command line Perl script with a single Perl
statement, specifically,

perl -pe "s/([^.0-9])((\d+\.?)+)/$1\n$2/g" original.data > cleansed.data

Use the right tool for the task. Excel is *NEVER* the right tool for *ANY* type
of text processing that breaks apart or condenses its input. It can be done in
Excel, but that proves nothing about suitability to task.
 
H

Harlan Grove

...
...
Anyhow, I tried to follow your procedures, but I'm not following you all the
way.

- I replaced all of your strings "OrigEntries" with "A:A".

Bad idea. Many functions can't work with entire column ranges. The formulas I
gave may be among the few exceptions that can work with entire column ranges,
but this is sufficiently dangerous that you should avoid doing so.
- Since you recommended to use another worksheet, I then tried "Test!A:A"
(while "Test" is the name of the worksheet that contains the data to be
cleanup up.
OK.

- I also was not certain about the "Seq"... so far I put
"=ROW(INDIRECT("1:1024"))" instead of the "Seq"... that naturally threw an
error because of the "=" in it. I took that out but I still receive errors
in the functions.

Use the menu command Insert > Name > Define. Enter Seq in the topmost entry
field in the Define Name dialog, then enter

=ROW(INDIRECT("1:1024"))

in the Refers To field in the dialog and click the OK button. It's likely the
formulas I gave *WON'T* work if you try replacing Seq with
ROW(INDIRECT("1:1024")) because of Excel's limit on nested function calls. If
you don't make Seq a defined name, you're on your own to figure out how to make
the formulas work (because I'm pretty sure it can't be done).
- Also, when do I need to use the "A1(array formula]" vs. the "A2 [array
formula]"? My sample data was arbitrary. I have hundreds of rows....
some of them may contain multiple "numeric values" and other don't...
however, this could change on a case-by-case basis.

All formulas marked as [array formula] *MUST* be entered as array formulas,
meaning hold down [Ctrl] and [Shift] keys before pressing [Enter].

Now, if you have some cells in your original data that contain no numerals at
all, then I'm pretty sure my formulas will choke at and after those cells. You'd
have to process the data chunk by chunk, using the formulas when they work,
identifying cells when they don't and handling those manually, then using the
formulas from the beginning starting with the next cell containing numerals. If
you were using a defined name like OrigEntries, this would be simple - just
redefine the range each time to reinitialize the formulas. Not using the defined
name, you're going to be doing a lot of manual copy and paste operations.

While I tested my approach on your sample data, I may not be the best person to
help you figure out how to use my approach. I'm pretty confident that if you
make even a few seemingly minor changes that the formulas won't work.

So I'd suggest you first try the procedure *exactly* as I outlined it. Step one
would be using the menu command Insert > Names > Define to create a defined
name referring to your data range. Step two would be creating the defined name
Seq referring to =ROW(INDIRECT("1:1024")). Then create a new, blank worksheet
and enter the formulas exactly as I wrote them in the cells specified either as
array formulas (cols A through C) or normal formulas (only col D), noting that
cell A1's formula differs from the formula in cell A2 and subsequent rows. I
know this procedure works for cells in the original data range that contain
numerals. To be honest, life's too short to explore variations on the theme.
 
T

Tom

Harlan:

Thanks for your prompt reply...

ah, I didn't know about the "defining of a range"... I'll
give it a shot tomorrow and will let you know whether or
not it worked.

Thanks again!!!

Tom

-----Original Message-----
...
...
Anyhow, I tried to follow your procedures, but I'm not following you all the
way.

- I replaced all of your strings "OrigEntries"
with "A:A".

Bad idea. Many functions can't work with entire column ranges. The formulas I
gave may be among the few exceptions that can work with entire column ranges,
but this is sufficiently dangerous that you should avoid doing so.
- Since you recommended to use another worksheet, I then tried "Test!A:A"
(while "Test" is the name of the worksheet that contains the data to be
cleanup up.
OK.

- I also was not certain about the "Seq"... so far I put
"=ROW(INDIRECT("1:1024"))" instead of the "Seq"... that naturally threw an
error because of the "=" in it. I took that out but I still receive errors
in the functions.

Use the menu command Insert > Name > Define. Enter Seq in the topmost entry
field in the Define Name dialog, then enter

=ROW(INDIRECT("1:1024"))

in the Refers To field in the dialog and click the OK button. It's likely the
formulas I gave *WON'T* work if you try replacing Seq with
ROW(INDIRECT("1:1024")) because of Excel's limit on nested function calls. If
you don't make Seq a defined name, you're on your own to figure out how to make
the formulas work (because I'm pretty sure it can't be done).
- Also, when do I need to use the "A1(array formula]" vs. the "A2 [array
formula]"? My sample data was arbitrary. I have hundreds of rows....
some of them may contain multiple "numeric values" and other don't...
however, this could change on a case-by-case basis.

All formulas marked as [array formula] *MUST* be entered as array formulas,
meaning hold down [Ctrl] and [Shift] keys before pressing [Enter].

Now, if you have some cells in your original data that contain no numerals at
all, then I'm pretty sure my formulas will choke at and after those cells. You'd
have to process the data chunk by chunk, using the formulas when they work,
identifying cells when they don't and handling those manually, then using the
formulas from the beginning starting with the next cell containing numerals. If
you were using a defined name like OrigEntries, this would be simple - just
redefine the range each time to reinitialize the
formulas. Not using the defined
 
T

Tom

Harlan:

I finally found a moment to work w/ this problem again.

Okay, I got all of the functions working now (thanks for the good
instructions). This is truly a superb!!!!

I hope you won't mind if I ask you one more question. What do the numbers
in column B & C mean (please see below)?

Cells B1:C1:
1 5

Cells B2:C2:
50 54

Cells B3:C3:
74 78



--
Thanks,
Tom


Tom said:
Harlan:

Thanks for your prompt reply...

ah, I didn't know about the "defining of a range"... I'll
give it a shot tomorrow and will let you know whether or
not it worked.

Thanks again!!!

Tom

-----Original Message-----
...
...
Anyhow, I tried to follow your procedures, but I'm not following you all the
way.

- I replaced all of your strings "OrigEntries"
with "A:A".

Bad idea. Many functions can't work with entire column ranges. The formulas I
gave may be among the few exceptions that can work with entire column ranges,
but this is sufficiently dangerous that you should avoid doing so.
- Since you recommended to use another worksheet, I then tried "Test!A:A"
(while "Test" is the name of the worksheet that contains the data to be
cleanup up.
OK.

- I also was not certain about the "Seq"... so far I put
"=ROW(INDIRECT("1:1024"))" instead of the "Seq"... that naturally threw an
error because of the "=" in it. I took that out but I still receive errors
in the functions.

Use the menu command Insert > Name > Define. Enter Seq in the topmost entry
field in the Define Name dialog, then enter

=ROW(INDIRECT("1:1024"))

in the Refers To field in the dialog and click the OK button. It's likely the
formulas I gave *WON'T* work if you try replacing Seq with
ROW(INDIRECT("1:1024")) because of Excel's limit on nested function calls. If
you don't make Seq a defined name, you're on your own to figure out how to make
the formulas work (because I'm pretty sure it can't be done).
- Also, when do I need to use the "A1(array formula]" vs. the "A2 [array
formula]"? My sample data was arbitrary. I have hundreds of rows....
some of them may contain multiple "numeric values" and other don't...
however, this could change on a case-by-case basis.

All formulas marked as [array formula] *MUST* be entered as array formulas,
meaning hold down [Ctrl] and [Shift] keys before pressing [Enter].

Now, if you have some cells in your original data that contain no numerals at
all, then I'm pretty sure my formulas will choke at and after those cells. You'd
have to process the data chunk by chunk, using the formulas when they work,
identifying cells when they don't and handling those manually, then using the
formulas from the beginning starting with the next cell containing numerals. If
you were using a defined name like OrigEntries, this would be simple - just
redefine the range each time to reinitialize the
formulas. Not using the defined
name, you're going to be doing a lot of manual copy and paste operations.

While I tested my approach on your sample data, I may not be the best person to
help you figure out how to use my approach. I'm pretty confident that if you
make even a few seemingly minor changes that the formulas won't work.

So I'd suggest you first try the procedure *exactly* as I outlined it. Step one
would be using the menu command Insert > Names > Define to create a defined
name referring to your data range. Step two would be creating the defined name
Seq referring to =ROW(INDIRECT("1:1024")). Then create a new, blank worksheet
and enter the formulas exactly as I wrote them in the cells specified either as
array formulas (cols A through C) or normal formulas (only col D), noting that
cell A1's formula differs from the formula in cell A2 and subsequent rows. I
know this procedure works for cells in the original data range that contain
numerals. To be honest, life's too short to explore variations on the theme.
 
H

Harlan Grove

...
...
I hope you won't mind if I ask you one more question. What do the numbers
in column B & C mean (please see below)?

Cells B1:C1:
1 5

Cells B2:C2:
50 54

Cells B3:C3:
74 78

They're the beginning and ending character positions of the 'numeric' substrings
of the strings in col A.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top