Text manipulation

Harlan Grove · Dec 29, 2005

Bob Phillips wrote...
....

If nIndex = 1 Then
Cells(i, "A").Value = Right(Cells(i, "A").Value, _
Len(Cells(i, "A").Value) - 1)
Else
Cells(i, "A").Value = Left(Cells(i, "A").Value, nIndex - 1) & _
Right(Cells(i, "A").Value, Len(Cells(i, "A").Value) - nIndex)
End If

....

Could simplify: Right(x, Len(x) - y) == Mid(x, y + 1)

Then again, the whole If block could be replaced by

Cells(i, "A").Value = _
Application.WorksheetFunction.Replace(Cells(i, "A").Value, nIndex,
1, "")

Harlan Grove · Dec 29, 2005

paulinoluciano wrote...
....

AASSASDKASASDASFAFSASASADKASASAFPKQREWEAQEOKSPADAOEKOQPPDAOPSKAEPQ

This sequence must be put in cell A2. ....
Rules:
a) Fragment the sequence before K but not always (you could have lost cut).
b) Sequence is not cut if K is found before FP

Results:

ASSASDKASASDASFAFSASASADKASASAFPKQREWEAQEOKSPADAOEKOQPPDAOPSKAEPQ

0 lost cut = Cutting the sequence all the time in which K is present
(The subsequences of this process should be put in B column:
AASSASDK
ASASDASFAFSASASADK
ASASAFPKQREWEAQEOK
SPADAOEK
OQPPDAOPSK
AEPQ

You could use formulas.

B2:
=LEFT($A$2,FIND("K",SUBSTITUTE($A$2,"FPK","###")))

B3:
=REPLACE(LEFT($A$2,FIND("K",SUBSTITUTE($A$2,"FPK","###")&"K",
SUMPRODUCT(LEN(B$2:B2))+1)),1,SUMPRODUCT(LEN(B$2:B2)),"")

Fill B3 down as needed. Filling it into B4:B8 (one cell more than
needed) gives B2:B8

AASSASDK
ASASDASFAFSASASADK
ASASAFPKQREWEAQEOK
SPADAOEK
OQPPDAOPSK
AEPQ

1 lost cut = Cutting the sequence after the first K present in the
sequence (The subsequences of this process should be put in C column::
AASSASDKASASDASFAFSASASADK
ASASAFPKQREWEAQEOKSPADAOEK
OQPPDAOPSKAEPQ
AASSASDKASASDASFAFSASASADKASASAFPKQREWEAQEOK
SPADAOEKOQPPDAOPSKAEPQ

You have all the information you need for this in column B.

C2:
=INDEX(B$2:B$99,2*ROWS(C$2:C2)-1)&INDEX(B$2:B$99,2*ROWS(C$2:C2))

Fill C2 down as needed. Filling it into C3:C5 (one more than needed)
gives C2:C5

AASSASDKASASDASFAFSASASADK
ASASAFPKQREWEAQEOKSPADAOEK
OQPPDAOPSKAEPQ

2 lost cut = = Cutting the sequence after the second K (just for the
third and following) present in the sequence (The subsequences of this
process should be put in D column:
AASSASDKASASDASFAFSASASADKASASAFPKQREWEAQEOK
SPADAOEKOQPPDAOPSKAEPQ

D2:
=INDEX(B$2:B$99,3*ROWS(D$2

2)-2)&INDEX(B$2:B$99,3*ROWS(D$2

2)-1)
&INDEX(B$2:B$99,3*ROWS(D$2

2))

Fill D2 down as needed. Filling it into D3

4 (one more than needed)
gives D2

4

AASSASDKASASDASFAFSASASADKASASAFPKQREWEAQEOK
SPADAOEKOQPPDAOPSKAEPQ

Repair that in some cases I need lost cuts in which you cut after 1, 2,
3, 4,... specific characters.
I have to specify such rules in some place of the sheet containing the
precursor text.
The rules are:

Cut after "XXX" (In this example I have put K but the some cell in the
sheet must contain what is the character in which the sequence will be
fragmented). In some cases it could be more than only one character
(e.g. K and R; nor necessarily together)
Cut before "XXX" (The cut may be after like previous example or before
the character)

Never before "XXX" (In some cases I have prohibitive situations; e.g.
It must not cut a sequence in K if K is preceeded by P or by RP)

The RP preceding K is redundant if P alone preceding K indicates a
prohibited situation. You'd only need to check for PK.

Never after "XXX" (Same for after)

Number of times that the character could be missed prior cut "XXX" (In
some place of the sheet I must explicit how many characters could be
"lost" prior cut (see example).

Generalizing the formulas above with the character to match in a cell
named CC and the prohibited sequence (in this case FPK) in a cell named
PS,

B2:
=LEFT($A$2,FIND(CC,SUBSTITUTE($A$2,PS,REPT("#",LEN(PS)))))

B3:
=REPLACE(LEFT($A$2,FIND(CC,SUBSTITUTE($A$2,PS,REPT("#",LEN(PS)))&CC,
SUMPRODUCT(LEN(B$2:B2))+1)),1,SUMPRODUCT(LEN(B$2:B2)),"")

The column C, D, etc. formulas wouldn't need to change.

If you have multiple prohibited sequences, then regular expressions
would be MUCH BETTER tools for doing this. Symbolic processing like
this is reducible to text processing, but Excel provides poor built-in
tools for text processing, but since it was meant to calculate numbers
this shouldn't be surprising.

paulinoluciano · Dec 29, 2005

Dear Harlan Grove,
Thank you very much for your attention. I have tried apply your
formulas but my excel related some problems. The message is that: "The
formula you typed contains an error". In addition, do you think that
would it be possible to perform such formulas by using VBA?
Luciano

Harlan Grove · Dec 29, 2005

Ron Rosenfeld wrote...
....

For the "0 lost cuts"

B1: =REGEX.MID(seq,"(\w+?([^FP]K|$)){"&COLUMN()-1&"}",ROW())

ROW() resolves to a '1' which means take the 'first' sequence that matches the
pattern. As you copy/drag the formula down, ROW() will resolve to '2', '3',
etc. which means match the 2nd, 3rd, etc sequence that matches the pattern.

The basic pattern is defined by "(\w+?([^FP]K|$)){" which means look for a
sequence of letters that ends with a K that is not preceded by an FP, or that
is at the end of the string.

....

Wrong. [^FP] means any characters other than F or P, not the sequence
FP. While your formula may have produced the OP's expected results, if
there were an "FK" in the sequence, it wouldn't have. You need to use a
negative look-ahead asertion.

B2:
=REGEX.MID($A$2,"([^K]?|.*?((?!FP)..))K",ROWS(B$2:B2))

Note that if 0, 1, 2, etc. cuts are always wanted, the 1, 2, etc. cuts
could all be generated from the 0 cuts by concatenating each group of
2, 3, etc. 0-cut items, respectively.

If there could be multiple prohibited sequences, e.g., never cut before
FPK or DK (note the different lengths), it actually gets simpler if you
use an inner REGEX.SUBSTITUTE call to mask the cut character (K) ending
prohibited sequences, then unmask it after the REGEX.MID calls.

B2:
=SUBSTITUTE(REGEX.MID(REGEX.SUBSTITUTE($A$2,"(FP|D)K","[1]_"),
".*?K",ROWS(B$10:B10)),"_","K")

If the cut character were named CC, the list of prohibited sequences -
not including the ending cut character - were named PS, then this could
be rewritten as

B2:
=SUBSTITUTE(REGEX.MID(REGEX.SUBSTITUTE($A$2,"("&MCONCAT(PS,"|")&")"&CC,
"[1]#"),".*?"&CC,ROWS(B$10:B10)),"#",CC)

making use if MOREFUNC.XLL's MCONCAT function as well.

If you REALLY want to learn regular expressions, read
comp.lang.perl.misc, comp.editors or comp.unix.shell. Then when you
THINK you know them, reply to threads in those ngs. You'll find out
VERY QUICKLY whether you know them or not.

paulinoluciano · Dec 30, 2005

Thank you very much Harlan Grove!
In fact, I just realize that it was just a language problem. My excel
has formula in portuguese; different from yours. I`m now trying put
your formula to run as VBA codes but in this case I think it will be
necessary to define each variable because some people will use thes
worksheet will not have morefunc.xll.
Luciano

paulinoluciano · Dec 30, 2005

paulinoluciano said:
Thank you very much Harlan Grove!
In fact, I just realize that it was just a language problem. My excel
has formula in portuguese; different from yours. I`m now trying put
your formula to run as VBA codes but in this case I think it will be
necessary to define each variable because some people will use thes
worksheet will not have morefunc.xll.
Luciano

Ron Rosenfeld · Dec 30, 2005

Ron Rosenfeld wrote...
...

For the "0 lost cuts"

B1: =REGEX.MID(seq,"(\w+?([^FP]K|$)){"&COLUMN()-1&"}",ROW())

ROW() resolves to a '1' which means take the 'first' sequence that matches the
pattern. As you copy/drag the formula down, ROW() will resolve to '2', '3',
etc. which means match the 2nd, 3rd, etc sequence that matches the pattern.

The basic pattern is defined by "(\w+?([^FP]K|$)){" which means look for a
sequence of letters that ends with a K that is not preceded by an FP, or that
is at the end of the string.

Click to expand...

...

Wrong. [^FP] means any characters other than F or P, not the sequence
FP. While your formula may have produced the OP's expected results, if
there were an "FK" in the sequence, it wouldn't have. You need to use a
negative look-ahead asertion.

B2:
=REGEX.MID($A$2,"([^K]?|.*?((?!FP)..))K",ROWS(B$2:B2))

Note that if 0, 1, 2, etc. cuts are always wanted, the 1, 2, etc. cuts
could all be generated from the 0 cuts by concatenating each group of
2, 3, etc. 0-cut items, respectively.

If there could be multiple prohibited sequences, e.g., never cut before
FPK or DK (note the different lengths), it actually gets simpler if you
use an inner REGEX.SUBSTITUTE call to mask the cut character (K) ending
prohibited sequences, then unmask it after the REGEX.MID calls.

B2:
=SUBSTITUTE(REGEX.MID(REGEX.SUBSTITUTE($A$2,"(FP|D)K","[1]_"),
".*?K",ROWS(B$10:B10)),"_","K")

If the cut character were named CC, the list of prohibited sequences -
not including the ending cut character - were named PS, then this could
be rewritten as

B2:
=SUBSTITUTE(REGEX.MID(REGEX.SUBSTITUTE($A$2,"("&MCONCAT(PS,"|")&")"&CC,
"[1]#"),".*?"&CC,ROWS(B$10:B10)),"#",CC)

making use if MOREFUNC.XLL's MCONCAT function as well.

If you REALLY want to learn regular expressions, read
comp.lang.perl.misc, comp.editors or comp.unix.shell. Then when you
THINK you know them, reply to threads in those ngs. You'll find out
VERY QUICKLY whether you know them or not.

Thank you for pointing that out. And also for the references to those other
NG's.

--ron

Ron Rosenfeld · Dec 30, 2005

B2:
=REGEX.MID($A$2,"([^K]?|.*?((?!FP)..))K",ROWS(B$2:B2))

By the way, I think in keeping with the OP's specifications, this should be:

=REGEX.MID(seq,"([^K]?|.*?((?!FP)..))(K|$)",ROWS(B$2:B2))

I think he always wants the end of the string, even if it doesn't end with a
'K'.

--ron

Harlan Grove · Dec 30, 2005

Ron Rosenfeld wrote...
....

By the way, I think in keeping with the OP's specifications, this should be:

=REGEX.MID(seq,"([^K]?|.*?((?!FP)..))(K|$)",ROWS(B$2:B2))

I think he always wants the end of the string, even if it doesn't end with a
'K'.

Good point, but it begs the question whether any symbols after the last
K would be considered valid data.

Ron Rosenfeld · Dec 30, 2005

Ron Rosenfeld wrote...
...

By the way, I think in keeping with the OP's specifications, this should be:

=REGEX.MID(seq,"([^K]?|.*?((?!FP)..))(K|$)",ROWS(B$2:B2))

I think he always wants the end of the string, even if it doesn't end with a
'K'.

Click to expand...

Good point, but it begs the question whether any symbols after the last
K would be considered valid data.

It's been a long time since I was fragmenting amino acid sequences ...

--ron

Ron Rosenfeld · Jan 1, 2006

Wrong. [^FP] means any characters other than F or P, not the sequence
FP. While your formula may have produced the OP's expected results, if
there were an "FK" in the sequence, it wouldn't have. You need to use a
negative look-ahead asertion.

B2:
=REGEX.MID($A$2,"([^K]?|.*?((?!FP)..))K",ROWS(B$2:B2))

Harlan,

How about a negative look-behind assertion?

=REGEX.MID($A$1,".*?(?<!FP)(K|$)",ROWS($B$2:B2))

--ron

Harlan Grove · Jan 2, 2006

Ron Rosenfeld said:
How about a negative look-behind assertion?

=REGEX.MID($A$1,".*?(?<!FP)(K|$)",ROWS($B$2:B2))

I keep forgetting Longre uses PCRE rather than VBScript regexp syntax. Too
many regexp varieties.

That should work for one literal substring in the assertion, but multiple
options, e.g., ".+?(?<!(FP|XYZ|D))K" [note: it's more efficient to append a
K to A1 and use a simple K at the end of the regexp than to use (K|$) - to
me at least that's a clearer indicator that trailing symbols after the last
K should be included] could cause backtracking problems. Assertions with
alternation subexpressions with closures (not the case here, but generally)
can become a big PITA. If there were multiple prohibited sequences, it may
be more efficient to mask immediately following cut characters and replace
them in the results. This is the regexp analog to using ancillary cells for
intermediate calculations rather than single huge formulas in spreadsheets.
Just as 2 or 3 formulas may recalc more quickly than a single equivalent
formula, 2 or 3 simpler regexp operations may recalc more quickly than a
single, more complex one. Assertions may work well in this case, but I tend
to avoid them when possible myself.

Ron Rosenfeld · Jan 3, 2006

Ron Rosenfeld said:
Ron Rosenfeld said:

How about a negative look-behind assertion?

=REGEX.MID($A$1,".*?(?<!FP)(K|$)",ROWS($B$2:B2))

Click to expand...

I keep forgetting Longre uses PCRE rather than VBScript regexp syntax. Too
many regexp varieties.

That should work for one literal substring in the assertion, but multiple
options, e.g., ".+?(?<!(FP|XYZ|D))K" [note: it's more efficient to append a
K to A1 and use a simple K at the end of the regexp than to use (K|$) - to
me at least that's a clearer indicator that trailing symbols after the last
K should be included] could cause backtracking problems. Assertions with
alternation subexpressions with closures (not the case here, but generally)
can become a big PITA. If there were multiple prohibited sequences, it may
be more efficient to mask immediately following cut characters and replace
them in the results. This is the regexp analog to using ancillary cells for
intermediate calculations rather than single huge formulas in spreadsheets.
Just as 2 or 3 formulas may recalc more quickly than a single equivalent
formula, 2 or 3 simpler regexp operations may recalc more quickly than a
single, more complex one. Assertions may work well in this case, but I tend
to avoid them when possible myself.

I will need to keep that in mind. Clearly I have insufficient experience with
regular expressions to have run into some of the issues you have mentioned.

I will need to think more about the backtracking issues in order to better
understand what you write. As well as the pros and cons of trying to avoid
assertions.

I have been trying to work out some of the questions posed on the NG's you
referred me to. Educational. Thanks.

Best wishes and Happy New Year,

--ron
--ron

Harlan Grove · Jan 3, 2006

Ron Rosenfeld wrote...
....

I will need to keep that in mind. Clearly I have insufficient experience with
regular expressions to have run into some of the issues you have mentioned.

I will need to think more about the backtracking issues in order to better
understand what you write. As well as the pros and cons of trying to avoid
assertions.

....

If you want to get a real grounding in regular expressions, read
Jeffrey Friedl's book 'Mastering Regular Expressions, 2nd Ed.', ISBN
0-596-00289-0. Lots of discussion about backtracking and optimizing
regular expessions.

Ron Rosenfeld · Jan 3, 2006

Ron Rosenfeld wrote...
...
...

If you want to get a real grounding in regular expressions, read
Jeffrey Friedl's book 'Mastering Regular Expressions, 2nd Ed.', ISBN
0-596-00289-0. Lots of discussion about backtracking and optimizing
regular expessions.

Thanks. Just ordered a copy from Amazon.

--ron

paulinoluciano · Jan 10, 2006

Thank you Harlan Grove!
In fact, this solve my first "problem" related to this kind of text
manipulation. However, now I have to specify in such VBA code that the
sequence could be cut after or before two or more letters
simultaneously.
Example:

The sequence is

IADASFDTYEREPWQNMSDFGHKEASADSASSASADRASERAS

cut after K or R

0 lost:
IADASFDTYER
EPWQNMSDFGHK
EASADSASSASADR
SER
AS

1 lost cut:
IADASFDTYEREPWQNMSDFGHK
EPWQNMSDFGHKEASADSASSASADR
EASADSASSASADRSER
SERAS
......

paulinoluciano · Jan 12, 2006

paulinoluciano said:
Thank you Harlan Grove!
In fact, this solve my first "problem" related to this kind of text
manipulation. However, now I have to specify in such VBA code that the
sequence could be cut after or before two or more letters
simultaneously.
Example:

The sequence is

IADASFDTYEREPWQNMSDFGHKEASADSASSASADRASERAS

cut after K or R

0 lost:
IADASFDTYER
EPWQNMSDFGHK
EASADSASSASADR
SER
AS

1 lost cut:
IADASFDTYEREPWQNMSDFGHK
EPWQNMSDFGHKEASADSASSASADR
EASADSASSASADRSER
SERAS
.....