Ethan/Peter,
Firstly thank you both for your comments. Peter, I am familiar with
Smith-Waterman. The implementation that you have indicated (Java I
think) is for Gotoh's improvement. I have seen this before and found
it a bit complex. With that in mind could you have a look at the
following URL:
http://mlab.cb.k.u-tokyo.ac.jp/~moris/LSGSP/Chapter6_ApproximateStringMatching.java
There are several algorithms here (including smith-Waterman-Gotoh)
and I prefer this implementation. I think it uses Affile Gap Penalties
(cf.
http://en.wikipedia.org/wiki/Gap_penalty)
As for doing a callaboration together, this takes me to my next
point. Peter, thank you for you comments. It's quite a response. I am
aware of the danger here - the problem is exponential, this is true.
However, the website:
http://www.avatar.se/molbioinfo2001/dynprog/dynamic.html
(particularly the bit at the bottom concerning "alternative
solutions") leads me to believe that it may be possible - in the sense
of working at the traceback level and using the fact that the
solutions must never exceed the maximum value (smith waterman) or the
bottom right value (needleman-wunsch)
However I do not have a nice metric to offer to rank any solutions
that may be obtained. This still eludes me.
I am particularly interested in solutions that minimise any
introduction of a gap in the corpus text (i.e. the text that you are
searching through). This condition could help to reduce the sheer
number of possible permutations, I think...
I would like to offer one more thing. This is slightly off topic but
I would like to offer it tentatively. I obtained the optimal solution
and compared and contrasted it with the corpus in an effort to extract
a regular expression. I then use this regular expression to search for
similar patterns. Something like this:
CORPUS: ZYGZBCZ
PATTERN: ZYY-BCZ
Hence I extract a regEx like: "ZY[GY]?BCZ" and see if patterns like
this exist elsewhere. Its off topic and a bit of a dodge to be quite
honest but I tentatively offer it.
Al.