Advanced RegEx (pattern clustering)


skavan

Hi,

I'm just wrapping my head around regex and am pretty sure it can do the
task at hand - but it's too complex for my brain to process -- so am
throwing it out there for you experts to comment on. I am posing two
questions. In the interests of space and focus, I'll post a separate
thread for the other use case (clustering).

Use Case 1:
Filenames contain a TrackNumber (or not).

Examples:
01 - Calexico - Sonic Wind (instrumental mix).mp3
Gustav_Mahler-Symphony#10-Slatkin-St_Louis-1-Adagio.mp3
Carl Orff - Carmina Burana - 08 - Uf dem anger- Chramer, gip .mp3
01-linkin_park_-_foreword-mp3.mp3
[03] (Wish I Could Fly Like) Superman.mp3

Other examples might be: (XX), XX-, -XX-, - XX - , - XX,-XX
Where XX is a one or two digit number.

Specific examples of things that should not be captured:
Jethro Tull - 1999 - Live At House Of Blues - 13 - Hunting Girl.mp3

The 1999 is not a track number, but the 13 is. A rule that the number
should be at most 2 digits should catch this one.

Prince - Northrop - 06-13-2000 - 33 - Kiss.mp3
The date should not be captured, but the 33 should.

UB40 - 08 - Sing Our Own Song.mp3
The 40 shouldn't be captured, but the 08 should.

Blink 182 - Take Off Your Pants And Jacket - 06 - The Rock Show.mp3
The 182 should not be captured, but the 06 should.

One more case:
08_Smokie_Living Near The Edge.mp3

Phew...sorry for the length of the post --- can one put together a
regex to tackle this problem?

If so --- I will be both amazed and grateful for your suggestions.

Thanks.

P.S. Part 2 of this will deal with clustering...
 
Well, you're starting out by making the most common mistake that people make
who use regular expressions. Instead of giving us a set of rules, you give
us a bunch of examples. The problem with this is that the examples only
*hint* at the underlying rules, and do not spell them out. One could derive
several different sets of rules from the examples you've given.

In case you don't understand, I'll give you an example. See if you can tweak
the example into the exact rules for your regular expression:

1. The string or set of strings will (will not?) consist entirely of file
names.
2. A "Track Number" (not?) always consists of exactly 2 digits.
3. These 2 digits may appear anywhere in the file name, except for the
extension.
4. These 2 digits will (not?) always be delimited by punctuation marks.
5. If at the beginning or end of the file name, only 1 (possibly more than
1?) mark is used.
6. The set of possible punctuation marks consists of: [], -, _, ()
7. The punctuation marks will always immediately (no spaces)
precede and/or follow the "Track Number" with one exception.
8. Hyphens will always have a single (or more?) space between the hyphen and
the "Track Number" and between the hyphen and the rest of the file name.
9. There will never be any other substrings in the strings that follow
these rules.

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Numbskull

Hard work is a medication for which
there is no placebo.

 
Good point. In fact, writing the rules helps really clarify the
problem. Here goes:

1. The set of strings will consist entirely of filenames.
2. A <Track Number> will consist of 1 *OR* 2 digits in the range of 1
to 36.
3. A <Track Number> with a value of less than 10 may be preceded by a
zero.
4. <Track Number> cannot be guaranteed to be the only digits in the
string.
5. <Track Number> will be preceded by one of the following: <SPACE>, {,
[, <, (, _, -
6. The exception to #5 is if <Track Number> is at the start of the
string.
7. If <Track Number> is preceded by an opening punctuation character: (
< { [, then <Track Number> will be followed by the corresponding
closing punctuation character.
8. If <Track Number> is not preceded by an opening punctuation character,
it will be followed by either: <SPACE>, _, - or an opening punctuation
character (for the next field in the string).
9. There may be additional spaces before and after the delimiters
specified in 8 and before but not after the opening punctuation delimiters
and after but not before the closing punctuation characters.
10. There will never be any other substrings in the strings that
follow these rules.
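
Rules 2 and 3 on their own (one or two digits, value 1 to 36, optional
leading zero below 10) can be sketched as a small check. This is only an
illustration in Python; the names TRACK_VALUE and is_track_value are made
up for the example, and it says nothing yet about the delimiters in rules
5 to 9.

```python
import re

# Rules 2 and 3: one or two digits, value 1..36, optional leading zero below 10.
# TRACK_VALUE and is_track_value are illustrative names, not from the thread.
TRACK_VALUE = re.compile(r"0?[1-9]|[12]\d|3[0-6]")

def is_track_value(text):
    """Return True if text looks like a track number under rules 2 and 3."""
    return TRACK_VALUE.fullmatch(text) is not None

for candidate in ["1", "08", "13", "36", "00", "40", "182", "1999"]:
    print(candidate, is_track_value(candidate))
# "1", "08", "13" and "36" pass; "00", "40", "182" and "1999" do not.
```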

Wow - that seems to really specify the problem. I'm feeling terrific
about it. Except for one tiny, teeny, thing.
I'm still STUCK!!!! h-e-l-p. Eternal thanks to someone who can
translate 1-10 into regex or otherwise.

Thanks.

s.
 
I'm glad I was able to help you with your analysis. Problem-solving is a
really important skill to have as a programmer, and the ability to spell out
business rules is the most important key to writing good code. As you can
see, this involves a process of breaking down the requirements into smaller
and smaller bites, until you have atomic business rules.

I was able to construct a Regular Expression based upon your business rules.
However, there is a problem, and I'm not sure it can be solved. First,
here's the regular expression:

(?m)(?<=[\{\(\[\<_]|\-\s|^)\d{1,2}(?=[\}\)\_\>\]]|\s\-|$)

In English, this means:

1. Multiline mode: caret and dollar match at the start and end of each line.
2. A match is 1 or 2 digits.
3. The digits must be preceded by one of the following:
a. One of the following characters: { [ ( _ <
b. A hyphen followed by a space.
c. Be at the beginning of the line.
4. The digits must be followed by one of the following:
a. One of the following characters: } ] ) _ >
b. A space followed by a hyphen
c. Be at the end of the line.
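
For anyone who wants to try the expression outside .NET, here is a rough
Python port. Python's re module only accepts fixed-width look-behinds, so
the single look-behind is split into one look-behind per alternative; that
rewrite and the names STRICT and find_tracks are assumptions of this
sketch, not part of the original expression.

```python
import re

# Port of the expression above. The look-behind alternatives have different
# widths, so each one gets its own fixed-width look-behind here.
STRICT = re.compile(
    r"(?m)"
    r"(?:(?<=[{(\[<_])|(?<=-\s)|^)"   # opening bracket or "_", "- ", or line start
    r"\d{1,2}"                        # one or two digits
    r"(?=[})_>\]]|\s-|$)"             # closing bracket or "_", " -", or line end
)

def find_tracks(name):
    """Return every candidate track number found in a file name."""
    return STRICT.findall(name)

print(find_tracks("[03] (Wish I Could Fly Like) Superman.mp3"))   # ['03']
print(find_tracks("UB40 - 08 - Sing Our Own Song.mp3"))           # ['08']
print(find_tracks("08_Smokie_Living Near The Edge.mp3"))          # ['08']
```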

Here's the problem with it. Consider these 3 examples you included:

Prince - Northrop - 06-13-2000 - 33 - Kiss.mp3
01-linkin_park_-_foreword-mp3.mp3
Gustav_Mahler-Symphony#10-Slatkin-St_Louis-1-Adagio.mp3

The problem is, in case you can't see it, what to do about digits that are
preceded or followed by a hyphen *without* a space? If you allow it, you
pick up "-13-" in the date. If you disallow it, you don't pick up the "01-"
in the second example, or the "-1-" in the third example.
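
The trade-off can be seen by running both variants side by side. The
sketch below is in Python rather than .NET, with the look-behind split into
fixed-width pieces because Python requires that; RELAXED is a hypothetical
variant in which the space next to a hyphen is optional.

```python
import re

# "- " and " -" required around hyphens (the expression above, ported to Python).
STRICT = r"(?m)(?:(?<=[{(\[<_])|(?<=-\s)|^)\d{1,2}(?=[})_>\]]|\s-|$)"
# Hypothetical relaxed variant: the space next to a hyphen becomes optional.
RELAXED = r"(?m)(?:(?<=[{(\[<_])|(?<=-\s)|(?<=-)|^)\d{1,2}(?=[})_>\]]|\s?-|$)"

names = [
    "Prince - Northrop - 06-13-2000 - 33 - Kiss.mp3",
    "01-linkin_park_-_foreword-mp3.mp3",
    "Gustav_Mahler-Symphony#10-Slatkin-St_Louis-1-Adagio.mp3",
]

for name in names:
    print(re.findall(STRICT, name), re.findall(RELAXED, name))
# ['33'] ['06', '13', '33']
# [] ['01']
# [] ['1']
```

The strict form skips the date but misses "01-" and "-1-"; the relaxed
form finds those two but also returns the day and month of the date.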

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Numbskull

Hard work is a medication for which
there is no placebo.

 
Since I can't think of any common strings that would have this effect
OTHER than date formats, the simple approach would be to eliminate date
formats before a final scan using the regex above.

In the simple case, a date format in this context will, I think, always
have the following rule:

1. one or two digits (where the first digit may be 0) preceded by one or
more digits and a '-'
AND/OR
2. one or two digits (where the first digit may be 0) followed by a '-'
and then one or more digits

This should capture the middle digits and then expand to ignore the
string of digits appropriately.

So:
a) Do you think this would work?
b) Is it a preliminary regex or can it be pre-pended to the string
above?
c) What does it look like?
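
A rough sketch of what a two-pass version might look like, in Python. The
definition of "date-like" used here (two or more digit groups joined
directly by hyphens) is only an assumption, as are the names DATE_LIKE,
RELAXED and find_tracks_two_pass. The scan in the second pass is the regex
above with the space next to a hyphen made optional, and with the
look-behind split up because Python needs fixed-width look-behinds.

```python
import re

# Assumption: a date-like run is two or more digit groups joined directly by
# hyphens, e.g. 06-13-2000. This would also swallow something like "1-02".
DATE_LIKE = re.compile(r"\d+(?:-\d+)+")
RELAXED = re.compile(
    r"(?m)(?:(?<=[{(\[<_])|(?<=-\s)|(?<=-)|^)\d{1,2}(?=[})_>\]]|\s?-|$)"
)

def find_tracks_two_pass(name):
    # Pass 1: replace date-like runs with "~", a placeholder that none of the
    # delimiter rules recognise. Pass 2: scan what is left.
    cleaned = DATE_LIKE.sub("~", name)
    return RELAXED.findall(cleaned)

print(find_tracks_two_pass("Prince - Northrop - 06-13-2000 - 33 - Kiss.mp3"))  # ['33']
print(find_tracks_two_pass("01-linkin_park_-_foreword-mp3.mp3"))               # ['01']
```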

BTW - Your thought process is pretty good for a "Professional
Numbskull" :).

s.
 
Here's a mod that may work for you:

(?m)(?!<\d+\s?\-\s?)(?<=[\{\(\[\<_]|\-\s?|^)\d{1,2}(?=[\}\)\_\>\]]|\s?\-|$)(?!\s?\-\s?\d+)

This is identical to the first, with a couple of changes and additions.
First, the spaces with the hyphens are now optional (\s? means 0 or 1
space). Second, I added a negative look-behind to the beginning, and a
negative look-ahead at the end. The negative look-behind states that the
match cannot be preceded by 1 or more digits followed by 0 or 1 space and a
hyphen followed by 0 or 1 space. The negative look-ahead states that the
match cannot be followed by 0 or 1 space followed by a hyphen followed by 0
or 1 space followed by 1 or more digits.

Of course, you realize that there are no hard and fast rules for this sort
of thing. Anyone can give any name to an mp3 file. But it works for all the
examples you gave.
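
The claim can be checked against the ten file names from the thread with a
Python port of the modified expression. As before, the look-behind is
split into fixed-width pieces because Python requires it, and the names
MODIFIED and EXPECTED are made up for this sketch. The expected values are
the track numbers named in the posts above.

```python
import re

MODIFIED = re.compile(
    r"(?m)"
    r"(?!<\d+\s?-\s?)"                        # copied exactly as posted
    r"(?:(?<=[{(\[<_])|(?<=-\s)|(?<=-)|^)"    # "-\s?" split into two look-behinds
    r"\d{1,2}"
    r"(?=[})_>\]]|\s?-|$)"
    r"(?!\s?-\s?\d+)"                         # not followed by hyphen + digits
)

EXPECTED = {
    "01 - Calexico - Sonic Wind (instrumental mix).mp3": "01",
    "Gustav_Mahler-Symphony#10-Slatkin-St_Louis-1-Adagio.mp3": "1",
    "Carl Orff - Carmina Burana - 08 - Uf dem anger- Chramer, gip .mp3": "08",
    "01-linkin_park_-_foreword-mp3.mp3": "01",
    "[03] (Wish I Could Fly Like) Superman.mp3": "03",
    "Jethro Tull - 1999 - Live At House Of Blues - 13 - Hunting Girl.mp3": "13",
    "Prince - Northrop - 06-13-2000 - 33 - Kiss.mp3": "33",
    "UB40 - 08 - Sing Our Own Song.mp3": "08",
    "Blink 182 - Take Off Your Pants And Jacket - 06 - The Rock Show.mp3": "06",
    "08_Smokie_Living Near The Edge.mp3": "08",
}

for name, track in EXPECTED.items():
    assert MODIFIED.findall(name) == [track], name
print("all", len(EXPECTED), "examples matched")
```

One thing worth noting from the port: as posted, (?!<...) parses as a
negative look-ahead for a literal "<" rather than a look-behind, so it has
no effect on these names, and it is the trailing look-ahead that rejects
the 06 and 13 of the date. Written as a true look-behind, (?<!\d+\s?\-\s?),
it would also reject the 33 in the Prince example, because that number
follows "2000 - ".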

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Numbskull

Hard work is a medication for which
there is no placebo.
 
