Advanced RegEx (the cluster problem).

S

skavan

Use Case:
We have music files that describe, in their filename, attributes of the
music.
We do not know a general pattern that applies to all filenames -- but
we do know that filenames that are clustered together (by for example
directory) will, most likely, have the same filename pattern.

Here is an example:
10,000 Maniacs - MTV Unplugged - 01 - These are Days.mp3

on its own, we can deduce little, except perhaps that the 01 is a
tracknumber (see my previous post). But if we know that all filenames
in the same directory, belong to the same album, but will have
different titles, then maybe, just maybe we can xform this filename
into its component parts.

So, if we have in our possession say, 2 more filenames:
10,000 Maniacs - MTV Unplugged - 02 - Eat for Two.mp3
10,000 Maniacs - MTV Unplugged - 03 - Candy Everybody Wants.mp3

Then we should be able to parse all the filenames into:
<var1>,<var2>,<tracknum>,<title>

Even if we don't know whether 10,000 Maniacs or MTV Unplugged is the
name of the Album or the Artists (this bit can be deduced by other
heuristics).

Now, it may be that a cluster of filenames has fewer properties:
4 Non Blondes - Train.mp3
4 Non Blondes - No Place Like Home.mp3

In this case the 4 is part of the Artist name which can be deduced,
since two tracks wouldn't have the same tracknumber.

As more example clusters:
01 What's the Matter Here.mp3
02 Hey Jack Kerouac.mp3

50 Cent - Back Down.mp3
50 Cent - Gotta Get.mp3

10 Just Like Anyone Aimee Mann Bachelor No. 2 Rock.mp3
09 Driving Sideways Aimee Mann Bachelor No. 2 Rock.mp3

This is doing my head in. I just can't figure out how to attack the
problem - even though my instincts say Regex should be the
tool....H-E-L-P!! Any ideas anyone?
 
L

Lucian Wischik

skavan said:
This is doing my head in. I just can't figure out how to attack the
problem - even though my instincts say Regex should be the
tool....H-E-L-P!! Any ideas anyone?

My instinct is strongly not to use regex. Instead use more generic
clustering techniques. That way you'll also be able to add extra
data-sources (e.g. MP3 tags like genre, volume level, statistical
analysis of music) to get even more useful clustering. Also you'll be
more robust against other naming schemes the users have come up with
but you haven't thought of in advance.

(I'm guessing that you're doing this in order to present a better UI
to users for them to navigate their music collection).

PS. I've never coded any kind of clustering analysis. All I know about
it is seeing talks given by other researchers. So take what I say with
a pinch of salt.
 
G

Guest

What a great problem!

I'm going to work from the assumption that we're working only within a directory - and if this doesn't naturally scale up to include directory names you'll say so and we can take another look...

Another assumption I'll use is that you can represent your strings in 8 bit characters. Unicode and MBCS present problems about the # of bytes needed to represent any character that, in some cases, cause characters to be variable length. That makes things far more difficult...

Here's what I see is the problem:
(Within a directory...) your examples demonstrate that there are one or more substrings within each filename that match (for any given cluster) - and your need is to be able to extract these substrings.

Here is a collection of techniques to detect similarities / differences between file names:

1. If you represent the strings as arrays of bytes you can then "subtract" one string from another:

String A: 10,000 Maniacs - MTV Unplugged - 02 - Eat for Two.mp3
String B: 10,000 Maniacs - MTV Unplugged - 03 - Candy Everybody Wants.mp3
Diff : 0000000000000000000000000000000000.000.0.......................

where a dot represents the numerical value of the difference of the ASCII characters

From this you can easily detect that this pair start with a long string of characters that match - up to a common position; after that there's a short match string and then it's all unmatched.
Note that since track numbers may not have leading zeros they can be variable length. To handle this you'll have to do this comparison three times (for every pair of file names compared): aligned, left shift filename 2, right shift filename 2 (each shift is by one character position). Of the three comparisons - one will show up with a long batch of zeros and the other two will not.

Your last example presents a further shift problem - where the variable stuff (which is likely to be variable length) is at the beginning. For these, you'll need to align the strings on the end, not the beginning; you'll then detect a bunch of consecutive zeros at the end, not the beginning.

2. So far, so good - but if you've 500 files in one directory, you could be doing 3*500*500 = 750,000 comparisons. Not good. So now you need to group similar files. And you need to do this before you've found the common parts of file names. For this, I'd create four hash tables - designed so that you are likely to get multiple hits on a given hash value (so you'll store a list of the hits under any given hash). The three tables will hash distinct parts of each file name:
A. the first four alpha characters of the file name (i.e. skip leading digits, punctuation and spaces).
B. the last four alpha characters of the file name
C. the string of alpha characters between the first pair of "-" characters, if such a string exists (don't bother adding names to the hash table if they don't hit).
D. like C, but between the last pair of "-" characters.

If there are other commonly used characters that punctuate parts of the file names, allow them as valid alternatives to "-" in C & D; you might create distinct hash tables for each distinct punctuation character.

By eliminating digits we eliminate track numbers from the hash tables - which are variable (and variable length) and can sometimes mimic band or album names - and so eliminate the need to shift before generating the hash.
By hashing on first & last & between delimiters we're generating indexes that are likely to hit on some recurring substring: all file names that share the same substring will land in the same hash bucket. Now your string differencing only needs to happen on the groups of file names within any given bucket - and you can safely skip making comparisons between files in different buckets from each other.

Note that some groups of file names may show up in more than one hash table - where a file name is like your 10,000 Maniacs example, both the band name and the album name will form a cluster in a hash table, but the band name will cluster in hash table A and the album name will cluster in table C.

Also note that you may need to look at limit cases - where a band name, or album name, contains fewer than four characters. In this case you would likely find that the four selected characters land in different hash buckets because of a differing first or last character. If you know you have cases where this might happen, you could consider modifying the 4 character rule to be '4 characters, unless there are fewer before I hit a digit or punctuation mark'.

Hope some or all of this helps. I'll be looking to see whether it works for you...

Griffin




Use Case:
We have music files that describe, in their filename, attributes of the
music.
We do not know a general pattern that applies to all filenames -- but
we do know that filenames that are clustered together (by for example
directory) will, most likely, have the same filename pattern.

Here is an example:
10,000 Maniacs - MTV Unplugged - 01 - These are Days.mp3

on its own, we can deduce little, except perhaps that the 01 is a
tracknumber (see my previous post). But if we know that all filenames
in the same directory, belong to the same album, but will have
different titles, then maybe, just maybe we can xform this filename
into its component parts.

So, if we have in our possession say, 2 more filenames:
10,000 Maniacs - MTV Unplugged - 02 - Eat for Two.mp3
10,000 Maniacs - MTV Unplugged - 03 - Candy Everybody Wants.mp3

Then we should be able to parse all the filenames into:
<var1>,<var2>,<tracknum>,<title>

Even if we don't know whether 10,000 Maniacs or MTV Unplugged is the
name of the Album or the Artists (this bit can be deduced by other
heuristics).

Now, it may be that a cluster of filenames has fewer properties:
4 Non Blondes - Train.mp3
4 Non Blondes - No Place Like Home.mp3

In this case the 4 is part of the Artist name which can be deduced,
since two tracks wouldn't have the same tracknumber.

As more example clusters:
01 What's the Matter Here.mp3
02 Hey Jack Kerouac.mp3

50 Cent - Back Down.mp3
50 Cent - Gotta Get.mp3

10 Just Like Anyone Aimee Mann Bachelor No. 2 Rock.mp3
09 Driving Sideways Aimee Mann Bachelor No. 2 Rock.mp3

This is doing my head in. I just can't figure out how to attack the
problem - even though my instincts say Regex should be the
tool....H-E-L-P!! Any ideas anyone?
 
S

skavan

Very intresting approach.
You say "If you represent the strings as arrays of bytes you can then
"subtract" one string from another"...
Forgive the dumb question - but how do you subtract one byte array from
another in vb.net (or C#)?
Standard operators don't work!

Seems to me, I should take your byte array concept and do the
following:
1. Assign each filename to a string: s1, s2
2. Determine which string is longer and by how many chars: N (assume s1
is longer in this example).
3. Add N chars to end of s2 and place in s3
4. Add N chars to start of s2 and place in s4
5. Convert s1,s3 and s4 to bytearrays: b1, b3, b4
6. Subtract b3 from b1. If it starts with 0's it's CASE A), if not it's
(CASE B).
7. Subtract b4 from b1. If it ends with 0's (CASE C), if not it's (CASE
D).
8. CASE A + CASE C = fixed fields at eather end, variable in the middle
9. CASE A + CASE D = fixed fields at the beginning, variable at the
end.
10. CASE B + CASE C = variable field at the beginning, thereafter
fixed.
11. CASE B + CASE D = ERROR!!! no fixed matches...or variable at both
ends!

12. at this point, one can begin pulling the fixed fields and parsing
them.

Make sense?

Back to the question...short of iterating through the byte array, one
item at a time, how does one subtract two byte arrays? If I have to
iterate, why bother converting from string to byte array?
 
S

skavan

....oh and to handle the 11 vs 1 issue (2 chars vs 1), maybe I convert
all numbers to XX (regex) before I start the sequence above?
 
Top