Telling the language of a Unicode string

Nir

Hello,

I have a simple Unicode string in my C# application. I would like to
tell which languages this string contains. For example, if the string
has some letters in Greek and some in English, then I would like to tell
that to my user.

Obviously I can't distinguish languages that share the same set of letters
(for example German and English), but that's OK. Telling which sets of
letters are present is enough for me.

I thought of using the following table:
http://msdn2.microsoft.com/en-us/library/ms776439.aspx

And then iterate over each letter in my string to check which Unicode
subrange it falls into (and from that know what language the letter
belongs to).
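
A minimal sketch of that loop, assuming a small hand-picked table rather than the full list from the linked page. The class name, method name, and the choice of ranges are illustrative only; the range boundaries are the standard Unicode block boundaries.

```csharp
using System;
using System.Collections.Generic;

public static class UnicodeRangeScanner
{
    // Illustrative, deliberately incomplete table of Unicode blocks.
    private static readonly (char First, char Last, string Name)[] Ranges =
    {
        ('\u0000', '\u007F', "Basic Latin"),
        ('\u0080', '\u00FF', "Latin-1 Supplement"),
        ('\u0370', '\u03FF', "Greek and Coptic"),
        ('\u0400', '\u04FF', "Cyrillic"),
        ('\u0590', '\u05FF', "Hebrew"),
        ('\u0600', '\u06FF', "Arabic"),
    };

    // Returns the names of the listed ranges that contain at least one
    // letter of the input. Letters outside the table are ignored, and so
    // are surrogate halves (characters beyond the BMP).
    public static SortedSet<string> RangesPresent(string text)
    {
        var found = new SortedSet<string>(StringComparer.Ordinal);
        foreach (char c in text)
        {
            if (!char.IsLetter(c)) continue; // skip digits, spaces, punctuation
            foreach (var block in Ranges)
            {
                if (c >= block.First && c <= block.Last)
                {
                    found.Add(block.Name);
                    break;
                }
            }
        }
        return found;
    }
}
```

For a string mixing English and Greek letters, RangesPresent returns "Basic Latin" and "Greek and Coptic". Note that this reports scripts, not languages.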

If there is a better way, I'd appreciate it if you shared it.

Thank you very much,
Nir Levy.
 
Ignacio Machin ( .NET/ C# MVP )

Hi,


Not that I know of. You will have to iterate over the input string, that's
for sure.
 
Jesse Houwing

The SpamAssassin project has a function that guesses the language of an email
based on combinations of characters that only go together in certain
languages. It can give you a guess at the language. It shouldn't be too hard
to port that from Perl to .NET.

Jesse


 
Arne Vajhøj

Nir said:
I have a simple Unicode string in my C# application. I would like to
tell which languages this string contains. For example, if the string
has some letters in Greek and some in English, then I would like to tell
that to my user.

Obviously I can't distinguish languages that share the same set of letters
(for example German and English), but that's OK. Telling which sets of
letters are present is enough for me.

Actually you can distinguish between English and German.

If you look at the frequency of single characters and
two-character combinations then you will get a strong
indication of language.
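
A rough sketch of the two-character part of that idea, assuming you supply a sample text for each candidate language yourself. Nothing here ships with .NET: the class and method names are hypothetical, and cosine similarity is just one simple way to compare two frequency profiles.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BigramGuesser
{
    // Relative frequencies of adjacent letter pairs (case-insensitive).
    public static Dictionary<string, double> Profile(string text)
    {
        var counts = new Dictionary<string, int>();
        string lower = text.ToLowerInvariant();
        int total = 0;
        for (int i = 0; i + 1 < lower.Length; i++)
        {
            if (!char.IsLetter(lower[i]) || !char.IsLetter(lower[i + 1])) continue;
            string pair = lower.Substring(i, 2);
            counts.TryGetValue(pair, out int n); // n is 0 when the pair is new
            counts[pair] = n + 1;
            total++;
        }
        return counts.ToDictionary(kv => kv.Key, kv => (double)kv.Value / total);
    }

    // Cosine similarity between two profiles: 0 means no shared pairs.
    public static double Similarity(Dictionary<string, double> a, Dictionary<string, double> b)
    {
        double dot = 0, normA = 0, normB = 0;
        foreach (var kv in a)
        {
            normA += kv.Value * kv.Value;
            if (b.TryGetValue(kv.Key, out double other)) dot += kv.Value * other;
        }
        foreach (double v in b.Values) normB += v * v;
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Returns the key of the sample whose profile is closest to the text,
    // or null when no samples are given.
    public static string Guess(string text, IDictionary<string, string> samplesByLanguage)
    {
        var target = Profile(text);
        string best = null;
        double bestScore = -1;
        foreach (var kv in samplesByLanguage)
        {
            double score = Similarity(target, Profile(kv.Value));
            if (score > bestScore) { bestScore = score; best = kv.Key; }
        }
        return best;
    }
}
```

The quality of the guess depends entirely on the sample texts: short samples or short inputs give unreliable results, so treat the answer as a hint.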

Arne
 
Mihai N.

Nir said:
I have a simple Unicode string in my C# application. I would like to
tell which languages this string contains. For example, if the string
has some letters in Greek and some in English, then I would like to tell
that to my user. ....
If there is a better way, I'd appreciate it if you shared it.

Good starting point: http://www.sfs.uni-tuebingen.de/iscl/Theses/kranig.pdf
It also has some good links in the "References" section.

Searching for <<language identification perl>> and <<Language Identification
Algorithm>> got me a lot of relevant links.
 
UL-Tomten

You could perhaps do something with regular expressions.
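
One possible reading of the regular expression idea, sketched with the named Unicode blocks that .NET regular expressions support (for example \p{IsGreek}). The class name and the short list of blocks are assumptions for illustration; extend the list to whatever blocks you care about.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexBlockCheck
{
    // A few of the block names accepted by the .NET regex engine.
    private static readonly string[] Blocks =
    {
        "IsBasicLatin", "IsGreek", "IsCyrillic", "IsHebrew", "IsArabic",
    };

    // Returns the block names for which the text contains at least one letter.
    public static List<string> BlocksPresent(string text)
    {
        var present = new List<string>();
        foreach (string block in Blocks)
        {
            // The lookahead (?=\p{L}) requires the matched character to be
            // a letter, so digits and punctuation in Basic Latin do not count.
            string pattern = @"(?=\p{L})\p{" + block + "}";
            if (Regex.IsMatch(text, pattern)) present.Add(block);
        }
        return present;
    }
}
```

This scans the string once per block, which is fine for short strings; for long input a single pass over the characters is likely cheaper.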

An alternative solution could be to iterate over all
characters, keep only the ones that are letters (using
Char.GetUnicodeCategory), and then count how many fall into each Unicode
range, considering the most frequent range to be the winner. For
specific ranges, you could then specialize to distinguish further.
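
The counting approach above could be sketched roughly like this. The range names and the RangeOf helper are hypothetical and cover only a handful of ranges; everything else lands in "Other".

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class DominantRange
{
    // Coarse, illustrative mapping from a character to a range name.
    private static string RangeOf(char c)
    {
        if (c <= '\u024F') return "Latin"; // Basic Latin through Latin Extended-B
        if (c >= '\u0370' && c <= '\u03FF') return "Greek";
        if (c >= '\u0400' && c <= '\u04FF') return "Cyrillic";
        if (c >= '\u0590' && c <= '\u05FF') return "Hebrew";
        return "Other";
    }

    // Counts letters per range; non-letters are skipped by category.
    public static Dictionary<string, int> CountLetters(string text)
    {
        var counts = new Dictionary<string, int>();
        foreach (char c in text)
        {
            switch (char.GetUnicodeCategory(c))
            {
                case UnicodeCategory.UppercaseLetter:
                case UnicodeCategory.LowercaseLetter:
                case UnicodeCategory.TitlecaseLetter:
                case UnicodeCategory.ModifierLetter:
                case UnicodeCategory.OtherLetter:
                {
                    string name = RangeOf(c);
                    counts.TryGetValue(name, out int n); // 0 when first seen
                    counts[name] = n + 1;
                    break;
                }
            }
        }
        return counts;
    }

    // Returns the range with the most letters, or null if there are none.
    public static string Winner(string text)
    {
        string best = null;
        int bestCount = 0;
        foreach (var kv in CountLetters(text))
        {
            if (kv.Value > bestCount) { best = kv.Key; bestCount = kv.Value; }
        }
        return best;
    }
}
```

CountLetters also answers the original question of which ranges are present at all: its keys are exactly the ranges that occurred. Ties in Winner are broken arbitrarily in this sketch.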
 
