PC Review



Determine language of body of text?

 
 
Mark B
 
      23rd Oct 2008
Does anyone have a method to determine the language (e.g. en-US, fr-FR,
zh-TW etc) of a body of text?

(In my case the body of text is an email received).

Naturally the text, if not English, may have a little bit of English. I'm
just trying to get the main language used.

 
 
 
 
 
Arto Viitanen
 
      23rd Oct 2008
Mark B wrote:
> Does anyone have a method to determine the language (e.g. en-US, fr-FR,
> zh-TW etc) of a body of text?
>
> (In my case the body of text is an email received).
>
> Naturally the text, if not English, may have a little bit of English.
> I'm just trying to get the main language used.


One idea: take a dictionary of common words for each language you are
interested in, then count how often each of those words occurs in the
text. This needs several forms of each word, though, for example "word"
and "words". For some languages that is not practical, because a single
word can have so many variations.
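
A minimal sketch of the common-word idea, in Python for brevity. The word lists, the `COMMON_WORDS` table and the `guess_language` helper are made up for illustration and far too small for real use; a real list would also need the inflected forms mentioned above.

```python
import re

# Tiny illustrative lists of very common words (assumed, not exhaustive).
COMMON_WORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "it", "for", "with"},
    "fr": {"le", "la", "les", "et", "de", "des", "est", "un", "une", "pour"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "eine", "mit", "für"},
}

def guess_language(text):
    """Return the language whose common words occur most often, or None."""
    # Split into runs of letters (Unicode-aware), ignoring digits and punctuation.
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(1 for word in words if word in vocab)
        for lang, vocab in COMMON_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Because only the highest count wins, a few English words inside a mostly French email should not change the result, which matches the "main language" requirement in the question.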

But check the article "Language Trees and Zipping" by Dario Benedetto,
Emanuele Caglioti and Vittorio Loreto, downloadable from
http://xxx.uni-augsburg.de/format/cond-mat/0108530 . There also seems
to be a Perl implementation of the algorithm:
code.activestate.com/recipes/355807 . If I understood it right, a zip
archiver learns the sequences in its input, and the more it has seen
(i.e. the bigger the text), the better it compresses. So if you first
feed zip English text and then give it two texts A and B, A compresses
better than B when A is English and B is Italian.
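
A rough sketch of the compression idea, using Python's `zlib` rather than an actual zip archiver. It is not the code from the paper or the recipe. The reference texts and the names `REFERENCES`, `extra_bytes` and `closest_language` are invented for illustration, and real reference corpora would need to be much larger.

```python
import zlib

# Invented, very small reference texts; real corpora should be far larger.
REFERENCES = {
    "en": ("the quick brown fox jumps over the lazy dog and the cat sat on the mat "
           "while the rain in spain stays mainly in the plain this is the way that "
           "we have always done it and there is nothing that you can do about it "
           "because that is how the world works when you think about what it means"),
    "it": ("il gatto è sul tavolo e il cane dorme nel giardino mentre la pioggia "
           "cade sulla città questa è la strada che abbiamo sempre fatto e non c'è "
           "niente che tu possa fare perché questo è il modo in cui funziona il "
           "mondo quando pensi a che cosa significa essere qui"),
}

def compressed_size(data):
    """Size in bytes of the zlib-compressed data."""
    return len(zlib.compress(data, 9))

def extra_bytes(reference, sample):
    """Bytes the sample adds when compressed right after the reference."""
    ref = reference.encode("utf-8")
    both = ref + b" " + sample.encode("utf-8")
    return compressed_size(both) - compressed_size(ref)

def closest_language(sample, references=REFERENCES):
    """Pick the reference after which the sample compresses best."""
    return min(references, key=lambda lang: extra_bytes(references[lang], sample))
```

Note that zlib only looks back 32 KB, so with a large reference corpus only its tail would influence the result; that limit is a property of this sketch, not of the method.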

--
Arto Viitanen
 
 
Mark B
 
      23rd Oct 2008
http://code.google.com/apis/ajaxlang...tation/#Detect
 
JP
 
      23rd Oct 2008
There's some code at the link below that writes the internet headers of
an email to a text file, which you could then parse for the language.
For example, some email headers include a "Content-Type" line which
indicates the character set used.

e.g.: Content-Type: text/plain; charset=US-ASCII

http://blogs.technet.com/kclemson/ar...n-outlook.aspx
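
Once the headers are available as text, pulling out the character set might look like the sketch below, using Python's standard `email` package. The sample headers and the `header_hints` helper are made up for illustration. Reading the optional `Content-Language` header is an addition not mentioned above, and many messages do not carry it.

```python
from email import message_from_string

# Made-up headers for illustration only.
RAW_HEADERS = (
    "From: sender@example.com\n"
    "Subject: Bonjour\n"
    "Content-Type: text/plain; charset=ISO-8859-1\n"
    "Content-Language: fr-FR\n"
    "\n"
)

def header_hints(raw):
    """Return the charset and declared language found in raw headers."""
    msg = message_from_string(raw)
    return {
        # Lower-cased charset parameter of Content-Type, or None if absent.
        "charset": msg.get_content_charset(),
        # Optional header; None when the sender did not include it.
        "language": msg.get("Content-Language"),
    }
```

The charset only narrows down the script or region, so it is best treated as one hint alongside an analysis of the body text.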

HTH,
JP

 
Dmitry Streblechenko
 
      23rd Oct 2008
Look at the MailItem.InternetCodepage property (corresponds to the
PR_INTERNET_CPID MAPI property).
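
As a rough illustration of what such a value can and cannot tell you, here is a sketch that maps a few well-known Windows code page numbers to a script or region hint. It is written in Python for brevity and does not call Outlook. The table is partial, and the names `CODEPAGE_HINTS`, `GENERIC` and `codepage_hint` are invented for this example.

```python
# Partial table of Windows code page identifiers and what they usually imply.
CODEPAGE_HINTS = {
    932: "Japanese (Shift-JIS)",
    936: "Simplified Chinese (GBK)",
    949: "Korean",
    950: "Traditional Chinese (Big5)",
    1250: "Central European",
    1251: "Cyrillic",
    1252: "Western European",
    1253: "Greek",
    20127: "US-ASCII",
    28591: "Western European (ISO-8859-1)",
    65001: "Unicode (UTF-8)",
}

# Code pages shared by many languages: they say little about the language.
GENERIC = {1250, 1251, 1252, 20127, 28591, 65001}

def codepage_hint(codepage):
    """Return (description, narrows_language) for a code page number."""
    description = CODEPAGE_HINTS.get(codepage, "unknown")
    narrows = codepage in CODEPAGE_HINTS and codepage not in GENERIC
    return description, narrows
```

A value such as 950 points fairly strongly at Traditional Chinese, while 1252 or 65001 leaves the language open, so the body text still has to be analysed in those cases.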

--
Dmitry Streblechenko (MVP)
http://www.dimastr.com/
OutlookSpy - Outlook, CDO
and MAPI Developer Tool
 
Mark B
 
      23rd Oct 2008
Do all incoming emails have this, even if they don't originate from an
Outlook client?

(It's for an Outlook 2007 add-in written in C#, BTW.)

 
Dmitry Streblechenko
 
      23rd Oct 2008
Most of them. But that really tells you more about the default code page
of the sender than about the language of the text.

 
Arne Vajhøj
 
      26th Oct 2008
Mark B wrote:
> Does anyone have a method to determine the language (e.g. en-US, fr-FR,
> zh-TW etc) of a body of text?
>
> (In my case the body of text is an email received).
>
> Naturally the text, if not English, may have a little bit of English.
> I'm just trying to get the main language used.


If you are willing to write some code, then you can detect
the language (but probably not the regional dialect), using:

* a dictionary of common words
* special letters (acute and grave accents, umlauts etc.)
* the distribution of letters
* the distribution of pairs of letters
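
The letter-pair idea from the list above could be sketched like this in Python. The training sentences are invented and far too short for real use, and the names `TRAINING_TEXT`, `ngrams`, `cosine` and `guess_by_bigrams` are made up for the example. Passing `n=1` to `ngrams` gives the single-letter distribution instead, and accented letters and umlauts are counted like any other letter.

```python
from collections import Counter
from math import sqrt

# Invented training sentences; real profiles need much more text.
TRAINING_TEXT = {
    "en": "the quick brown fox jumps over the lazy dog while the children "
          "are playing in the garden and their mother is reading a book",
    "de": "der schnelle braune fuchs springt über den faulen hund während "
          "die kinder im garten spielen und ihre mutter ein buch liest",
}

def ngrams(text, n=2):
    """Count letter n-grams inside each word of the text."""
    letters = "".join(ch if ch.isalpha() else " " for ch in text.lower())
    return Counter(
        word[i:i + n]
        for word in letters.split()
        for i in range(len(word) - n + 1)
    )

def cosine(a, b):
    """Cosine similarity between two n-gram count tables."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# One bigram profile per language, built once from the training text.
PROFILES = {lang: ngrams(text) for lang, text in TRAINING_TEXT.items()}

def guess_by_bigrams(text):
    """Return the language whose bigram profile is most similar to the text."""
    sample = ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(sample, PROFILES[lang]))
```

Unlike the word dictionary, this needs no list of inflected forms, which helps with languages where a single word has many variations.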

Arne
 