Identifying full-width/zenkaku (Japanese) characters within a document...

mlg4035 · Sep 27, 2004

Hi all,

I work in the translation business (Japanese -> English) and have been
tasked with creating a Word (Office XP) macro to identify stray
double-byte (zenkaku) characters that have been missed during the
translation process.

We use Trados 6.5 in conjunction with Word and invariably we find a few
double-byte characters or words mixed in with the rest of the
single-byte text that we need to find and change.

My group leader created a macro that simply examines each word, one by
one (!), to determine if it contains double-byte characters or not, and
if it does, it flags it with '@' and changes the font color to red so
we can go through later and manually correct the text.

The following is a code snippet that contains the loop the macro uses
to go through the text word by word:

Code
-------------------

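' Total number of words; grows as marker characters are inserted.
WordCount = ActiveDocument.Content.Words.Count

i = 1

While i < WordCount + 1

    Application.StatusBar = i & "/" & WordCount & " processing words..."

    ' LenB counts 2 bytes per character for a VBA string; after StrConv the
    ' count is in the system code page (1 byte per single-byte character).
    LenOrgWord = LenB(ActiveDocument.Content.Words(i))
    LenCvtdWord = LenB(StrConv(ActiveDocument.Content.Words(i), vbFromUnicode))

    ' Equal lengths: every character in the word is double-byte.
    If LenOrgWord = LenCvtdWord Then

        With ActiveDocument.Words(i)
            .InsertBefore ("＠")
            .Font.Color = wdColorRed
        End With

        ' Skip the marker just inserted, which the macro treats as an extra word.
        i = i + 1
        WordCount = WordCount + 1

    End If

    i = i + 1

Wend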

-------------------

Needless to say, this is extremely SLOW when you have a 300+ page
document to check! So I am wondering if perhaps there is a more
efficient way of going about this?

I was tempted to simply search using Find and setting Format>Font to
'ＭＳ 明朝' (MS Mincho), but I can't figure out how to specify that I'm only
looking for full-width (zenkaku) ＭＳ 明朝 characters and not
half-width (hankaku) ＭＳ 明朝 characters (i.e. as long as we can find
all the zenkaku characters in the document, it doesn't matter if
they're using a Japanese font).

Is there, perhaps, a regex I can use with Find (e.g. ([A-Z]*), etc.)?

Thanks in advance for your help.
MLG4035

mlg4035 · Sep 28, 2004

Well, then let's put it another way:

How can I find all non-ASCII, non-ANSI characters in a document,
excluding formatting marks (paragraph marks, tabs, etc.)?
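
Stated as code, that test might look roughly like the sketch below. It is only an illustration: the names IsWideChar and FlagWideChars are made up, "non-ANSI" is assumed to mean a code point above 255 (so typographic quotes and dashes, which sit above 255 in Unicode, would be flagged too), and it has not been timed on a 300-page document.

```vb
' Hypothetical helper: True if the first character of ch has a
' code point above 255. Paragraph marks, tabs and other control
' characters fall below that, so they are never flagged.
Function IsWideChar(ByVal ch As String) As Boolean
    Dim code As Long
    If Len(ch) = 0 Then Exit Function
    ' AscW returns a signed Integer, so mask it into 0..65535.
    code = AscW(ch) And &HFFFF&
    IsWideChar = (code > 255)
End Function

' Hypothetical driver: colors every flagged character red.
Sub FlagWideChars()
    Dim ch As Range
    ' For Each walks the collection once instead of
    ' looking up an item by index on every pass.
    For Each ch In ActiveDocument.Characters
        If IsWideChar(ch.Text) Then ch.Font.Color = wdColorRed
    Next ch
End Sub
```

Unlike the StrConv comparison, this check does not depend on the system code page, but it also cannot tell half-width katakana from full-width characters.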

Using the built-in Find function I have managed to get pretty close
using this wildcard string for 'Find what':

Code:
--------------------

[!^0013-^0255^t^m^x^z^l^n^11\@\* ]*

--------------------

But it is still finding some sort of zero-width character within every
cell of every table just before the carriage return. It's not
looking at the carriage returns themselves (because the wildcard/regex
ignores those), so I must be missing some sort
of hidden/zero-width mark within the tables...
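
One way to see what is actually being matched is to print the code points of whatever Find has selected. A minimal sketch, with DumpSelectionCodes as a made-up name:

```vb
' Hypothetical diagnostic: run it while the Find match is selected.
' Output goes to the Immediate window of the VBA editor (Ctrl+G).
Sub DumpSelectionCodes()
    Dim i As Long
    Dim s As String
    s = Selection.Text
    For i = 1 To Len(s)
        ' Mask the signed Integer from AscW into 0..65535.
        Debug.Print i, AscW(Mid$(s, i, 1)) And &HFFFF&
    Next i
End Sub
```

In Word's object model the text of a table cell ends with Chr(13) & Chr(7), so a 7 in the output would point to the end-of-cell marker, which the ^0013-^0255 range does not exclude. That is a guess until the output confirms it.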

Any ideas? Thanks.
MLG4035
