Identifying full-width/zenkaku (Japanese) characters within a document...

mlg4035

Hi all,

I work in the translation business (Japanese -> English) and have been
tasked with creating a Word (Office XP) macro to identify stray
double-byte (zenkaku) characters that have been missed during the
translation process.

We use Trados 6.5 in conjunction with Word, and invariably we find a few
double-byte characters or words mixed in with the rest of the
single-byte text that we need to find and change.

My group leader created a macro that simply examines each word, one by
one (!), to determine whether it contains double-byte characters, and
if it does, it flags it with '@' and changes the font color to red so
we can go through later and manually correct the text.
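
The double-byte test that the macro relies on can be shown in isolation. The following is only a minimal sketch: the name `CountDoubleByteChars` is hypothetical and not part of the macro, and it assumes the system ANSI code page is a double-byte one such as Shift-JIS (on a single-byte locale the result would always be 0).

```vba
' Hypothetical helper, not part of the original macro.
' Returns how many characters of s occupy two bytes in the system
' ANSI code page. Assumes a double-byte system code page
' (e.g. Shift-JIS on Japanese Windows).
Function CountDoubleByteChars(ByVal s As String) As Long
    ' Len counts characters; LenB on the converted string counts
    ' code-page bytes. Each double-byte character adds one extra byte.
    CountDoubleByteChars = LenB(StrConv(s, vbFromUnicode)) - Len(s)
End Function
```

A word made up only of single-byte text, such as "ABC", gives 0; any result greater than 0 means the word contains at least one double-byte character and is a candidate for flagging.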

The following is a code snippet containing the loop the macro uses
to go through the text word by word:


Code
-------------------

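WordCount = ActiveDocument.Content.Words.Count
i = 1

While i < WordCount + 1

    Application.StatusBar = i & "/" & WordCount & " Processing words..."

    ' Compare the character count with the byte count in the system
    ' code page (Shift-JIS on a Japanese system)
    WordText = ActiveDocument.Content.Words(i).Text
    LenOrgWord = Len(WordText)
    LenCvtdWord = LenB(StrConv(WordText, vbFromUnicode))

    ' More bytes than characters: the word contains a double-byte character
    If LenCvtdWord > LenOrgWord Then

        With ActiveDocument.Words(i)
            .InsertBefore ("＠")
            .Font.Color = wdColorRed
        End With

        ' The inserted mark is treated as an extra word, so skip past it
        i = i + 1
        WordCount = WordCount + 1

    End If

    i = i + 1

Wend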

-------------------


Needless to say, this is extremely SLOW when you have a 300+ page
document to check! So I am wondering if perhaps there is a more
efficient way of going about this?

I was tempted to simply search using Find and setting Format>Font to
'ＭＳ 明朝' (MS Mincho), but I can't figure out how to specify that I'm only looking
for full-width (zenkaku) ＭＳ 明朝 characters and not
half-width (hankaku) ＭＳ 明朝 characters (i.e. as long as we can find
all the zenkaku characters in the document, it doesn't matter if
they're using a Japanese font).

Is there, perhaps, a regex I can use with Find (e.g. ([A-Z]*), etc.)?

Thanks in advance for your help.
MLG4035
 
mlg4035

Well, then let's put it another way:

How can I find all non-ASCII, non-ANSI characters in a document,
excluding formatting marks (paragraph marks, tabs, etc.)?

Using the built-in Find function I have managed to get pretty close
using this wildcard string for 'Find what':


Code:
--------------------

[!^0013-^0255^t^m^x^z^l^n^11\@\* ]*

--------------------


But it is still finding some sort of zero-width character within every
cell of every table just before the carriage return. It's not
looking at the carriage returns themselves (because the wildcard/regex
ignores those), so I must be missing some sort
of hidden/zero-width mark within the tables...
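
To reproduce this search from a macro rather than the Find dialog, something along the following lines might work. It is a sketch under stated assumptions, not a confirmed fix: `FlagWildcardMatches` is a hypothetical name, and the `Chr(7)` check rests on the unconfirmed guess that the stray match is Word's end-of-cell marker (which appears as `Chr(13) & Chr(7)` in `Range.Text`, and whose code 7 lies outside the excluded `^0013-^0255` range).

```vba
' Sketch only: FlagWildcardMatches is a hypothetical name.
' Colors every wildcard match red and returns the number flagged.
Function FlagWildcardMatches(ByVal pattern As String) As Long
    Dim rng As Range
    Dim n As Long
    Dim lastEnd As Long

    Set rng = ActiveDocument.Content
    lastEnd = -1

    With rng.Find
        .ClearFormatting
        .Text = pattern
        .MatchWildcards = True
        .Forward = True
        .Wrap = wdFindStop          ' stop at the end of the document
    End With

    Do While rng.Find.Execute
        ' Safety guard: leave the loop if the search stops advancing
        If rng.End <= lastEnd Then Exit Do
        lastEnd = rng.End

        ' Assumption: matches containing Chr(7) are end-of-cell markers
        If InStr(rng.Text, Chr(7)) = 0 Then
            rng.Font.Color = wdColorRed
            n = n + 1
        End If

        rng.Collapse wdCollapseEnd  ' continue after this match
    Loop

    FlagWildcardMatches = n
End Function
```

It could be called with the same string used in 'Find what', for example `FlagWildcardMatches("[!^0013-^0255^t^m^x^z^l^n^11\@\* ]*")`. Taking the pattern as a parameter keeps the sketch independent of whether that particular wildcard string is the right one.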

Any ideas? Thanks.
MLG4035
 
