How to write Unicode characters to an RTF document?

john

I need to produce an RTF document which is filled with
data from a database.
I've created an RTF document in WordPad (a template,
so to speak) which contains 'placeholders', for example
'<dd01>', '<dd02>', etc.

I read the entire template into a StringBuilder and
then perform a simple 'replace' on it, using a Hashtable.
The keys in the Hashtable are strings representing the
placeholders and the Hashtable's values contain data
from the database.

After the replace-action, I write the content of the
StringBuilder to a file with extension '.rtf'
(See code below).

It works like a charm, I can read the file with Word
(or WordPad) and it looks alright.

---

But ... Problems arise when the data from the database
contains characters like é ë è ï ó ö, etc. (Are these
called 'unicode-characters' ?)

These characters get converted to 'gibberish' when viewing
the generated rtf-doc in Word.
Then I thought that I probably needed to add 'Encoding.Unicode'
when writing the file, but when I do that, the generated file
is no longer recognized by Word as a valid RTF-doc.
Word then complains, 'this is an encoded file, install importfilters,
etc ...'.


My two questions now are :

1. How can I write Unicode characters to my RTF template 'the
right way'?

2. Why does Word no longer recognize a simple RTF document
after it was written using 'Encoding.Unicode'?
I thought an RTF document is basically just plain text (although
containing a lot of mark-up code) and that by using 'Encoding.Unicode',
I'm only saying, 'this plain text may contain Unicode characters'.
Right?

//---

This is the code :

private void writeForm(string pathTemplate, string pathTempFile, Hashtable formData)
{
    TextReader syncReader = TextReader.Synchronized(new StreamReader(pathTemplate));
    TextWriter syncWriter = TextWriter.Synchronized(new StreamWriter(pathTempFile));

    StringBuilder emptyTemplate = new StringBuilder(syncReader.ReadToEnd());
    StringBuilder filledDoc = fillTemplate(emptyTemplate, formData);
    syncWriter.Write(filledDoc);

    syncReader.Close();
    syncWriter.Close();
}

private StringBuilder fillTemplate(StringBuilder doc, Hashtable formData)
{
    IDictionaryEnumerator myEnumerator = formData.GetEnumerator();
    while (myEnumerator.MoveNext())
    {
        System.Diagnostics.Debug.WriteLine("1 : " + (string) myEnumerator.Value);
        doc = doc.Replace((string) myEnumerator.Key, (string) myEnumerator.Value);
    }
    return doc;
}

//---
 
Jon Skeet [C# MVP]

> But ... Problems arise when the data from the database
> contains characters like é ë è ï ó ö, etc. (Are these
> called 'unicode-characters' ?)

Well, all characters in .NET are Unicode.

> These characters get converted to 'gibberish' when viewing
> the generated rtf-doc in Word.

Okay. I think you need to find the specifications for RTF and work out
which encoding to use. By default, StreamWriter will be using UTF-8. It
sounds like that's no good for you, but you shouldn't just pick
encodings at random - you could find one which appears to work, but
fails with some data you don't test it with.
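
To make the 'gibberish' concrete, here is a minimal sketch of what happens when UTF-8 bytes are read back with a single-byte code page. The class and method names are made up for illustration, and ISO-8859-1 is only an assumed stand-in for whatever code page the reader actually applies.

```csharp
using System;
using System.Text;

static class Utf8Demo
{
    // Hypothetical helper: encodes 'text' as UTF-8, then decodes the same
    // bytes as ISO-8859-1, mimicking a reader that assumes one byte per
    // character.
    public static string MisreadUtf8AsLatin1(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("iso-8859-1").GetString(utf8Bytes);
    }
}
```

Calling Utf8Demo.MisreadUtf8AsLatin1("é") gives the two characters "Ã©", because é is stored as the two bytes C3 A9 in UTF-8.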

Looking at the docs at www.wotsit.org, it looks like it *is* possible
to specify encodings, but that Word doesn't understand UTF-8 encoded
text. You may need to "manually" encode (with \uN) characters which
aren't in the appropriate code page - I'd escape anything non-ASCII.
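
A rough sketch of that manual encoding, using RTF's \uN control word (lowercase u, N a signed 16-bit decimal). RtfText.Escape is a hypothetical helper, not part of any library, and it assumes the document keeps the default of one fallback character after each \uN (i.e. no \uc override in the template).

```csharp
using System;
using System.Text;

static class RtfText
{
    // Hypothetical helper: escapes RTF special characters and writes every
    // non-ASCII UTF-16 code unit as \uN followed by a '?' fallback.
    public static string Escape(string value)
    {
        StringBuilder sb = new StringBuilder(value.Length * 2);
        foreach (char c in value)
        {
            if (c == '\\' || c == '{' || c == '}')
            {
                sb.Append('\\').Append(c);   // literal backslash or brace
            }
            else if (c < 0x80)
            {
                sb.Append(c);                // plain ASCII passes through
            }
            else
            {
                // N is a signed 16-bit decimal, so code units above 32767
                // come out negative. '?' is what a reader without \u
                // support would display instead.
                sb.Append("\\u").Append(unchecked((short)c)).Append('?');
            }
        }
        return sb.ToString();
    }
}
```

With something like this, fillTemplate could call doc.Replace(key, RtfText.Escape(value)), and the resulting file would be pure ASCII, so the choice of encoding in StreamWriter stops mattering. Line breaks in the data are not handled here.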

> 2. Why does Word no longer recognize a simple RTF document
> after it was written using 'Encoding.Unicode'?
> I thought an RTF document is basically just plain text (although
> containing a lot of mark-up code) and that by using 'Encoding.Unicode',
> I'm only saying, 'this plain text may contain Unicode characters'.
> Right?

No - it's entirely changing what the file looks like. See
http://www.pobox.com/~skeet/csharp/unicode.html to understand what
Encodings are about.
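
As a small illustration of "entirely changing what the file looks like", this sketch dumps the bytes of the RTF header under two encodings. EncodingDemo.ToHex is a made-up helper name; the framework calls it uses are standard.

```csharp
using System;
using System.Text;

static class EncodingDemo
{
    // Hypothetical helper: returns the bytes of 'text' in the given
    // encoding as dash-separated hex pairs.
    public static string ToHex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }
}
```

EncodingDemo.ToHex("{\\rtf1", Encoding.ASCII) gives 7B-5C-72-74-66-31, whereas Encoding.Unicode (UTF-16) gives 7B-00-5C-00-72-00-74-00-66-00-31-00. A StreamWriter created with Encoding.Unicode also writes the byte order mark FF FE first, so the file no longer starts with the bytes for {\rtf that an RTF reader looks for.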
 
john

Thank you very much, Jon.
I'm gonna study your page on unicode.
 
