textbox to normal text

Jack Sons

Hi all,

I scanned a document of many pages. The result (an RTF file) looks fine, but
in reality the text I see is not "text in a document" but text in textboxes.

I really need this to convert to text "directly in the document", like in
any "normal" document. I mean that it will be as if I typed it directly into
the document.

Of course I could select (highlight) the text in the first textbox and then
paste it into a new document (a doc-file), do the same with the text of the
next textbox, paste it below the first text in the new document, etc. I tried,
and did it for a lot of textboxes, but it will be very tedious to do it with the
whole document because of the many hundreds - maybe thousands - of
textboxes, some of which contain only a single line of text.

Also there is a strange effect: when I try to "control c - control v" the
highlighted text of a textbox to the other document, suddenly it is not the
text that is copied to the new document, but the whole textbox, and so it
just moves the problem from one document to the other.

Can anyone show me a way out? Perhaps with VBA it will be possible to
convert all textboxes at once to normal text.

I am in very urgent need of advice. Please help.

Jack Sons
The Netherlands
 
Graham Mayor

OCR software that formats the document using text boxes is a nightmare to
edit. You might find it simpler to use the plain text output of the software
and apply your own editing.


--
<>>< ><<> ><<> <>>< ><<> <>>< <>><<>
Graham Mayor - Word MVP

My web site www.gmayor.com

<>>< ><<> ><<> <>>< ><<> <>>< <>><<>
 
Greg Maxey

Jack Sons,

Yes, Graham is probably right. I cobbled together the following, which first
converts textboxes to frames and then removes the frames.

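Sub ScratchMacro()
    'Convert textbox text to plain text
    Dim i As Long
    'Pass 1: convert every textbox to a frame. Loop backwards by index,
    'because ConvertToFrame removes the shape from the Shapes collection
    'and a For Each loop over a changing collection can skip items.
    For i = ActiveDocument.Shapes.Count To 1 Step -1
        If ActiveDocument.Shapes(i).Type = msoTextBox Then
            ActiveDocument.Shapes(i).ConvertToFrame
        End If
    Next i
    'Pass 2: strip borders and shading, then delete every frame.
    'Frame.Delete removes the frame but leaves its text in the document.
    For i = ActiveDocument.Frames.Count To 1 Step -1
        With ActiveDocument.Frames(i)
            .Borders.Enable = False
            With .Shading
                .Texture = wdTextureNone
                .ForegroundPatternColor = wdColorAutomatic
                .BackgroundPatternColor = wdColorAutomatic
            End With
            .Delete
        End With
    Next i
End Sub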
 
Jack Sons

Graham,

I am an absolute newbie to scanning. Some months ago I bought an HP scanner
that was rather expensive at the time (the one with the detached glass frame,
HP 4670) and have now used it for the first time. I just put the glass frame
over the book to scan some 50 pages to MS Word and yes, after the scanning
process was completed I found the resulting RTF document on my PC screen.

I know it sounds stupid, but I have no idea what you mean by "use the plain
text output of the software and apply your own editing".

My own editing, that's what I want. But how do I get "the plain text output
of the software"? Because I use XP there was no need to install any software
(if I remember correctly); I just plugged in the scanner and, once XP
recognised it, it worked.

Please enlighten me on how to get the plain text output, which is apparently
exactly what I need.

A thousand thanks in advance.

Jack.
 
Jack Sons

Greg,

Thank you for your macro, it worked.

How did it work? I think it converts each textbox to a frame without
(visible) borders. What is the essential difference between a textbox and a
frame?
And what is done with the frames? I can't find them in the result. To me it
looks like a normal document, without any objects, just characters as it
should be.

Would using "the plain text output of the software", as Graham advised
(if I knew how to do that), give a different result?

Before and after the use of the macro the resulting document is an RTF file
(result.rtf). What does that extension imply? Can I rename it as a
doc-file (result.doc) without repercussions?

Last question (for now): why does the scanning process result in textbox
output instead of "normal text"?

Jack.
 
Graham Mayor

I am not familiar with the workings of your particular scanner or software,
but there will certainly be an option to scan to text rather than Word. Try
the help file.

 
JulieD

Hi Jack

I just wandered off to look at your product's documentation on the HP site
(www.hp.com) and the user manual isn't that helpful :) However, you can
have a "real time chat" with a support technician on the site (as far as I
can tell it's free!), so that might be the thing to do. They'll (hopefully)
be able to take you step by step through the process of scanning and getting
the output you want.

Cheers
JulieD
 
Greg Maxey

Jack,

I am afraid that my usefulness to you has about run its course :)

The code looks at all shapes. If a shape is a textbox, it converts it
to a frame, removes any borders and fill effects from the frame and then
deletes the frame, leaving the text. I found through experimentation that if
I just deleted the frames, any border and fill effects in the frame
would be transferred to the text paragraphs.
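
A minimal sketch of the same steps done in a single pass, for anyone who wants to leave frames that were already in the document untouched. It relies on ConvertToFrame returning the new Frame object, so only frames that came from textboxes are cleaned and deleted. The Sub name is made up for illustration, and this is an untested variation rather than the macro posted above:

```vba
Sub TextBoxesToTextSketch()
    'Hypothetical name; a sketch, not the macro posted in this thread
    Dim i As Long
    Dim oFrame As Frame
    'Loop backwards because converting a shape removes it from Shapes
    For i = ActiveDocument.Shapes.Count To 1 Step -1
        If ActiveDocument.Shapes(i).Type = msoTextBox Then
            'ConvertToFrame returns the frame that replaces the textbox
            Set oFrame = ActiveDocument.Shapes(i).ConvertToFrame
            With oFrame
                'Clear borders and fill so they do not carry over to the text
                .Borders.Enable = False
                .Shading.Texture = wdTextureNone
                .Shading.ForegroundPatternColor = wdColorAutomatic
                .Shading.BackgroundPatternColor = wdColorAutomatic
                'Deleting the frame leaves its text in the document
                .Delete
            End With
        End If
    Next i
    Set oFrame = Nothing
End Sub
```

It would be wise to try this on a copy of the document first.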

I will have to defer to others as to the technical difference between a
frame and textbox.

RTF is, I think, "Raw Text Format." I have never monkeyed around very much
with different types of text, but why don't you just try saving your RTF
file as a Word .doc and see what happens :)

I have a hard time figuring out the workings of a simple screw, so I can't
be of much help with the workings of your scanner. Sorry.
 
Jay Freedman

Hi Jack,

I'll try to follow up Greg's musings and shed what light I can...

In the macro, the line ".Delete" removes the frame and leaves the
text. That should give you what you need -- plain text -- but there
may be a wrinkle, which I'll explain after a bit.

A frame and a textbox are similar in some ways, but the big difference
is that a textbox is in the "drawing layer" while the frame is in the
"text layer". That is, Word thinks of the textbox as a sort of
picture, while a frame is more like special formatting of text. You
can include a frame as part of a paragraph style, which you can't do
with a textbox. The ability to transform a textbox into a frame is
truly magical and involves some very fancy programming inside Word.

RTF is actually "Rich Text Format", and it's a way to use a file of
plain text to describe all sorts of formatting. If you open an RTF
file in NotePad, you'll see a ton of codes in braces that describe
fonts, page locations, and lots of other things. When you tell Word to
open an RTF file, a special converter program reads all those codes
and applies the formatting to the text part, resulting in what looks
like a regular Word document. You can then save that as a .doc file,
whose structure is completely different.
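
As a rough sketch of that last step: renaming result.rtf to result.doc only changes the name, while the contents stay RTF, so the conversion has to go through Word's save command. The macro below assumes the RTF file is the active document and has a ".rtf" extension. The Sub name is made up for illustration.

```vba
Sub SaveRtfAsDocSketch()
    'Hypothetical name; assumes the active document is the .rtf file
    Dim sName As String
    sName = ActiveDocument.FullName
    If LCase$(Right$(sName, 4)) <> ".rtf" Then
        MsgBox "The active document is not an .rtf file."
        Exit Sub
    End If
    'Same folder and base name, with a .doc extension
    sName = Left$(sName, Len(sName) - 4) & ".doc"
    'wdFormatDocument asks Word to write its own binary .doc format
    ActiveDocument.SaveAs FileName:=sName, FileFormat:=wdFormatDocument
End Sub
```

Doing File > Save As by hand and choosing "Word Document" as the file type gives the same result.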

When you scan a document, the initial result is just a picture of the
page. Many scanners will let you save that as a graphics file (usually
.tif or .jpg). You feed that picture into an optical character
recognition (OCR) program, which may be part of the scanner software
or may be a separately installed program. The output of the OCR is
text.

In the early days, you were doing well to get just a plain-text
reading of the document, with headers and footers and pictures all
jammed in there. As OCR programmers got better, they started offering
output of a word processing file that looked exactly (well, more or
less) like the original, with the proper fonts, bold/italic, headers,
and so forth. In order to get the stuff positioned correctly on the
page, they resorted to textboxes -- but that's really hard to deal
with when you want to edit the document.

Now the wrinkle... Every graphic object in Word's drawing layer has an
"anchor", a spot in the regular text to which it's attached. (You can
see the anchor symbol in the left margin of Page Layout view if you go
to Tools > Options > View and check "Object anchors", then select a
textbox or floating picture.) When you convert the textbox to a frame
and then delete the frame, the text inside gets dumped into the
regular text at the anchor position.

Many OCR programs put a single paragraph mark on a page, and anchor
all the textboxes on the page to that paragraph. When you run the
macro, the various chunks of text appear in the order in which their
anchors occurred in the original paragraph, which will probably be
more-or-less random. You're then left to untangle the spaghetti. :-(

This is why Graham's suggestion to output the scan (from the OCR
program) as plain text is a good one. You may lose the "looks just
like the original" formatting, but you'll also never create the
textboxes. This should make your editing job a whole lot simpler. Look
through the OCR program and its help file to find out where you can
turn off formatted output.
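
If you want to check how the textboxes are anchored before running the conversion macro, something along these lines may help. It is only a sketch: the Sub name is made up, and it simply switches on the anchor display and prints, for each textbox, the character position of its anchor, its position on the page, and the start of its text to the Immediate window of the VBA editor (Ctrl+G).

```vba
Sub ListTextBoxAnchorsSketch()
    'Hypothetical name; a diagnostic sketch that changes nothing in the text
    Dim oShp As Shape
    'Same setting as Tools > Options > View > "Object anchors"
    ActiveWindow.View.ShowObjectAnchors = True
    For Each oShp In ActiveDocument.Shapes
        If oShp.Type = msoTextBox Then
            'Anchor.Start is the character position of the anchor paragraph
            Debug.Print oShp.Name; " anchor at "; oShp.Anchor.Start; _
                " top="; oShp.Top; " left="; oShp.Left; " text: "; _
                Left$(oShp.TextFrame.TextRange.Text, 30)
        End If
    Next oShp
End Sub
```

If many textboxes report the same anchor position, that is the situation described above, where the text may not come out in reading order.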
 
Jack Sons

Greg,

Your help was enormous!

What you wrote ("I found through experimentation that if
I just deleted the frames then any border and fill effects in the frame
would be transferred to the text paragraphs.") enlightened me; now I
understand. Of course I thought that by deleting the frame and its border
one would also lose the text. Apparently that is not the case.

Thanks again.

Jack.
 
Jack Sons

Jay,

Your answer is absolutely clear; now I understand what is going on. I am
very grateful to you.

I hope to find the plain text output of the software when I am not so busy.
Greg's macro worked fine; as far as I could see, all the text came "on paper"
in the correct sequence.

Jack.
 
Greg Maxey

Jack,

To delete both the text and the frame, it would be:

ActiveDocument.Frames(i).Range.Delete 'deletes the text
ActiveDocument.Frames(i).Delete 'deletes the frame
 
Jack Sons

Greg,

ActiveDocument.Frames(i).Delete is what your macro did; it deletes the frame
and leaves the text. I see it now in your code.

At first I thought that deleting the frame would also delete the text. But
now I understand that a "frame" is just a rectangle of borderlines within the
text layer of the document.

ActiveDocument.Frames(i).Range.Delete deletes the text and keeps the
frame intact. So an empty frame (borders - if visible - without any text
inside) is the result?

Does that mean that you can't delete frame and text with one instruction?

Jack.
 
Greg Maxey

Jack,

That is my understanding, but I don't like to use "can't" as I am just a
novice with VBA.

Actually I suppose that I would write the code:

With ActiveDocument.Frames(i)
    .Range.Delete
    .Delete
End With

If we are wrong, Jay or one of the other Senseis will be along to set us
straight ;-)

Here is a little piece of code that creates a frame with text. If you
leave out the part of the code that inserts the text, you just have an empty
frame:

Public Sub FrameMaker()
    Dim MyFrame As Frame
    Dim MyRange As Range

    Set MyRange = Selection.Range
    Set MyFrame = ActiveDocument.Frames.Add(MyRange)

    With MyFrame
        'Add color
        .Shading.BackgroundPatternColor = wdColorLightYellow
        'Size it
        .WidthRule = wdFrameExact
        .Width = 460
        .HeightRule = wdFrameExact
        .Height = 20
        'Position it
        .RelativeHorizontalPosition = wdRelativeHorizontalPositionPage
        .RelativeVerticalPosition = wdRelativeVerticalPositionParagraph
        .HorizontalPosition = 75
        .VerticalPosition = 0
        'Add some text
        .Range.InsertAfter "This is a macro created frame!"
    End With

    Set MyFrame = Nothing
End Sub
 
