Can I copy text not meant for copying?

micky

When webpages and pdf files don't permit selecting and copying text,
is there a way around that? Not counting print screen, which I use
sometimes, but won't help when the place I want to copy to only allows
text.

Thanks
 
Mayayana

| When webpages and pdf files don't permit selecting and copying text,
| is there a way around that? Not counting print screen, which I use
| sometimes, but won't help when the place I want to copy to only allows
| text.
|

Those are different issues. A webpage can do things
like blocking the right-click menu, but only if you enable
script.

A webmaster can also do things like using an image
of text. I recently saw a page where the author had
gone to great lengths to block image copying by
loading the images in a Flash program. (Without script
and Flash enabled there are no pictures on the page!)

If there's text there you should be able to get it
by disabling script. You can also view the source code
to get at the text. And in most browsers you can view
with no style, which makes the right selection easier
in some cases.
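
The "view the source to get at the text" route can also be automated. Here is a minimal sketch in Python, using only the standard library, assuming you have already saved the page source into a string. The class and function names are made up for illustration, and it only helps when the text really is in the HTML (not in an image or a Flash object).

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text from HTML, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self.chunks = []       # pieces of text, in document order

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not script or style content.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def page_text(html):
    """Return the text of an HTML string, one chunk per line."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Since this never runs the page's script, a right-click or selection blocker in the page has no effect on it.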

PDFs are different. Adobe designed the PDF format to
allow for a number of restrictions. Text copying can be
blocked. A password can be required. Etc. Those
restrictions are actually just "flags" in the PDF file. There's
not really any kind of lock. But most software respects
the flag. So, if you have a PDF with a text-copying
restriction the only option is to get software that will
bypass it. I think there is such software, but not for free.
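
To make the "flags" point concrete: as far as I understand the PDF specification, the standard security handler keeps the permissions in a single integer (the /P entry of the file's encryption dictionary), and each restriction is one bit in it. Below is a minimal Python sketch that decodes such an integer. The function name and the labels are my own, and the bit positions are taken from my reading of the spec's permissions table, so treat it as an illustration rather than a reference.

```python
# Assumed bit values for the /P integer (bit 1 is the lowest-order bit).
PERMISSION_BITS = {
    "print": 1 << 2,                  # bit 3
    "modify": 1 << 3,                 # bit 4
    "copy_text": 1 << 4,              # bit 5
    "annotate": 1 << 5,               # bit 6
    "fill_forms": 1 << 8,             # bit 9
    "accessibility_extract": 1 << 9,  # bit 10
    "assemble": 1 << 10,              # bit 11
    "print_high_quality": 1 << 11,    # bit 12
}


def decode_permissions(p):
    """Map a /P value (a signed 32-bit integer) to allowed/denied flags."""
    p &= 0xFFFFFFFF  # view the signed value as an unsigned 32-bit pattern
    return {name: bool(p & mask) for name, mask in PERMISSION_BITS.items()}
```

For example, decode_permissions(-44) reports printing and text copying as allowed but modifying and annotating as denied. Nothing in the number itself enforces anything; it is up to the viewing software to look at it and refuse.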

It's an odd issue. Since you have a right to the file you
have a right to access the text, but Adobe has tried to
mimic white collar procedure in order to impart a sense
of solidity to digital files. In doing that they've done their
best to render a PDF as an immutable file that mimics a
printed page, and is actually designed just to get business
docs transported via PC and printer rather than via postal
mail.
Unfortunately, people often restrict PDFs for no good
reason. (I once downloaded a state auto accident report
form that I had to file in triplicate, and the editing function
was blocked!)

In some cases a PDF is actually a collection of scanned
book pages. In that case there isn't any text. Your only
option is to run it through OCR software. But actually, these
days OCR software is quite good, and usually comes free
with a scanner.

There's a command line PDF extractor named XPDF.
I wrote a convenient wrapper for it here:
http://www.jsware.net/jsware/pdfconv.php5

With that you can extract text and images. As I
note on that page, Sumatra PDF can also extract
text and does a better job. XPDF is outdated.
But Sumatra doesn't extract images.

Both XPDF and Sumatra can be recompiled to
ignore restriction flags with a very small code edit.
They're both OSS. But both authors have chosen
to respect the restriction flags in their compile.
 
Paul

micky said:
When webpages and pdf files don't permit selecting and copying text,
is there a way around that? Not counting print screen, which I use
sometimes, but won't help when the place I want to copy to only allows
text.

Thanks

This company came up with a solution for PDF years ago.

http://www.elcomsoft.com/apdfpr.html

There are a couple of aspects to PDF security. One is documents
protected with a password. The other is that stupid "copy"
setting. I'm convinced that, a lot of the time, the author of
the document has left the "copy" setting at its default instead
of thinking it through.

In the past, there were a few recipes that involved using a
third-party application to "launder" the PDF (just open it and
save it again). As time passes, those kinds of holes get plugged, so
you can't expect a recipe from 2005 to still work
in 2012. You might get lucky and find a modern recipe like that,
or you might be forced to go with something like Elcomsoft
to have a fair chance.

When the "copy" setting first came out, I used to "launder"
documents with a modified copy of GhostScript, but PDFs
have come a long way since then. If you have access to the
source code for a PDF engine (an engine that supports all
the features of the language), then I would expect at
least the "no copy" feature could be turned off. (That's
because that feature relies, to a large extent, on the
"honor system". That no software writer will turn it off.)

Some of the other features might be a bit harder to crack
(if the whole document is password protected, then it's a
decryption problem, and not something that simply modifying
the source is going to fix). Then, it might depend on the
"hardness" of the encryption algorithm. Or known weaknesses.

Paul
 
Mayayana

| because that feature relies, to a large extent, on the
| "honor system". That no software writer will turn it off.
|

Interesting, isn't it, that everyone's all for honor....
unless you're willing to pay $100. Then you have a
right to access any PDF. :)
 
jim

When webpages and pdf files don't permit selecting and copying text,
is there a way around that? Not counting print screen, which I use
sometimes, but won't help when the place I want to copy to only allows
text.

Thanks


There are a few things that 'sometimes work'. One of them is that
sometimes you can do a CTRL A and select the entire page, then copy it
with CTRL C, though you will then have to snip away the parts you don't
need when you paste it.

jim
 
jim

On Thu, 8 Mar 2012 09:50:15 -0500, in microsoft.public.windowsxp.general,
| because that feature relies, to a large extent, on the
| "honor system". That no software writer will turn it off.
|

Interesting, isn't it, that everyone's all for honor....
unless you're willing to pay $100. Then you have a
right to access any PDF. :)

I recently had someone who wanted to copy from a historical opinion file
-- author dead, etc., and i really saw no reason why it was
'copy-protected'. The first thought i had was that the options were
simply a set of flags and so i looked for the header format for a pdf
file, thinking it would give the location there. I had a hex change in
mind. I never found the format but in the looking, realized that even if
i were writing it I would prevent that type of change by a simple hash
check on the field or something..........

jim
 
Ken Blake, MVP

There are a few things that 'sometimes work'. One of them is that
sometimes you can do a CTRL A and select the entire page, then copy it
with CTRL C,



I can only speak from my own experience, but I can not remember a time
when Ctrl-A didn't work (and I use it a lot).

Ken Blake, Microsoft MVP
 
micky

Good idea. I'll try it. (I should have thought of that. :-( )
Good to hear!
Ctrl-A can always copy just the text, and not the image? I'll have to try
that again sometime. I have used Ctrl-A to copy all the text in a text
document, but don't recall trying it on a PDF file.

Me neither, but I'll try it there too.

And for once, I remember which pages and files do this. (In a few
days, I may post the url if this doesn't work.)
 
Paul in Houston TX

micky said:
When webpages and pdf files don't permit selecting and copying text,
is there a way around that? Not counting print screen, which I use
sometimes, but won't help when the place I want to copy to only allows
text.

Thanks

Web pages and PDFs are two different things.
The PDFs are likely pictures of print rather than
actual text. Your best bet is OCR.
Irfan has one that works somewhat. There are others.
If you have a scanner, it should have a separate OCR ability.
Web pages are often set to no copy, no anything.
Print screen and OCR them.
 
micky

Good idea. I'll try it. (I should have thought of that. :-( )

Good to hear!

Me neither, but I'll try it there too.

And for once, I remember which pages and files do this. (In a few
days, I may post the url if this doesn't work.)

It didn't work! This is the webpage:

http://www.tropicana.com/#/trop_products/productsLanding.swf?TropicanaPurePremium

I wanted to get the 3 lines of black text above the 5 bottles***.


On a pdf file, it lets me select all -- it's even in the drop down
menu --, but it doesn't let me copy it/paste it. And copy is not in
the drop down menu. It will take me a while to find the url for this
file from a stock broker, because now I'm working with a downloaded
copy. I plan to post again.


***I want to be able to do this if possible regardless, but FYI the
immediate cause was:

I haven't finished investigating this brand, and it's just the first one I
thought of, but I learned recently that some or most orange juice
labeled "not from concentrate" (probably mass-market brands
especially, not expensive boutique brands) may never have been
concentrated, yet is made from juice stored for up to a year after the
oxygen has been removed from it, etc.

http://www2.macleans.ca/2009/05/19/fresh-from-the-press/#more-57383
This URL is almost 3 years old, but I heard this on the radio(?) news
as if it were recent.

http://civileats.com/2009/05/06/freshly-squeezed-the-truth-about-orange-juice-in-boxes/
This is from the same month, but may only be about juice in boxes.

http://articles.mercola.com/sites/a...ificially-flavored-to-taste-like-oranges.aspx
This one is from last August, though I don't trust "health" companies
that advertise.
 
Paul

micky said:
It didn't work! This is the webpage:

http://www.tropicana.com/#/trop_products/productsLanding.swf?TropicanaPurePremium

I wanted to get the 3 lines of black text above the 5 bottles***.

I have two web browsers. One with Adobe Flash installed, and one without.
The "5 bottles" only appear in the Flash based version of the webpage.
The text in this case, is in a Flash image, and is not text "you can wipe over".

The non-Flash equipped browser, shows a quite different page. I was
able to copy this text from the non-Flash page. The page is
entirely different, with different text. I got this via copy/paste
of the non-Flash page (with anything needing Unicode, removed).

"We're committed to using the best fruit to give you the great tasting juices
you love and the nutrition your body needs. Each 59oz container of Tropicana
Pure Premium has 16 fresh-picked oranges squeezed into it and an 8oz glass
gives you 100% vitamin C to help you maintain a healthy immune system."

The claim of 100% vitamin C, I guess that means your glass is filled to the
rim with dried Ascorbic Acid crystals :) Linus Pauling would be overjoyed.

http://en.wikipedia.org/wiki/Ascorbic_acid

On a pdf file, it lets me select all -- it's even in the drop down
menu --, but it doesn't let me copy it/paste it. And copy is not in
the drop down menu. It will take me a while to find the url for this
file from a stock broker, because now I'm working with a downloaded
copy. I plan to post again.

When you have a PDF to play with, post the URL, so we can try our
own bags of tricks :) I haven't tried "copy busting" in a while.

Paul
 
jim

On Fri, 09 Mar 2012 00:21:32 -0500, in
When you have a PDF to play with, post the URL, so we can try our
own bags of tricks :) I haven't tried "copy busting" in a while.

Paul


I recently tried to bust one for a friend -- trying three different
utilities, one was XPDF using command line options in pdftotext.exe which
*claimed* to bypass restrictions and failed spectacularly (i may have
done it wrong). I have a query out for that URL now and will post it
if/when i get it.

jim

 
Paul

jim said:
On Fri, 09 Mar 2012 00:21:32 -0500, in



I recently tried to bust one for a friend -- trying three different
utilities, one was XPDF using command line options in pdftotext.exe which
*claimed* to bypass restrictions and failed spectacularly (i may have
done it wrong). I have a query out for that URL now and will post it
if/when i get it.

jim

If you do "properties", take a look at the security settings,
while viewing the document in Acrobat Reader. Perhaps there
is something there to explain why it can't be busted. Maybe
Adobe had enough time to re-think how to fix the "honor" system...

Paul
 
Mayayana

| I recently tried to bust one for a friend -- trying three different
| utilities, one was XPDF using command line options in pdftotext.exe which
| *claimed* to bypass restrictions and failed spectacularly

I wrote about that in my earlier post. XPDF claims
no such thing. In fact, the author has specifically
written an explanation saying that he doesn't feel
right about bypassing restrictions.

http://www.foolabs.com/xpdf/cracking.html

XPDF is also outdated, and never worked all that
well in the first place. It actually only requires a
very small edit to make pdftotext.exe ignore restrictions:

In pdftotext.c one just needs to comment out the
permission check:

    // check for copy permission
    /*
    if (!doc->okToCopy()) {
      error(-1, "Copying of text from this document is not allowed.");
      exitCode = 3;
      goto err2;
    }
    */

Unfortunately, one also needs to be capable of
recompiling the software.

I looked around, at one point, for a program that
ignores restrictions and found that it seems to be
mostly a commercial thing. If you don't mind paying,
you can have the functionality. But for some reason
the OSS people "respect" the design of PDFs, which is
unfortunate since, as Paul said, most copy-protected
PDFs seem to be that way simply because the author
wasn't paying attention to the settings.
 
jim

On Sat, 10 Mar 2012 08:23:18 -0500, in
On Fri, 09 Mar 2012 00:21:32 -0500, in



I recently tried to bust one for a friend -- trying three different
utilities, one was XPDF using command line options in pdftotext.exe which
*claimed* to bypass restrictions and failed spectacularly (i may have
done it wrong). I have a query out for that URL now and will post it
if/when i get it.

jim


Looks like this is it:
http://rapidlibrary.com/files/2discoverislam-com-riyad-us-saliheen-pdf_ulfxvqy9wwi89on.html

jim
 
jim

On Sat, 10 Mar 2012 09:08:36 -0500, in
If you do "properties", take a look at the security settings,
while viewing the document in Acrobat Reader. Perhaps there
is something there to explain why it can't be busted. Maybe
Adobe had enough time to re-think how to fix the "honor" system...

Paul


He is a savvy fellow who told me he was using printscreen and then an OCR
app., so i dropped it. I know he would prefer to do "select text" and
"copy to clipboard".

jim
 
jim

On Sat, 10 Mar 2012 09:09:33 -0500, in
| I recently tried to bust one for a friend -- trying three different
| utilities, one was XPDF using command line options in pdftotext.exe which
| *claimed* to bypass restrictions and failed spectacularly

I wrote about that in my earlier post. XPDF claims
no such thing. In fact, the author has specifically
written an explanation saying that he doesn't feel
right about bypassing restrictions.

http://www.foolabs.com/xpdf/cracking.html

XPDF is also outdated, and never worked all that
well in the first place. It actually only requires a
very small edit to make pdftotext.exe ignore restrictions:

In pdftotext.c one just needs to comment out the
permission check:

    // check for copy permission
    /*
    if (!doc->okToCopy()) {
      error(-1, "Copying of text from this document is not allowed.");
      exitCode = 3;
      goto err2;
    }
    */

Unfortunately, one also needs to be capable of
recompiling the software.

Well, that, and one would need the source code -- or be very clever in
locating the command within the compiled code, then nulling it, etc...

As far as being outdated, yes of course it is, but I am a fan of
"outdated" since that usually means that you are going to bypass
GUI-cleverness.

jim
 
Paul

jim said:
He is a savvy fellow who told me he was using printscreen and then an OCR
app., so i dropped it. I know he would prefer to do "select text" and
"copy to clipboard".

jim

It's looking like some kind of font encoding problem, rather than
copy prevention. Still working on it...

filename = 2discoverislam_com_riyad_us_saliheen.pdf
type = PDF 1.4
size = 5638888 bytes
md5sum = df45ea78241da54c928ba8b91c94c59e
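
Anyone who wants to confirm they are looking at the same file can compare its size and MD5 checksum against the values above. Here is a minimal Python sketch using only the standard library. The function name is just for illustration, and the path is wherever you saved the download.

```python
import hashlib
import os


def file_fingerprint(path, chunk_size=65536):
    """Return (size in bytes, MD5 hex digest) for the file at path."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so a large PDF does not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return os.path.getsize(path), md5.hexdigest()
```

If file_fingerprint("2discoverislam_com_riyad_us_saliheen.pdf") returns (5638888, "df45ea78241da54c928ba8b91c94c59e"), you have the same bytes described here. MD5 is fine for spotting a different or damaged download, though it is not something to rely on against deliberate tampering.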

Paul
 
jim

On Sat, 10 Mar 2012 10:35:01 -0500, in
| As far as being outdated, yes of course it is, but I am a fan of
| "outdated" since that usually means that you are going to bypass
| GUI-cleverness.
|

I don't see the connection.

That's OK. ;-)
 
micky

I have two web browsers. One with Adobe Flash installed, and one without.
The "5 bottles" only appear in the Flash based version of the webpage.
The text in this case, is in a Flash image, and is not text "you can wipe over".

The non-Flash equipped browser, shows a quite different page. I was
able to copy this text from the non-Flash page. The page is
entirely different, with different text. I got this via copy/paste
of the non-Flash page (with anything needing Unicode, removed).

Wow, I got a technical answer from you, and the text I wanted too!! I
don't suppose I can call you every time I want to copy text. No,
probably not.
"We're committed to using the best fruit to give you the great tasting juices
you love and the nutrition your body needs. Each 59oz container of Tropicana
Pure Premium has 16 fresh-picked oranges squeezed into it and an 8oz glass
gives you 100% vitamin C to help you maintain a healthy immune system."

I have to check if Tropicana is one of those that can keep
fresh-picked squeezed oranges in a vat for months. Haven't had time.
I guess that would mean they are picked when they're fresh, not that
they're sold when they are.
The claim of 100% vitamin C, I guess that means your glass is filled to the
rim with dried Ascorbic Acid crystals :) Linus Pauling would be overjoyed.

He deserves it.
http://en.wikipedia.org/wiki/Ascorbic_acid


When you have a PDF to play with, post the URL, so we can try our
own bags of tricks :) I haven't tried "copy busting" in a while.

Okay.

Here it is. I found this on the web under a shorter, more sensible
name, but this is the same file:
http://fa.morganstanleyindividual.c...V265/e5ce4dc0-c4c1-4f27-a1df-37e47b92e0b3.pdf

When you download this, it has an entry in the Edit drop-down list
for Select All, but the entries for Copy and Cut are greyed out, and
Ctrl-C doesn't work either.
 
