PDF file encodings

JohnThompsonJTSoftware939 · Post by **JohnThompsonJTSoftware939** » May 25th, 2012 5:42 am

I can't extract or copy/paste text from the PDFs. The results are non-displayable characters. Why is this?

For example, 002_B2_082007_kclass101_lesson.pdf.

Can you provide the CMaps needed?

Thanks.

-John

trutherous · Post by **trutherous** » May 26th, 2012 8:03 am

There just seems to be no end to the pdf problems. Have you tried the "lite" pdfs? Do they give the same problem? http://www.koreanclass101.com/forum/vie ... php?t=2984

JohnThompsonJTSoftware939 · Post by **JohnThompsonJTSoftware939** » May 27th, 2012 5:16 am

Thank you so much for the suggestion. I took a quick look at the "lite" version of beginner lesson 2, and it works fine for copying and pasting, and for extracting text. Ironically, the "lite" versions are much bigger files (perhaps they embed more font information?), but I don't mind, as long as I can get the text out of them.

Having dug into it a little more, apparently you can create PDFs that store glyph IDs from the embedded font instead of standard character codes, so the file contains no usable text at all unless a special character map (CMap) is available to convert those IDs back to characters. Some text extraction tools (Adobe's or otherwise) actually fall back on optical character recognition, looking at the rendered glyphs to convert them back to text. It is just unbelievable that Adobe would do something so short-sighted. I wish they would just use UTF-8 encodings and be done with it.
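To make the glyph-ID idea concrete, here is a rough Python sketch of what a CMap lookup does. It assumes the PDF carries a ToUnicode-style CMap with `bfchar` and `bfrange` sections and that every code is two bytes wide. The sample CMap and the names `SAMPLE_CMAP`, `parse_tounicode` and `decode_hex_string` are made up for illustration and are not taken from the lesson PDFs. It also skips the array form of `bfrange`, so treat it as a sketch rather than a real extractor.

```python
import re

# Hypothetical ToUnicode CMap fragment (illustrative only, not from a real lesson PDF).
SAMPLE_CMAP = """
2 beginbfchar
<0003> <0020>
<0121> <D55C>
endbfchar
1 beginbfrange
<0200> <0202> <AC00>
endbfrange
"""

HEX = r"<([0-9A-Fa-f]+)>"


def parse_tounicode(cmap_text):
    """Build a {code: unicode string} table from bfchar and bfrange sections."""
    table = {}
    # bfchar: one source code maps to one UTF-16BE encoded destination string.
    for body in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, body):
            table[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    # bfrange: consecutive codes map to consecutive destinations
    # (simplification: the [<..> <..>] array form is not handled).
    for body in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, body):
            first = bytes.fromhex(dst).decode("utf-16-be")
            prefix, last = first[:-1], ord(first[-1])
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                table[code] = prefix + chr(last + offset)
    return table


def decode_hex_string(hex_string, table, code_bytes=2):
    """Translate a hex string of fixed-width codes into text via the table."""
    raw = bytes.fromhex(hex_string)
    chars = []
    for i in range(0, len(raw), code_bytes):
        code = int.from_bytes(raw[i:i + code_bytes], "big")
        # Codes missing from the CMap become U+FFFD, the replacement character.
        chars.append(table.get(code, "\ufffd"))
    return "".join(chars)


table = parse_tounicode(SAMPLE_CMAP)
text = decode_hex_string("012100030200", table)
```

Without the table, the bytes `0121 0003 0200` mean nothing outside the font, which would explain the garbage when copying and pasting. With it, they come out as ordinary Unicode characters.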

Thanks again!

-John

jaehwi · Post by **jaehwi** » June 4th, 2012 4:45 am

Hi JohnThompsonJTSoftware939,

We are glad that the lite versions of the .pdf files work OK for you! Thank you, "trutherous", for your help too :wink:

In case you have any other problems, please let us know!

Stefania/KoreanClass101.com