The Dark Mod Forums
Anderson

Automatic cropping of margins for pdf documents

Hi, knowing that this is a fairly tech-savvy forum, I wanted to share a struggle I'm having with some documents, in case someone knows how to solve it.

For the past few days I've been looking for a way to turn a large number of pages (1850) into easily readable text, ready for instant printing and easy to email in portions. The pages are photos I took of some text files, but the margins are not consistent from page to page. So I need an application that detects the real margins of the text and automatically crops each page to them.

I have tried ABBYY FineReader 11, Adobe Acrobat Reader DC, Adobe Photoshop, Scan Tailor and AVS Image Converter. None of them seems to have an option that properly puts these files in order. I need the margins removed so that only the text remains for printing, and I need it done quickly and automatically. I do not care about editable or searchable text, so Optical Character Recognition features are not needed.

TL;DR: I need an app like TinyScanner for phones https://play.google.com/store/apps/details?id=com.appxy.tinyscanner&hl=en . However, I did not have time to wait for it to process 1850 pages in one day, and the app occasionally crashed while processing files. Still, I'm baffled that I cannot find the features it has in any PC software!

If anyone has any clue where to look, please let me know.

Thanks in advance.

"I really perceive that vanity about which most men merely prate — the vanity of the human or temporal life. I live continually in a reverie of the future. I have no faith in human perfectibility. I think that human exertion will have no appreciable effect upon humanity. Man is now only more active — not more happy — nor more wise, than he was 6000 years ago. The result will never vary — and to suppose that it will, is to suppose that the foregone man has lived in vain — that the foregone time is but the rudiment of the future — that the myriads who have perished have not been upon equal footing with ourselves — nor are we with our posterity. I cannot agree to lose sight of man the individual, in man the mass."...

- 2 July 1844 letter to James Russell Lowell from Edgar Allan Poe.

On 12/12/2019 at 2:26 AM, SuaveSteve said:

Do you have any scripting or programming experience?

None whatsoever.

It's going to take a bit of work, but I'll explain how I would approach and solve this.

First, you are going to need some basic knowledge of the Windows command line and batch scripting. This lets you mix and match the various command line tools out there to do what you want, and automate it at the same time.

Google is your friend, but here are some jumping-off points:

https://www.bleepingcomputer.com/tutorials/windows-command-prompt-introduction/

https://www.csie.ntu.edu.tw/~r92092/ref/win32/win32scripting.html

The knowledge is reusable elsewhere and very powerful. You do not have to achieve expert status, just be comfortable enough to execute commands and manipulate their output. Think "I want to convert all the PNGs in this folder to JPEG, but also crop and sharpen them".
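
To make the "convert, crop and sharpen every PNG in a folder" idea concrete, here is a minimal sketch. It is written in Python rather than batch, and it assumes ImageMagick 7 is installed and reachable as `magick` (on ImageMagick 6 the command is `convert`). The function names, the 10% fuzz value and the 0x1 sharpen radius are only illustrative choices, not a tested recipe for these particular photos.

```python
from pathlib import Path
import subprocess


def build_commands(src_dir, dst_dir, fuzz="10%"):
    """Return one ImageMagick argument list per PNG found in src_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    commands = []
    for png in sorted(src.glob("*.png")):
        out = dst / (png.stem + ".jpg")
        commands.append([
            "magick", str(png),
            "-fuzz", fuzz,        # treat near-identical border colours as equal
            "-trim", "+repage",   # cut away the uniform border, reset the canvas
            "-sharpen", "0x1",    # mild sharpening
            str(out),             # the .jpg extension selects JPEG output
        ])
    return commands


def run_all(commands):
    """Run each command in turn and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Something like `run_all(build_commands("photos", "cropped"))` would then process the whole folder in one go (both folder names are made up here, and the output folder has to exist already). Note that `-trim` only removes borders of a fairly uniform colour, so it may not be enough for photographed pages.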

Once you're comfortable with that, there are a number of tools that can solve your problem.

Basically, you want to identify the text on the page so that you can see where it is not (the margins). Tesseract is an open-source OCR solution.

https://github.com/tesseract-ocr/tesseract/wiki/FAQ

Once you identify the bounds of the text, you can then use another tool to do the cropping.

https://imagemagick.org/script/command-line-processing.php
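
As a rough illustration of how those two steps could connect: as far as I know, running `tesseract page.png page tsv` writes a tab-separated file with one row per detected element, including `left`, `top`, `width`, `height` and `conf` columns. The sketch below (Python, with made-up function names) takes the union of all word boxes and turns it into a geometry string of the kind ImageMagick's `-crop` option accepts. Treat it as a starting point under those assumptions, not a finished tool.

```python
import csv
import io


def text_bounds(tsv_text, min_conf=0.0):
    """Union box (left, top, right, bottom) of recognised words, or None."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    box = None
    for row in reader:
        # Level 5 rows are single words; skip other levels and empty text.
        if row["level"] != "5" or not (row["text"] or "").strip():
            continue
        # Skip words the OCR engine is not confident about.
        if float(row["conf"]) < min_conf:
            continue
        left, top = int(row["left"]), int(row["top"])
        right, bottom = left + int(row["width"]), top + int(row["height"])
        if box is None:
            box = (left, top, right, bottom)
        else:
            box = (min(box[0], left), min(box[1], top),
                   max(box[2], right), max(box[3], bottom))
    return box


def crop_geometry(box, pad=0):
    """Format a box as WIDTHxHEIGHT+X+Y, widened by pad pixels per side."""
    left, top, right, bottom = box
    left, top = max(left - pad, 0), max(top - pad, 0)
    return f"{right + pad - left}x{bottom + pad - top}+{left}+{top}"
```

The resulting string would be passed along the lines of `magick page.png -crop 320x80+90+190 +repage cropped.png`. Raising `min_conf` is one way to ignore specks and shadows that were misread as text. If the padding reaches past the page edge, ImageMagick should simply clip the crop to the image.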

Oh, and since I suppose your PDFs are full of just images, you can use another tool (like Apache's PDFBox) to extract them (to do the above) and perhaps make a new PDF afterwards if you want.

https://pdfbox.apache.org/1.8/commandline.html#extractimages

It might not be clear how to connect these tools, but the basic scripting knowledge will make it clear.

Good luck.

1 hour ago, SuaveSteve said:

It's going to take a bit of work, but I'll explain how I would approach and solve this. [...]

Thank you! I'll try this as soon as possible.

It's OK for me to do this manually as well. It is somewhat amazing that an amateur cell phone app can do it while nothing similar can be found on PCs.

What's important is that I will not have access to these documents forever, so I want to be sure the images are saved; then I can work with them in peace on the PC at home or at the office. The TinyScanner app has the problem of not saving files that are not yet processed if it crashes.
