Nanowrimo 2023 Underway

It’s that time of year when Tammy and I unleash our inner authors. We’re four days in, and I have to admit that despite having October 30th and 31st off this year, I was woefully unprepared on November 1st. Instead of hitting the ground running… er, typing, I spent a chunk of the day just setting things up so that I could type. I hadn’t even brought my typewriters up from the basement yet, whereas normally I do a few typing sessions in October to limber up the fingers and the imagination.

On the afternoon of the first, I discovered that the optical character recognition (OCR) website I’ve used for years to turn pictures of typewritten pages into text had introduced a new user-hostile “feature” to encourage people to sign up for a paid account. In years past, there was a limit of 15 pages scanned per hour (no problem, I can’t type that fast), but this year it dropped to 5. That alone was still not a big problem, but now, when you press the “Convert” button, a 10-second timer runs before the conversion starts. Not the biggest issue in the world, but when OCR’ing the pages is already the most annoying part of the night, it was certainly not welcome.

It was Tammy who suggested that I try scripting something. Back in 2009, when I first got a typewriter for Nano, I tried running the OCR software that came with my scanner, and when that was horrible, I tried an open source tool called Tesseract. That also failed. The recognition was so bad that the output text was pure garbage. I counted words by hand that year. The next year I found the OCR website, and that was enough to get my analog input into the digital realm, so I used it for the next 12 years. But the website changes were enough to drive me to do something different.

I already had a macOS Shortcut that took a scanned multi-page PDF and exported each page as a numbered JPEG (I set that up last year). I took that and added a step where each JPEG is run through Tesseract, with the output written to a text file and then appended to my growing novel file.

Let me just say that it was a revelation. Not only was Tesseract now good (better even than the website), it was fast and local, and I now go from PDF to text in seconds.

What is going on here?

  1. Right-click on the PDF file from the scanner. It has four scanned pages inside. The script is launched.
  2. The script prompts for the page number of the first page so it can number the images in sequence (e.g. 15, 16, 17, 18).
  3. For each page:
    1. Export the image (e.g. 015.jpeg)
    2. Run the image through Tesseract to create a text file of the words (e.g. 015.txt)
    3. Concatenate the words into the main novel text file.
    4. Delete the text file (e.g. 015.txt)
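
For anyone who would rather do this outside of Shortcuts, here is a rough Python sketch of the loop in step 3. It is not the Shortcut itself, just one way the same steps could look. It assumes the page images have already been exported from the PDF (that part is left out), that the `tesseract` command is installed and on the PATH, and the names `run_tesseract` and `process_pages` are made up for illustration.

```python
import subprocess
from pathlib import Path


def run_tesseract(image: Path, text_file: Path) -> None:
    """OCR one image with the Tesseract command-line tool.

    Tesseract adds ".txt" to the output base on its own, so the
    path is passed without its suffix (015.txt -> 015).
    """
    subprocess.run(
        ["tesseract", str(image), str(text_file.with_suffix(""))],
        check=True,
    )


def process_pages(images, first_page, novel, ocr=run_tesseract):
    """Number each page image, OCR it, and append the text to the novel.

    `images` must be in page order. `ocr` is swappable so the loop can be
    tried without Tesseract installed (an assumption of this sketch).
    """
    numbered_pages = []
    for number, image in enumerate(images, start=first_page):
        # Step 1: give the image its page number, e.g. 015.jpeg.
        # Note: rename would overwrite an existing file of that name.
        numbered = image.with_name(f"{number:03d}{image.suffix}")
        if numbered != image:
            image.rename(numbered)

        # Step 2: OCR the image into a matching text file, e.g. 015.txt.
        text_file = numbered.with_suffix(".txt")
        ocr(numbered, text_file)

        # Step 3: append the recognized words to the main novel file.
        with novel.open("a", encoding="utf-8") as out:
            out.write(text_file.read_text(encoding="utf-8"))

        # Step 4: delete the per-page text file.
        text_file.unlink()
        numbered_pages.append(numbered)
    return numbered_pages


# Hypothetical usage, with file names invented for the example:
# process_pages([Path("scan-1.jpeg"), Path("scan-2.jpeg")], 15, Path("novel.txt"))
```

The numbered JPEGs are kept around on purpose, so a page can be re-run later if the recognition goes wrong; only the temporary text files are removed.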