Python script that converts photos of lecture PowerPoint slides into .docx text notes.

LecturePptOcr

LecturePptOcr is a Python tool that extracts text from lecture slide images using a vision-capable LLM provider and writes the extracted content into a .docx document.

It is designed for workflows where lecture slides, screenshots, or photos are saved as image files and need to be converted into readable notes.

Features

  • Extracts text from image files using LLM-based OCR
  • Caches OCR results to avoid re-processing the same images
  • Supports single image files or directories of images
  • Recursively scans folders for supported image formats
  • Writes extracted text into a .docx file
  • Supports local and remote OCR through Ollama vision models (a remote setup requires the host to be set in config/config.py)
  • Supports Gemini as a provider (requires an API key in config/config.py)
  • Supports the following image formats:
    • .jpg
    • .jpeg
    • .heic

Requirements

  • Python 3.9+
  • Ollama installed locally if using Ollama as the LLM provider (set llm_provider in config/config.py)
  • A Google AI Studio API key with billing set up if using Gemini as the LLM provider (set llm_provider in config/config.py)
  • A vision-capable model, for example qwen3-vl:4b with Ollama or gemini-2.5-flash with Gemini
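
As a rough idea of what the provider settings might look like, here is a hypothetical config/config.py. Only the llm_provider name and the two model names appear in this README; every other setting name, the provider values "ollama" and "gemini", and the GEMINI_API_KEY environment variable are assumptions, so check the real file before copying anything.

```python
# Hypothetical sketch of config/config.py.
# Only llm_provider is named in this README; all other names are assumptions.
import os

# Which provider performs OCR. The accepted values are assumed to be
# "ollama" and "gemini".
llm_provider = "ollama"

# Ollama settings (assumed names). http://localhost:11434 is Ollama's
# default local address; point this at another machine for remote OCR.
ollama_host = "http://localhost:11434"
ollama_model = "qwen3-vl:4b"

# Gemini settings (assumed names). Reading the key from the environment
# avoids committing it to the repository.
gemini_api_key = os.environ.get("GEMINI_API_KEY", "")
gemini_model = "gemini-2.5-flash"
```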

Python Dependencies

Install the required packages inside your virtual environment:

    pip install -r requirements.txt

Usage

Run the script from the project root:

    python main.py -i "resources/" -o "lecture_notes.docx"

Arguments

Argument          Short  Required  Description
--input-img-dir   -i     Yes       Path to an image file or a folder containing images
--output-file     -o     Yes       Path to the output .docx file
--dry-run         None   No        Dry-run mode: no files are written and no OCR is performed
--debug           -d     No        Enable debug-level logging
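
For readers who want to see how these arguments map onto Python's argparse, here is a minimal sketch. The flag names come from the table above; the function name build_parser and the help texts are illustrative, and the real main.py may be organized differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented arguments (illustrative sketch)."""
    parser = argparse.ArgumentParser(
        description="Convert lecture slide images into .docx notes."
    )
    parser.add_argument(
        "-i", "--input-img-dir", required=True,
        help="Path to an image file or a folder containing images",
    )
    parser.add_argument(
        "-o", "--output-file", required=True,
        help="Path to the output .docx file",
    )
    # --dry-run has no short form in the table above.
    parser.add_argument(
        "--dry-run", action="store_true",
        help="Do not write files or run OCR",
    )
    parser.add_argument(
        "-d", "--debug", action="store_true",
        help="Enable debug-level logging",
    )
    return parser
```

With a parser like this, the example invocation python main.py -i "resources/" -o "lecture_notes.docx" yields input_img_dir set to "resources/", output_file set to "lecture_notes.docx", and both dry_run and debug left at False.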

Running the example command from the Usage section will:

  1. Scan the resources folder for supported images.
  2. Sort the images into subfolders based on the date they were taken (oldest to most recent).
  3. Send each image to the configured LLM provider for OCR, unless the OCR cache already holds the text for that image from a previous run.
  4. Extract the text from the images.
  5. Store the OCR results in the OCR cache database for future runs.
  6. Write the result to lecture_notes.docx.
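
The cache check in steps 3 and 5 can be pictured with the sketch below. It assumes a SQLite table keyed by the SHA-256 hash of the image bytes, which is a guess for illustration: this README does not say which database or key the project uses. The names open_cache, image_key, ocr_with_cache, and the run_ocr callback are all hypothetical.

```python
import hashlib
import sqlite3


def open_cache(db_path=":memory:"):
    """Open (or create) a hypothetical OCR cache stored in SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ocr_cache ("
        "image_hash TEXT PRIMARY KEY, text TEXT NOT NULL)"
    )
    return conn


def image_key(image_bytes):
    """Identify an image by the SHA-256 hash of its content (an assumption)."""
    return hashlib.sha256(image_bytes).hexdigest()


def ocr_with_cache(conn, image_bytes, run_ocr):
    """Return cached text if present; otherwise call run_ocr and store the result."""
    key = image_key(image_bytes)
    row = conn.execute(
        "SELECT text FROM ocr_cache WHERE image_hash = ?", (key,)
    ).fetchone()
    if row is not None:
        return row[0]  # cache hit: the LLM provider is not called
    text = run_ocr(image_bytes)  # cache miss: ask the provider
    conn.execute(
        "INSERT INTO ocr_cache (image_hash, text) VALUES (?, ?)", (key, text)
    )
    conn.commit()
    return text
```

Hashing the content rather than the file path means a renamed or moved photo would still hit the cache in this sketch, which matters when images are re-sorted into date subfolders.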

Notes

  • The output file must use the .docx extension.
  • Images are processed recursively when a directory is provided.
  • Only text is extracted from the images; graphs and pictures are not captured, and formatting is not preserved.
  • Existing .docx files are loaded and appended to instead of being overwritten.
  • For best results, use clear images with readable text and minimal blur.

Cost concerns for external LLM providers

When using an external LLM service (such as Gemini or Ollama), the free tier may not be feasible, depending on the number and size of the images: it is slow and you will hit rate limits very quickly. In that case, it is best to use a paid service and select the model manually.

It is best to use a low-cost model with good performance. Simply reading text from images does not require top-of-the-line reasoning capabilities.

With that in mind, I personally recommend either of the following models:

  • Ollama: qwen3-vl:4b
  • Gemini: gemini-2.5-flash
    • costs around $0.00058 per 4k image
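
To get a feel for what the per-image figure above means for a whole course, a back-of-the-envelope estimate can be written as below. The function name is illustrative, the rate is the approximate figure quoted above, and the sketch ignores cached images (which would not be sent to the provider again) as well as any price changes.

```python
# Approximate Gemini cost per 4k image, taken from the figure quoted above.
COST_PER_IMAGE_USD = 0.00058


def estimate_cost(image_count, cost_per_image=COST_PER_IMAGE_USD):
    """Rough OCR cost in USD for image_count uncached images."""
    return image_count * cost_per_image


# Example: a semester with 1,000 slide photos.
print(f"${estimate_cost(1000):.2f}")
```

At this rate, 1,000 images come to roughly $0.58, so the cost is mainly a concern for very large batches.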