Python script that converts photos of lecture PowerPoint slides into .docx text notes.


blackruby 7fecb5c11f add debug logging for showing generated image queue		2026-05-03 23:30:07 +02:00
config	add config file to configure llm provider used for ocr	2026-05-03 19:06:18 +02:00
src	add debug logging for showing generated image queue	2026-05-03 23:30:07 +02:00
main.py	add debug argument	2026-05-03 23:25:19 +02:00
README.md	update arg list	2026-05-03 23:27:46 +02:00
requirements.txt	add pillow and piexif dependencies for image processing	2026-05-03 23:26:09 +02:00

README.md

LecturePptOcr

LecturePptOcr is a Python tool that extracts text from lecture slide images using a vision-capable LLM provider and writes the extracted content into a .docx document.

It is designed for workflows where lecture slides, screenshots, or photos are saved as image files and need to be converted into readable notes.

Features

Extracts text from image files using LLM-based OCR
Caches OCR results to avoid re-processing the same images
Supports single image files or directories of images
Recursively scans folders for supported image formats
Writes extracted text into a .docx file
Supports local and remote OCR through Ollama vision models (a remote setup requires the host to be set in config/config.py)
Includes Gemini provider support (an API key is required in config/config.py)
Supports the following image formats:
- .jpg
- .jpeg
- .heic
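
As an illustration of the recursive scan described above, here is a minimal sketch using only the standard library. The function name `find_images` and the case-insensitive extension matching are assumptions made for this example; the actual code in src may work differently.

```python
from pathlib import Path

# Extensions listed in this README. Case-insensitive matching is an assumption.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".heic"}


def find_images(input_path):
    """Return supported image files for a single file or a directory tree.

    Hypothetical helper: accepts either one image file or a folder that is
    scanned recursively. Results are sorted by path for a stable order.
    """
    path = Path(input_path)
    if path.is_file():
        # A single file is returned only if its extension is supported.
        return [path] if path.suffix.lower() in SUPPORTED_EXTENSIONS else []
    # rglob("*") walks the directory and all of its subdirectories.
    return sorted(
        p for p in path.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Calling `find_images("resources/")` would return every .jpg, .jpeg, and .heic file below that folder, while unsupported files are skipped.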

Requirements

Python 3.9+
Ollama installed locally if using Ollama as the LLM provider (set llm_provider in config/config.py)
Google AI Studio API key and billing setup if using Gemini as the LLM provider (set llm_provider in config/config.py)
A vision-capable model, for example qwen3-vl:4b with Ollama or gemini-2.5-flash with Gemini

Python Dependencies

Install the required packages inside your virtual environment: pip install -r requirements.txt

Usage

Run the script from the project root: python main.py -i "resources/" -o "lecture_notes.docx"

Arguments

Argument	Short	Required	Description
`--input-img-dir`	`-i`	Yes	Path to an image file or a folder containing images
`--output-file`	`-o`	Yes	Path to the output `.docx` file
`--dry-run`	None	No	Dry run mode: no files are written and no OCR is performed
`--debug`	`-d`	No	Enable debug level logging
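
For reference, the argument table could be expressed with `argparse` roughly as follows. This is a sketch, not the parser in main.py: the flag names come from the table, while treating `--dry-run` and `--debug` as boolean switches and rejecting a non-.docx output path at parse time are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented arguments (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Extract text from lecture slide images into a .docx file."
    )
    parser.add_argument("-i", "--input-img-dir", required=True,
                        help="Path to an image file or a folder containing images")
    parser.add_argument("-o", "--output-file", required=True,
                        help="Path to the output .docx file")
    # Assumed to be simple on/off switches.
    parser.add_argument("--dry-run", action="store_true",
                        help="No file writing or OCR scanning")
    parser.add_argument("-d", "--debug", action="store_true",
                        help="Enable debug level logging")
    return parser


def parse_args(argv=None):
    """Parse arguments and enforce the .docx extension on the output path."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not args.output_file.lower().endswith(".docx"):
        # parser.error prints a usage message and exits with status 2.
        parser.error("--output-file must use the .docx extension")
    return args
```

With this sketch, `parse_args(["-i", "resources/", "-o", "lecture_notes.docx"])` yields `args.input_img_dir`, `args.output_file`, `args.dry_run`, and `args.debug`.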

The example command in the Usage section will:

Scan the resources folder for supported images
Sort the images into subfolders based on the date they were taken (oldest to most recent)
Send each image to the configured LLM provider for OCR, unless the OCR cache already holds the text for that specific image from a previous run
Extract text from the images
Cache the OCR results in the OCR cache database for future runs
Write the result to lecture_notes.docx
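
The cache check in the steps above can be sketched as follows. This README does not say how the cache is keyed or stored, so the SHA-256 content hash, the SQLite table, and the names `open_cache` and `ocr_with_cache` are all assumptions; `run_ocr` stands in for whichever provider call is configured.

```python
import hashlib
import sqlite3


def open_cache(db_path=":memory:"):
    """Open (or create) a hypothetical OCR cache stored in SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ocr_cache ("
        "image_hash TEXT PRIMARY KEY, text TEXT NOT NULL)"
    )
    return conn


def ocr_with_cache(conn, image_bytes, run_ocr):
    """Return cached text for an image, calling run_ocr only on a cache miss."""
    # Keying on the file contents (an assumption) means renamed or moved
    # images still hit the cache.
    key = hashlib.sha256(image_bytes).hexdigest()
    row = conn.execute(
        "SELECT text FROM ocr_cache WHERE image_hash = ?", (key,)
    ).fetchone()
    if row is not None:
        return row[0]
    text = run_ocr(image_bytes)  # placeholder for the LLM provider call
    conn.execute(
        "INSERT INTO ocr_cache (image_hash, text) VALUES (?, ?)", (key, text)
    )
    conn.commit()
    return text
```

Running the same image through `ocr_with_cache` twice calls `run_ocr` only once, which is what keeps repeated runs over the same folder cheap.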

Notes

The output file must use the .docx extension.
Images are processed recursively when a directory is provided.
Only text is extracted from the images: graphs and pictures are skipped, and formatting is not preserved.
Existing .docx files are loaded and appended to instead of being overwritten.
For best results, use clear images with readable text and minimal blur.

Cost concerns for external LLM providers

When using an external LLM service (such as Gemini or Ollama), the free tier might not be feasible, depending on the number and size of the images: it is slow and you will get rate limited very quickly. In that case, it is best to use a paid service and select the model manually.

It is best to use a model that is low cost and performs well. You don't need insanely high intelligence with top-of-the-line reasoning capabilities to simply read text from images.

With that in mind, I personally recommend either of the following models:

Ollama: qwen3-vl:4b
Gemini: gemini-2.5-flash
- costs around $0.00058 per 4k image
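
To get a feel for what the per-image figure means for a whole course, here is a small estimate. It assumes the approximate $0.00058 per image quoted above applies uniformly and ignores cache hits, retries, and any pricing changes; the function name is made up for this example.

```python
# Approximate Gemini figure quoted above for one 4k image.
COST_PER_IMAGE_USD = 0.00058


def estimate_cost_usd(image_count, cost_per_image=COST_PER_IMAGE_USD):
    """Rough cost estimate: number of images times the per-image cost."""
    if image_count < 0:
        raise ValueError("image_count must be non-negative")
    return image_count * cost_per_image


for count in (100, 1000, 10000):
    print(f"{count:>6} images: about ${estimate_cost_usd(count):.2f}")
```

At that rate, 1,000 slide photos come to roughly $0.58, so a paid tier stays cheap even for a full semester of lectures.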