LecturePptOcr
LecturePptOcr is a Python tool that extracts text from lecture slide images using a vision-capable LLM provider and writes the extracted content into a .docx document.
It is designed for workflows where lecture slides, screenshots, or photos are saved as image files and need to be converted into readable notes.
Features
- Extracts text from image files using LLM-based OCR
- Caches OCR results to avoid re-processing the same images
- Supports single image files or directories of images
- Recursively scans folders for supported image formats
- Writes extracted text into a .docx file
- Supports local and remote OCR through Ollama vision models (for remote use, the host must be set in config/config.py)
- Includes Gemini provider support (an API key is required in config/config.py)
- Supports the following image formats:
  - .jpg
  - .jpeg
  - .heic
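The file discovery described in the list above (a single image or a recursive folder scan, limited to the supported formats) could look roughly like the sketch below. It is not taken from the project's source: the function name `find_images` and the case-insensitive extension matching are assumptions made for illustration.

```python
from pathlib import Path

# Extensions listed in this README. Matching them case-insensitively
# is an assumption of this sketch, not documented behaviour.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".heic"}


def find_images(input_path):
    """Return supported image files under input_path, sorted by path.

    Hypothetical helper: accepts either a single file or a directory,
    and scans directories recursively.
    """
    path = Path(input_path)
    if path.is_file():
        # A single file is returned only if its extension is supported.
        return [path] if path.suffix.lower() in SUPPORTED_EXTENSIONS else []
    return sorted(
        p for p in path.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Calling `find_images("resources/")` would return every matching file in that folder and its subfolders as `Path` objects.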
Requirements
- Python 3.9+
- Ollama installed locally if using Ollama as the LLM provider (set llm_provider in config/config.py)
- A Google AI Studio API key and billing setup if using Gemini as the LLM provider (set llm_provider in config/config.py)
- A vision-capable model, for example qwen3-vl:4b with Ollama or gemini-2.5-flash with Gemini
Python Dependencies
Install the required packages inside your virtual environment:
pip install -r requirements.txt
Usage
Run the script from the project root:
python main.py -i "resources/" -o "lecture_notes.docx"
Arguments
| Argument | Short | Required | Description |
|---|---|---|---|
| --input-img-dir | -i | Yes | Path to an image file or a folder containing images |
| --output-file | -o | Yes | Path to the output .docx file |
| --dry-run | None | No | Dry-run mode: no file writing and no OCR scanning |
| --debug | -d | No | Enable debug-level logging |
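For readers who want to see how these arguments map onto a parser, here is a minimal `argparse` sketch. The flag names come from the table above; the `build_parser` function, the help strings, and the use of `store_true` for the two switches are assumptions, and the actual main.py may be structured differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the arguments table (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Extract text from lecture slide images into a .docx file.",
    )
    parser.add_argument(
        "-i", "--input-img-dir", required=True,
        help="Path to an image file or a folder containing images",
    )
    parser.add_argument(
        "-o", "--output-file", required=True,
        help="Path to the output .docx file",
    )
    # --dry-run has no short form in the table above.
    parser.add_argument(
        "--dry-run", action="store_true",
        help="No file writing and no OCR scanning",
    )
    parser.add_argument(
        "-d", "--debug", action="store_true",
        help="Enable debug-level logging",
    )
    return parser
```

With this sketch, `build_parser().parse_args(["-i", "resources/", "-o", "lecture_notes.docx"])` yields a namespace whose `input_img_dir` is `"resources/"` and whose `dry_run` and `debug` are both `False`.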
The example command above will:
- Scan the resources folder for supported images
- Sort the images into subfolders based on the date they were taken (oldest to most recent)
- Send each image to the configured LLM provider for OCR, unless the OCR cache already holds the text for that image from a previous run
- Extract text from the images
- Cache the OCR results in the OCR cache database for future runs
- Write the result to lecture_notes.docx
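The cache-then-OCR behaviour in the steps above can be illustrated with a small sketch. It assumes an SQLite table keyed by the SHA-256 hash of the image bytes; the real cache schema, key, and storage backend are not documented here, so `open_cache`, `ocr_with_cache`, and the `run_ocr` callback are all hypothetical names.

```python
import hashlib
import sqlite3


def open_cache(db_path=":memory:"):
    """Open (or create) a hypothetical OCR cache database."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ocr_cache ("
        "image_hash TEXT PRIMARY KEY, text TEXT NOT NULL)"
    )
    return conn


def ocr_with_cache(conn, image_bytes, run_ocr):
    """Return cached text for an image, calling run_ocr only on a miss.

    run_ocr stands in for the configured LLM provider call.
    """
    # Assumption: identical bytes mean the same image, so hash the content.
    key = hashlib.sha256(image_bytes).hexdigest()
    row = conn.execute(
        "SELECT text FROM ocr_cache WHERE image_hash = ?", (key,)
    ).fetchone()
    if row is not None:
        return row[0]  # cache hit: skip the provider call
    text = run_ocr(image_bytes)
    conn.execute(
        "INSERT INTO ocr_cache (image_hash, text) VALUES (?, ?)", (key, text)
    )
    conn.commit()
    return text
```

Hashing the file content rather than the path means a renamed or moved image would still hit the cache, which is one possible design choice among several.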
Notes
- The output file must use the .docx extension.
- Images are processed recursively when a directory is provided.
- Only text is extracted from the images: graphs, pictures, and formatting are not preserved.
- Existing .docx files are loaded and appended to instead of being overwritten.
- For best results, use clear images with readable text and minimal blur.
Cost concerns for external LLM providers
When using an external LLM service (such as Gemini or Ollama), the free tier may not be feasible, depending on the number and size of the images: it is slow, and you are likely to be rate limited quickly. In that case, it is best to use a paid service and select the model manually.
It is best to use a low-cost model with good performance. Simply reading text from images does not require top-of-the-line reasoning capabilities.
With that in mind, I personally recommend either of the following models:
- Ollama: qwen3-vl:4b
- Gemini: gemini-2.5-flash
  - costs around $0.00058 per 4K image
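To get a feel for what the quoted per-image figure means for a whole lecture series, a back-of-the-envelope estimate can be computed as below. The rate is the approximate number given above; actual pricing depends on the provider and image size, and `estimate_cost` is a hypothetical helper, not part of the tool.

```python
# Approximate figure quoted above for gemini-2.5-flash on a 4K image.
COST_PER_IMAGE_USD = 0.00058


def estimate_cost(image_count, cost_per_image=COST_PER_IMAGE_USD):
    """Rough total OCR cost in USD for image_count uncached images."""
    if image_count < 0:
        raise ValueError("image_count must be non-negative")
    return image_count * cost_per_image
```

At that rate, 500 images would come to roughly $0.29, and images already in the OCR cache would not add to the total.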