Building MarkItLikeItsHot: A FastAPI Wrapper for Microsoft's Markdown Converter

This post details the development of a Python microservice named “MarkItLikeItsHot,” designed to streamline content conversion into Markdown within Willow CMS. This microservice uses a FastAPI wrapper around Microsoft’s powerful MarkItDown library, enabling seamless conversion of files, URLs, and raw text into clean, formatted Markdown. We’ll explore the underlying technologies, get into the code structure, and highlight key aspects like Docker deployment, testing, and configuration.

Understanding the Core Technologies

Before diving into the implementation, let’s clarify the roles of the key technologies:

What is MarkItDown?

MarkItDown, developed by Microsoft, is a robust library for converting various document formats (like DOCX, PDF, HTML) into Markdown. It handles the complexities of parsing different file structures and extracting content, providing a consistent Markdown output.
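As a rough illustration of what calling the library looks like, here is a minimal sketch. It assumes the markitdown package is installed and relies on its MarkItDown().convert(...) call, whose result exposes the Markdown as text_content. The to_markdown helper and its optional converter parameter are hypothetical names introduced here so the function can also be exercised with a stand-in object.

```python
from typing import Any, Optional


def to_markdown(source: str, converter: Optional[Any] = None) -> str:
    """Convert a document (for example a local file path) to Markdown text.

    `converter` is a hypothetical injection point: any object with a
    `convert(source)` method whose result has a `text_content` attribute.
    """
    if converter is None:
        # Imported lazily so this module loads even without the package installed.
        from markitdown import MarkItDown

        converter = MarkItDown()
    result = converter.convert(source)
    return result.text_content


# Example call (placeholder path, requires markitdown to be installed):
# print(to_markdown("report.docx"))
```

Passing the converter in as a parameter is only a convenience for testing; the microservice described here wraps the same call behind HTTP endpoints.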

What is FastAPI?

FastAPI is a modern, high-performance Python web framework ideal for building APIs. It offers speed, ease of use, and automatic interactive API documentation.

Defining Endpoints with FastAPI

FastAPI uses decorators to define API endpoints. For example:

from fastapi import FastAPI, File, UploadFile
app = FastAPI()

@app.post("/convert/file")
async def convert_file(file: UploadFile = File(...)):
    ...  # conversion logic goes here

This defines a POST endpoint at /convert/file that accepts a file upload. Similarly, other endpoints handle text and URL conversions:

/convert/text: converts raw text submitted as JSON
/convert/url: fetches a web page and converts its content

Dockerizing the Microservice: Deployment Made Easy

Docker simplifies deployment and ensures consistency across environments. The setup involves a Dockerfile and docker-compose.yml.

The Dockerfile: Building the Image

The Dockerfile defines the environment for our microservice:

FROM python:3.13.1-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    git \
    ffmpeg \
    libsm6 \
    libxext6 \
    libmagic1 \
    tesseract-ocr \
    python3-venv \
    && rm -rf /var/lib/apt/lists/*

# Create and activate virtual environment
ENV VIRTUAL_ENV=/opt/venv
RUN python -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

# Copy requirements first for better caching
COPY requirements.txt .

# Update pip and install requirements in the virtual environment
RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -r requirements.txt

# Copy the rest of the application
COPY ./app /app/app

# Command to run the application
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

The Dockerfile sets up a lightweight and efficient environment for the microservice. We begin by basing our image on python:3.13.1-slim, a lean version of the official Python 3.13.1 image, minimizing the image’s overall size and attack surface. We then set the working directory inside the container to /app.

The next step installs the system dependencies. Using apt-get, we update the package list and install tools like git, ffmpeg, tesseract-ocr, and others. MarkItDown relies on these for tasks such as handling different file formats and processing images. The rm -rf /var/lib/apt/lists/* command removes the downloaded apt package lists to further reduce the Docker image size.

To isolate our project’s dependencies and avoid conflicts, we create a Python virtual environment. We set the VIRTUAL_ENV environment variable to /opt/venv and use python -m venv to create it. Prepending its bin directory to PATH has the same effect as activating it, so python and pip resolve to the virtual environment in every later step. For better Docker build caching, we copy requirements.txt before installing the Python dependencies, which lets Docker reuse the cached layers when the requirements haven’t changed. We then upgrade pip and install the project’s Python packages with the --no-cache-dir flag, which keeps pip’s download cache out of the image; reuse between builds comes from Docker’s layer cache instead.

Finally, we copy the application code into the container’s /app/app directory. The CMD instruction specifies the command to run when the container starts. It uses uvicorn to serve our FastAPI application, listening on all interfaces (0.0.0.0) on port 8000. This port is then mapped to a port on the host machine when running the container, allowing external access to the microservice.

Docker Compose: Orchestrating the Service

docker-compose.yml simplifies running the service with defined ports and volumes:

services:
  markitdown:
    build: 
      context: ./markitdown-service
      dockerfile: Dockerfile
    ports:
      - "8000:8000"
    volumes:
      - ./markitdown-service/app:/app/app
    environment:
      - ENVIRONMENT=development

Building and running becomes straightforward:

docker compose build
docker compose up -d
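With the container up, the service should be reachable on port 8000 of the host. The following sketch builds and sends a request to /convert/text using only the Python standard library. The base URL assumes the 8000:8000 port mapping shown above, the helper names are hypothetical, and the response is returned as raw text because the response schema is not described in this post.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumes the 8000:8000 port mapping


def build_text_request(content: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for /convert/text with a JSON body."""
    body = json.dumps({"content": content}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/convert/text",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def convert_text(content: str, base_url: str = BASE_URL, timeout: float = 30.0) -> str:
    """Send the request and return the raw response body as text."""
    request = build_text_request(content, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the service running, print(convert_text("Some text here")) shows whatever the endpoint returns.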

Deep Dive into `main.py`: The Core Logic

The main.py file houses the core logic of the microservice:

Data Handling and Validation: Pydantic models (TextInput, UrlInput) ensure data integrity.

This TextInput model validates the structure of incoming JSON for the /convert/text endpoint, enforcing the presence of a content field and an optional options field that can either be a dictionary (dict) or None.

class TextInput(BaseModel):
    content: str
    options: Optional[dict] = None

Here’s how this works in practice:

Valid Request (with options):

{
  "content": "Some text here",
  "options": {
    "some_option": "some_value"
  }
}

Valid Request (without options):

{
  "content": "Some text here"
}

In the second example, even though options isn’t provided, the request is still valid, and options will be set to None within the application logic. This lets users supply additional options when needed without making them mandatory. At present the options are not passed to the MarkItDown library. I include them for future extensibility of the microservice, such as conversion options when processing a document or web page.
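To see the validation behaviour in isolation, the sketch below redefines the same model and feeds it the two payloads above plus one invalid payload. It assumes Pydantic is installed; apart from the model definition, nothing here is taken from the project’s main.py.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class TextInput(BaseModel):
    content: str
    options: Optional[dict] = None


# Payload with options: both fields are populated.
with_options = TextInput(content="Some text here", options={"some_option": "some_value"})

# Payload without options: the field falls back to None.
without_options = TextInput(content="Some text here")

# Payload missing the required content field: validation fails.
try:
    TextInput(options={"some_option": "some_value"})
    missing_content_rejected = False
except ValidationError:
    missing_content_rejected = True

print(with_options.options, without_options.options, missing_content_rejected)
```

When the model is used as a FastAPI request body, the same validation failure reaches the client as a 422 response.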

Temporary File Management: The save_temp_file function handles temporary storage of uploaded files and fetched web content, ensuring secure and efficient processing.
MarkItDown Integration: The process_conversion function encapsulates the conversion logic using the MarkItDown library. This function also includes specialized logic for handling Wikipedia URLs, optimizing content extraction for these pages.
Error Handling: Error handling with custom exceptions (FileProcessingError, ConversionError, URLFetchError) and HTTP exceptions provides informative user feedback. These can be extended and should probably move to a separate file in future.
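The helper and exception names above come from main.py, but their bodies are not shown in this post. As a sketch only, here is one way such a helper could look using the standard library. The size limit, the extension set, the status-code mapping, and the choice of a context manager are all illustrative assumptions, not the project’s actual implementation.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator

# Illustrative values only; the real ones live in the project's config.
MAX_FILE_SIZE = 10 * 1024 * 1024
SUPPORTED_EXTENSIONS = {".docx", ".pdf", ".html", ".txt"}


class FileProcessingError(Exception):
    """Raised when an upload cannot be stored or fails basic checks."""


class ConversionError(Exception):
    """Raised when the converter fails on an otherwise valid input."""


class URLFetchError(Exception):
    """Raised when a remote page cannot be fetched."""


# Hypothetical mapping from exception type to HTTP status code.
STATUS_CODES = {FileProcessingError: 400, URLFetchError: 502, ConversionError: 500}


@contextmanager
def save_temp_file(data: bytes, suffix: str) -> Iterator[str]:
    """Write data to a temporary file, yield its path, and always clean up."""
    if suffix.lower() not in SUPPORTED_EXTENSIONS:
        raise FileProcessingError(f"Unsupported file type: {suffix}")
    if len(data) > MAX_FILE_SIZE:
        raise FileProcessingError("File exceeds the maximum allowed size")
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        yield path
    finally:
        # Remove the file even if the caller raised while using it.
        if os.path.exists(path):
            os.remove(path)
```

Wrapping the temporary file in a context manager means the cleanup runs whether or not the conversion succeeds, which is one way to keep failed conversions from leaving files behind.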

Implementing Rate Limiting

Rate limiting is essential to protect the service from abuse if it is exposed to the internet. We use slowapi for this, applied to each endpoint through a decorator:

# slowapi expects the route decorator above the limit decorator
@app.post("/convert/text")
@limiter.limit(f"{settings.RATE_LIMIT_REQUESTS}/hour")
# ...

This example limits requests to the /convert/text endpoint to 10 per hour per client IP, with the limit read from settings.
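slowapi takes care of the counting internally. To make the idea of a fixed number of requests per hour per client IP concrete, here is a standalone fixed-window counter using only the standard library. It is not how slowapi is implemented and is not part of the project; the class and parameter names are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """Allow at most `limit` requests per `window_seconds` for each key."""

    def __init__(
        self,
        limit: int,
        window_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        # key -> (window start time, requests seen in that window)
        self._windows: Dict[str, Tuple[float, int]] = {}

    def allow(self, key: str) -> bool:
        """Record a request for `key` and report whether it is within the limit."""
        now = self.clock()
        start, count = self._windows.get(key, (now, 0))
        if now - start >= self.window_seconds:
            # The previous window has expired; start a fresh one.
            start, count = now, 0
        if count >= self.limit:
            self._windows[key] = (start, count)
            return False
        self._windows[key] = (start, count + 1)
        return True


# Ten requests per hour, keyed by a client IP (documentation address).
hourly = FixedWindowLimiter(limit=10)
first_request_allowed = hourly.allow("203.0.113.7")
```

A fixed window is the simplest scheme to reason about; it permits short bursts around a window boundary, which is usually acceptable for a limit as coarse as ten requests per hour.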

Code Structure and Configuration

MarkItLikeItsHot maintains a clean, DRY (Don’t Repeat Yourself) codebase through helper functions and centralized configuration.

Helper Functions: Functions like save_temp_file and process_conversion reduce redundancy and enhance readability.
Centralized Configuration: The config.py file, using pydantic_settings, manages application-wide settings, allowing easy customization of parameters like MAX_FILE_SIZE, SUPPORTED_EXTENSIONS, REQUEST_TIMEOUT, and rate limits.

Testing with Pytest and Docker

MarkItLikeItsHot uses pytest for testing, and Docker simplifies running these tests in a consistent environment. The docker-compose.yml file defines a separate test service specifically for running the tests:

  test:
    build: 
      context: ./markitdown-service
      dockerfile: Dockerfile
    volumes:
      - ./markitdown-service:/app
    environment:
      - ENVIRONMENT=test
      - PYTHONPATH=/app
    entrypoint: python -m pytest
    command: /app/tests/test_api.py -v --capture=no
    depends_on:
      - markitdown
    profiles:
      - test

Here’s how the test service is configured:

build: It uses the same Dockerfile as the main application, ensuring the testing environment mirrors the production environment.
volumes: Mounts the project directory (./markitdown-service) into the container at /app, making the test code accessible.
environment: Sets the ENVIRONMENT variable to test. This can be used within the application code (e.g., in config.py) to adjust settings specifically for the test environment, such as logging levels or database connections. The PYTHONPATH=/app setting ensures that the test runner can find the application code.
entrypoint: Defines the entry point for the container as python -m pytest. This runs the pytest test runner.
command: Specifies the command-line arguments for pytest. -v enables verbose output, and --capture=no disables output capturing, making it easier to see print statements during test execution. /app/tests/test_api.py specifies the test file to run. You can modify this to run specific tests or directories within the tests folder.
depends_on: Ensures that the markitdown service (the main application) is running before the tests start. This is crucial for integration tests that interact with the API endpoints.
profiles: Uses Docker Compose profiles to include this service only when running tests. This prevents the test container from starting when simply running docker compose up. To run the tests, use sudo docker-compose --profile test run --rm test, where the --rm option removes the container after the tests have run.

The tests cover a wide range of scenarios:

File Conversion Tests: Verify correct conversion of different file types, for example a Word document.
URL Conversion Tests: Test fetching and conversion of web pages, including specialized tests for Wikipedia URLs and a test that converts a BBC News article.
Error Handling Tests: Ensure robust handling of invalid input and edge cases.
Rate Limiting Tests: Validate correct enforcement of rate limits.

The test setup uses fixtures defined in conftest.py for efficient test setup and teardown.

Testing with GitHub Actions

GitHub Actions automates the execution of these tests on every push and pull request, maintaining continuous integration.

Next Steps

The immediate next step is integrating MarkItLikeItsHot into Willow CMS, providing a user-friendly interface for converting uploads, links, and text directly into Markdown drafts for blog posts and pages. Further features will include:

API key support to secure the service
Exploring and exposing additional MarkItDown features
Adding support for custom conversion options (via the currently unused options field)
Implementing more sophisticated error handling
Exploring asynchronous processing for improved performance with very large files

So, check it out. The code is on GitHub with a good readme to help you start using it for your own projects.
