[Cover image: a bearded chef in a rustic kitchen carefully preparing a beef wellington, with pastry dough, herbs, and other ingredients visible on the counter.]

This post details the development of a Python microservice named "MarkItLikeItsHot," designed to streamline content conversion into Markdown within Willow CMS. This microservice uses a FastAPI wrapper around Microsoft's powerful MarkItDown library, enabling seamless conversion of files, URLs, and raw text into clean, formatted Markdown. We'll explore the underlying technologies, get into the code structure, and highlight key aspects like Docker deployment, testing, and configuration.

Understanding the Core Technologies

Before diving into the implementation, let's clarify the roles of the key technologies:

What is MarkItDown?

MarkItDown, developed by Microsoft, is a robust library for converting various document formats (like DOCX, PDF, HTML) into Markdown. It handles the complexities of parsing different file structures and extracting content, providing a consistent Markdown output.
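
To give a feel for the library, basic usage looks roughly like this (the file name is just a placeholder; the convert method and its text_content result are the documented MarkItDown API):

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("example.docx")   # also accepts PDF, HTML, and other formats
print(result.text_content)            # the extracted Markdown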

What is FastAPI?

FastAPI is a modern, high-performance Python web framework ideal for building APIs. It offers speed, ease of use, and automatic interactive API documentation.

Defining Endpoints with FastAPI

FastAPI uses decorators to define API endpoints. For example:

from fastapi import FastAPI, File, UploadFile
app = FastAPI()

@app.post("/convert/file")
async def convert_file(file: UploadFile = File(...)):
    # ... conversion logic ...

This defines a POST endpoint at /convert/file that accepts a file upload. Similarly, other endpoints handle text and URL conversions:
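
A rough sketch of what those routes could look like, continuing the snippet above (the request models are named after the ones described later in this post; the exact path and fields for the URL route are my assumption):

from typing import Optional
from pydantic import BaseModel

class TextInput(BaseModel):
    content: str
    options: Optional[dict] = None

class UrlInput(BaseModel):
    url: str
    options: Optional[dict] = None

@app.post("/convert/text")
async def convert_text(payload: TextInput):
    # ... conversion logic ...
    ...

@app.post("/convert/url")
async def convert_url(payload: UrlInput):
    # ... fetch the page, then convert ...
    ...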

Dockerizing the Microservice: Deployment Made Easy

Docker simplifies deployment and ensures consistency across environments. The setup involves a Dockerfile and docker-compose.yml.

The Dockerfile: Building the Image

The Dockerfile defines the environment for our microservice:

FROM python:3.13.1-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    git \
    ffmpeg \
    libsm6 \
    libxext6 \
    libmagic1 \
    tesseract-ocr \
    python3-venv \
    && rm -rf /var/lib/apt/lists/*

# Create and activate virtual environment
ENV VIRTUAL_ENV=/opt/venv
RUN python -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

# Copy requirements first for better caching
COPY requirements.txt .

# Update pip and install requirements in the virtual environment
RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -r requirements.txt

# Copy the rest of the application
COPY ./app /app/app

# Command to run the application
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

The Dockerfile sets up a lightweight and efficient environment for the microservice. We begin by basing our image on python:3.13.1-slim, a lean version of the official Python 3.13.1 image, minimizing the image's overall size and attack surface. We then set the working directory inside the container to /app.

The next step involves installing system dependencies. Using apt-get, we update the package list and install tools like git, ffmpeg, tesseract-ocr, and others. These are required for various functionalities within the MarkItDown library, such as handling different file formats and processing images. The rm -rf /var/lib/apt/lists/* command cleans up the apt cache to further reduce the Docker image size.

To isolate our project's dependencies and avoid conflicts, we create and activate a Python virtual environment. We set the VIRTUAL_ENV environment variable to /opt/venv and use python -m venv to create it. Modifying the PATH ensures that commands from the virtual environment are prioritized. For better Docker build caching, we copy requirements.txt before installing the Python dependencies. This allows Docker to reuse the cached layers if the requirements haven't changed. We then upgrade pip and install the project's Python packages using pip install with the --no-cache-dir flag to avoid caching within the container (caching is handled by Docker's layered system).

Finally, we copy the application code into the container's /app/app directory. The CMD instruction specifies the command to run when the container starts. It uses uvicorn to serve our FastAPI application, listening on all interfaces (0.0.0.0) on port 8000. This port is then mapped to a port on the host machine when running the container, allowing external access to the microservice.

Docker Compose: Orchestrating the Service

docker-compose.yml simplifies running the service with defined ports and volumes:

services:
  markitdown:
    build: 
      context: ./markitdown-service
      dockerfile: Dockerfile
    ports:
      - "8000:8000"
    volumes:
      - ./markitdown-service/app:/app/app
    environment:
      - ENVIRONMENT=development

Building and running becomes straightforward:

docker compose build
docker compose up -d
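
Once the container is up, a quick way to exercise the API is from Python (assuming the requests package is installed; the response schema is whatever the service returns, so treat this as a sketch):

import requests

BASE_URL = "http://localhost:8000"

# Convert raw text/HTML to Markdown
resp = requests.post(f"{BASE_URL}/convert/text", json={"content": "<h1>Hello</h1><p>World</p>"})
print(resp.status_code, resp.json())

# Convert an uploaded file
with open("example.docx", "rb") as fh:
    resp = requests.post(f"{BASE_URL}/convert/file", files={"file": fh})
print(resp.status_code, resp.json())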

Deep Dive into main.py: The Core Logic

The main.py file houses the core logic of the microservice: Link to main.py

  • Data Handling and Validation: Pydantic models (TextInput, UrlInput) ensure data integrity.

The TextInput model shown below validates the structure of incoming JSON for the /convert/text endpoint: it requires a content field and allows an optional options field that can be either a dictionary (dict) or None.

from typing import Optional
from pydantic import BaseModel

class TextInput(BaseModel):
    content: str
    options: Optional[dict] = None

Here's how this works in practice:

Valid Request (with options):

{
  "content": "Some text here",
  "options": {
    "some_option": "some_value"
  }
}

Valid Request (without options):

{
  "content": "Some text here"
}

In the second example, even though options isn't provided, the request is still valid, and options will be set to None within the application logic. This flexibility allows users to provide additional options if needed, but doesn't make them mandatory. At present the options are not passed to the MarkItDown library; I include them for future extensibility of the microservice, such as conversion options when processing a document or web page.
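
A quick way to see this behaviour is to instantiate the model directly, continuing with the TextInput class defined above (Pydantic raises a ValidationError where FastAPI would return an HTTP 422):

from pydantic import ValidationError

print(TextInput(content="Some text here").options)                      # -> None
print(TextInput(content="Hi", options={"some_option": "some_value"}).options)

try:
    TextInput(options={"some_option": "some_value"})                    # content is missing
except ValidationError as err:
    print("rejected:", len(err.errors()), "validation error")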

  • Temporary File Management: The save_temp_file function handles temporary storage of uploaded files and fetched web content, ensuring secure and efficient processing (a sketch of this helper pattern follows this list).
  • MarkItDown Integration: The process_conversion function encapsulates the conversion logic using the MarkItDown library. This function also includes specialized logic for handling Wikipedia URLs, optimizing content extraction for these pages.
  • Error Handling: Error handling with custom exceptions (FileProcessingError, ConversionError, URLFetchError) and HTTP exceptions provides informative user feedback. These can be extended and should probably be moved to a separate file in the future.
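
The project's actual helpers live in main.py, but a minimal sketch of the pattern, assuming the function and exception names listed above, might look like this:

import shutil
import tempfile
from pathlib import Path

from fastapi import UploadFile
from markitdown import MarkItDown

class FileProcessingError(Exception):
    """Raised when an uploaded file cannot be stored or read."""

class ConversionError(Exception):
    """Raised when MarkItDown fails to convert the input."""

def save_temp_file(upload: UploadFile, suffix: str = "") -> Path:
    """Persist an upload to a named temporary file and return its path."""
    try:
        with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
            shutil.copyfileobj(upload.file, tmp)
            return Path(tmp.name)
    except OSError as exc:
        raise FileProcessingError(f"Could not save upload: {exc}") from exc

def process_conversion(path: Path) -> str:
    """Run MarkItDown over a temporary file and return the Markdown text."""
    try:
        return MarkItDown().convert(str(path)).text_content
    except Exception as exc:
        raise ConversionError(f"Conversion failed: {exc}") from exc
    finally:
        path.unlink(missing_ok=True)  # always clean up the temporary file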

Implementing Rate Limiting

Rate limiting is essential to protect the service from abuse if I expose it to the internet. We use slowapi for this, applied as a decorator on each endpoint:

@limiter.limit(f"{settings.RATE_LIMIT_REQUESTS}/hour")
@app.post("/convert/text")
# ...

This example limits requests to the /convert/text endpoint to 10 per hour per client IP (the limit comes from settings). Link to rate limiting decorator for this endpoint
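
For context, here's a minimal, self-contained sketch of how slowapi is typically wired up (not the project's exact code; slowapi's documentation places the limit decorator beneath the route decorator and requires a request: Request parameter on the endpoint):

from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)  # limits are keyed by client IP
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/convert/text")
@limiter.limit("10/hour")                       # in the real service this comes from settings.RATE_LIMIT_REQUESTS
async def convert_text(request: Request):
    return {"markdown": "..."}                  # placeholder response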

Code Structure and Configuration

MarkItLikeItsHot maintains a clean, DRY (Don't Repeat Yourself) codebase through helper functions and centralized configuration.

  • Helper Functions: Functions like save_temp_file and process_conversion reduce redundancy and enhance readability.
  • Centralized Configuration: The config.py file, using pydantic_settings, manages application-wide settings, allowing easy customization of parameters like MAX_FILE_SIZE, SUPPORTED_EXTENSIONS, REQUEST_TIMEOUT, and rate limits (a sketch of this file follows below). Link to config.py
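
As an illustration, a pydantic_settings class along these lines would do the job (the field names come from this post, but the defaults here are placeholders, not the project's real values):

from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    ENVIRONMENT: str = "development"
    MAX_FILE_SIZE: int = 10 * 1024 * 1024              # bytes (placeholder default)
    SUPPORTED_EXTENSIONS: set[str] = {".pdf", ".docx", ".html", ".txt"}
    REQUEST_TIMEOUT: int = 30                           # seconds (placeholder default)
    RATE_LIMIT_REQUESTS: int = 10                       # requests per hour per client IP

settings = Settings()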

Testing with Pytest and Docker

MarkItLikeItsHot uses pytest for testing, and Docker simplifies running these tests in a consistent environment. The docker-compose.yml file defines a separate test service specifically for running the tests:

  test:
    build: 
      context: ./markitdown-service
      dockerfile: Dockerfile
    volumes:
      - ./markitdown-service:/app
    environment:
      - ENVIRONMENT=test
      - PYTHONPATH=/app
    entrypoint: python -m pytest
    command: /app/tests/test_api.py -v --capture=no
    depends_on:
      - markitdown
    profiles:
      - test

Here's how the test service is configured:

  • build: It uses the same Dockerfile as the main application, ensuring the testing environment mirrors the production environment.
  • volumes: Mounts the project directory (./markitdown-service) into the container at /app, making the test code accessible.
  • environment: Sets the ENVIRONMENT variable to test. This can be used within the application code (e.g., in config.py) to adjust settings specifically for the test environment, such as logging levels or database connections. The PYTHONPATH=/app setting ensures that the test runner can find the application code.
  • entrypoint: Defines the entry point for the container as python -m pytest. This runs the pytest test runner.
  • command: Specifies the command-line arguments for pytest. -v enables verbose output, and --capture=no disables output capturing, making it easier to see print statements during test execution. /app/tests/test_api.py specifies the test file to run. You can modify this to run specific tests or directories within the tests folder.
  • depends_on: Ensures that the markitdown service (the main application) is running before the tests start. This is crucial for integration tests that interact with the API endpoints.
  • profiles: Uses Docker Compose profiles to include this service only when running tests. This prevents the test container from starting when simply running docker compose up. To run the tests, you'd use sudo docker-compose --profile test run --rm test, where the --rm flag removes the container after the tests have run.

The tests cover a wide range of scenarios.

The test suite uses fixtures defined in conftest.py for efficient setup and teardown. Read the source for conftest.py.
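
As a rough illustration only (the real suite lives in tests/test_api.py and conftest.py in the repository), a fixture-plus-test pairing could look like this:

# conftest.py -- a shared TestClient fixture
import pytest
from fastapi.testclient import TestClient
from app.main import app

@pytest.fixture
def client():
    return TestClient(app)

# tests/test_api.py -- two representative checks
def test_convert_text_returns_ok(client):
    response = client.post("/convert/text", json={"content": "# Hello"})
    assert response.status_code == 200

def test_missing_content_is_rejected(client):
    response = client.post("/convert/text", json={})
    assert response.status_code == 422  # Pydantic validation failure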

Testing with GitHub Actions

GitHub Actions automates the execution of these tests on every push and pull request, maintaining continuous integration. Link to GitHub Actions workflow

Next Steps

The immediate next step is integrating MarkItLikeItsHot into Willow CMS, providing a user-friendly interface for converting uploads, links, and text directly into Markdown drafts for blog posts and pages. Further features will include:

  • API key support to secure the service
  • Additional MarkItDown features, explored and exposed through the API
  • Support for custom conversion options (via the currently unused options field)
  • More sophisticated error handling
  • Asynchronous processing for improved performance with very large files

So, check it out. The code is on GitHub with a good readme to help you start using it for your own projects.

Tags

CMS DockerCompose FastAPI Code ContentEditing Docker Python GitHubActions Microservices Testing Dockerfile