This post details the development of a Python microservice named "MarkItLikeItsHot," designed to streamline content conversion into Markdown within Willow CMS. This microservice uses a FastAPI wrapper around Microsoft's MarkItDown library to convert files, URLs, and raw text into clean, formatted Markdown. We'll explore the underlying technologies, walk through the code structure, and highlight key aspects like Docker deployment, testing, and configuration.
Understanding the Core Technologies
Before diving into the implementation, let's clarify the roles of the key technologies:
What is MarkItDown?
MarkItDown, developed by Microsoft, is a robust library for converting various document formats (like DOCX, PDF, HTML) into Markdown. It handles the complexities of parsing different file structures and extracting content, providing a consistent Markdown output.
What is FastAPI?
FastAPI is a modern, high-performance Python web framework ideal for building APIs. It offers speed, ease of use, and automatic interactive API documentation.
Defining Endpoints with FastAPI
FastAPI uses decorators to define API endpoints. For example:
@app.post("/convert/file")
async def convert_file(file: UploadFile = File(...)):
    ...  # conversion logic goes here
This defines a POST endpoint at /convert/file that accepts a file upload. Similarly, other endpoints handle text and URL conversions:
- /convert/text: converts raw text
- /convert/url: converts the content of a web page
Dockerizing the Microservice: Deployment Made Easy
Docker simplifies deployment and ensures consistency across environments. The setup involves a Dockerfile and docker-compose.yml.
The Dockerfile: Building the Image
The Dockerfile defines the environment for our microservice:
FROM python:3.13.1-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
    git \
    ffmpeg \
    libsm6 \
    libxext6 \
    libmagic1 \
    tesseract-ocr \
    python3-venv \
    && rm -rf /var/lib/apt/lists/*
# Create and activate virtual environment
ENV VIRTUAL_ENV=/opt/venv
RUN python -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
# Copy requirements first for better caching
COPY requirements.txt .
# Update pip and install requirements in the virtual environment
RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -r requirements.txt
# Copy the rest of the application
COPY ./app /app/app
# Command to run the application
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
The Dockerfile sets up a lightweight and efficient environment for the microservice. We begin by basing our image on python:3.13.1-slim, a lean version of the official Python 3.13.1 image, minimizing the image's overall size and attack surface. We then set the working directory inside the container to /app.
The next step involves installing system dependencies. Using apt-get, we update the package list and install tools like git, ffmpeg, tesseract-ocr, and others. These are required for various functionalities within the MarkItDown library, such as handling different file formats and processing images. The rm -rf /var/lib/apt/lists/* command removes the downloaded apt package lists to further reduce the Docker image size.
To isolate our project's dependencies and avoid conflicts, we create and activate a Python virtual environment. We set the VIRTUAL_ENV environment variable to /opt/venv and use python -m venv to create it. Modifying the PATH ensures that commands from the virtual environment are prioritized. For better Docker build caching, we copy requirements.txt before installing the Python dependencies. This allows Docker to reuse the cached layers if the requirements haven't changed. We then upgrade pip and install the project's Python packages using pip install with the --no-cache-dir flag to avoid caching within the container (caching is handled by Docker's layered system).
Finally, we copy the application code into the container's /app/app directory. The CMD instruction specifies the command to run when the container starts. It uses uvicorn to serve our FastAPI application, listening on all interfaces (0.0.0.0) on port 8000. This port is then mapped to a port on the host machine when running the container, allowing external access to the microservice.
Docker Compose: Orchestrating the Service
docker-compose.yml simplifies running the service with defined ports and volumes:
services:
  markitdown:
    build:
      context: ./markitdown-service
      dockerfile: Dockerfile
    ports:
      - "8000:8000"
    volumes:
      - ./markitdown-service/app:/app/app
    environment:
      - ENVIRONMENT=development
Building and running becomes straightforward:
docker compose build
docker compose up -d
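Once the container is up, a quick smoke check can confirm the service is answering on the mapped port. The following is a minimal sketch, not part of the project: it assumes the service is reachable at http://localhost:8000 (the 8000:8000 mapping above) and that FastAPI's interactive docs are still served at the default /docs path. The service_is_up name and the opener parameter are hypothetical; opener exists only so the function can be exercised without a running container.

```python
import urllib.error
import urllib.request


def service_is_up(base_url="http://localhost:8000", timeout=2.0,
                  opener=urllib.request.urlopen):
    """Return True if GET {base_url}/docs answers with HTTP 200."""
    try:
        # /docs is FastAPI's default Swagger UI path (assumed not disabled).
        with opener(f"{base_url}/docs", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

Calling service_is_up() right after docker compose up -d may return False for a moment while uvicorn starts, so retrying a few times with a short pause is reasonable.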
Deep Dive into main.py: The Core Logic
The main.py file houses the core logic of the microservice. Link to main.py
This TextInput model validates the structure of incoming JSON for the /convert/text endpoint, enforcing the presence of a content field and an optional options field that can either be a dictionary (dict) or None.
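from typing import Optional
from pydantic import BaseModel

class TextInput(BaseModel):
    content: str  # required: the raw text to convert
    options: Optional[dict] = None  # optional: reserved for future conversion options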
Here's how this works in practice:
Valid Request (with options):
{
  "content": "Some text here",
  "options": {
    "some_option": "some_value"
  }
}
Valid Request (without options):
{
  "content": "Some text here"
}
In the second example, even though options isn't provided, the request is still valid, and options will be set to None within the application logic. This flexibility allows users to provide additional options if needed, but doesn't make them mandatory. At present the options are not passed to the MarkItDown library. I include them for future extensibility of the microservice, such as conversion options when processing a document or web page.
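From the client side, the two request shapes above can be produced with a small helper. This is a sketch using only the standard library; build_text_request and BASE_URL are illustrative names, and the base URL assumes the service is running locally with the port mapping shown earlier.

```python
import json
import urllib.request
from typing import Optional

# Assumed base URL, matching the 8000:8000 port mapping in docker-compose.yml.
BASE_URL = "http://localhost:8000"


def build_text_request(content: str, options: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /convert/text endpoint."""
    payload = {"content": content}
    if options is not None:
        # "options" is optional, so it is left out entirely when unused.
        payload["options"] = options
    return urllib.request.Request(
        url=f"{BASE_URL}/convert/text",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With the service running, passing the result to urllib.request.urlopen(request, timeout=10) would send it. The shape of the response body is not described here, so the sketch stops at building the request.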
- Temporary File Management: The save_temp_file function handles temporary storage of uploaded files and fetched web content, ensuring secure and efficient processing.
- MarkItDown Integration: The process_conversion function encapsulates the conversion logic using the MarkItDown library. This function also includes specialized logic for handling Wikipedia URLs, optimizing content extraction for these pages.
- Error Handling: Error handling with custom exceptions (FileProcessingError, ConversionError, URLFetchError) and HTTP exceptions provides informative user feedback. These can be extended and should probably move to a separate file in future.
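To make the temporary-file idea concrete, here is one way such a helper could look. It is a sketch under assumptions, not the project's save_temp_file: the temp_file_for name, its signature, and the context-manager shape are all hypothetical. Only the FileProcessingError name comes from the list above.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


class FileProcessingError(Exception):
    """Raised when content cannot be stored on disk (name taken from the post)."""


@contextmanager
def temp_file_for(content: bytes, suffix: str = "") -> Iterator[str]:
    """Write content to a temporary file, yield its path, then delete the file."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        try:
            with os.fdopen(fd, "wb") as handle:
                handle.write(content)
        except OSError as exc:
            raise FileProcessingError(f"could not store content: {exc}") from exc
        yield path
    finally:
        # Always clean up, even if the conversion step raised an exception.
        if os.path.exists(path):
            os.remove(path)
```

Keeping the suffix (for example .docx) matters when a converter picks its parser from the file extension, and the finally block means a failed conversion does not leave uploads behind on disk.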
Implementing Rate Limiting
Rate limiting is essential to protect the service from abuse if I expose it to the internet. We use slowapi for this purpose, applied to each endpoint through a decorator:
@app.post("/convert/text")
@limiter.limit(f"{settings.RATE_LIMIT_REQUESTS}/hour")  # slowapi: the limit decorator goes below the route decorator
# ...
This example limits requests to the /convert/text endpoint to 10 per hour (as per settings) per client IP. Link to rate limiting decorator for this endpoint
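To show what "10 per hour per client IP" means in practice, here is a small fixed-window counter in plain Python. It is an illustration of the idea only, not how slowapi is implemented: the FixedWindowLimiter class and its clock parameter are hypothetical.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """Allow at most `limit` requests per `window_seconds` for each client key."""

    def __init__(self, limit: int = 10, window_seconds: int = 3600,
                 clock: Callable[[], float] = time.time) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the behaviour can be checked without waiting
        # Maps (client key, window index) -> number of requests seen so far.
        self._counts: Dict[Tuple[str, int], int] = defaultdict(int)

    def allow(self, client_ip: str) -> bool:
        window = int(self.clock() // self.window_seconds)
        key = (client_ip, window)
        if self._counts[key] >= self.limit:
            return False  # a web service would typically answer HTTP 429 here
        self._counts[key] += 1
        return True
```

Each client gets its own counter, and the counter resets when the clock crosses into the next window. This sketch never discards old windows, so a long-running process would need to prune them.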
Code Structure and Configuration
MarkItLikeItsHot maintains a clean, DRY (Don't Repeat Yourself) codebase through helper functions and centralized configuration.
- Helper Functions: Functions like save_temp_file and process_conversion reduce redundancy and enhance readability.
- Centralized Configuration: The config.py file, using pydantic_settings, manages application-wide settings, allowing easy customization of parameters like MAX_FILE_SIZE, SUPPORTED_EXTENSIONS, REQUEST_TIMEOUT, and rate limits. Link to config.py
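The idea behind centralized settings can be sketched with the standard library alone. Note that this swaps in a plain dataclass in place of pydantic_settings, and it is not the project's config.py: the default values are assumptions (except the rate limit of 10 per hour mentioned earlier), and from_env and is_acceptable are hypothetical helpers.

```python
import os
from dataclasses import dataclass
from typing import FrozenSet, Mapping


@dataclass(frozen=True)
class Settings:
    MAX_FILE_SIZE: int = 10 * 1024 * 1024  # bytes (assumed default)
    SUPPORTED_EXTENSIONS: FrozenSet[str] = frozenset({".docx", ".pdf", ".html"})  # assumed
    REQUEST_TIMEOUT: int = 10  # seconds (assumed default)
    RATE_LIMIT_REQUESTS: int = 10  # requests per hour, as stated in the post

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Settings":
        """Override the integer settings from environment variables when present."""
        defaults = cls()
        return cls(
            MAX_FILE_SIZE=int(env.get("MAX_FILE_SIZE", defaults.MAX_FILE_SIZE)),
            REQUEST_TIMEOUT=int(env.get("REQUEST_TIMEOUT", defaults.REQUEST_TIMEOUT)),
            RATE_LIMIT_REQUESTS=int(env.get("RATE_LIMIT_REQUESTS", defaults.RATE_LIMIT_REQUESTS)),
        )


def is_acceptable(filename: str, size: int, settings: Settings) -> bool:
    """Check an upload against the extension allow-list and the size cap."""
    extension = os.path.splitext(filename)[1].lower()
    return extension in settings.SUPPORTED_EXTENSIONS and size <= settings.MAX_FILE_SIZE
```

Because every limit lives in one object, an endpoint can validate an upload with a single call such as is_acceptable(file.filename, size, settings), and a deployment can tighten the limits through environment variables without code changes.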
Testing with Pytest and Docker
MarkItLikeItsHot uses pytest for testing, and Docker simplifies running these tests in a consistent environment. The docker-compose.yml file defines a separate test service specifically for running the tests:
  test:
    build:
      context: ./markitdown-service
      dockerfile: Dockerfile
    volumes:
      - ./markitdown-service:/app
    environment:
      - ENVIRONMENT=test
      - PYTHONPATH=/app
    entrypoint: python -m pytest
    command: /app/tests/test_api.py -v --capture=no
    depends_on:
      - markitdown
    profiles:
      - test
Here's how the test service is configured:
- build: It uses the same Dockerfile as the main application, ensuring the testing environment mirrors the production environment.
- volumes: Mounts the project directory (./markitdown-service) into the container at /app, making the test code accessible.
- environment: Sets the ENVIRONMENT variable to test. This can be used within the application code (e.g., in config.py) to adjust settings specifically for the test environment, such as logging levels or database connections. The PYTHONPATH=/app setting ensures that the test runner can find the application code.
- entrypoint: Defines the entry point for the container as python -m pytest, which runs the pytest test runner.
- command: Specifies the command-line arguments for pytest. -v enables verbose output, and --capture=no disables output capturing, making it easier to see print statements during test execution. /app/tests/test_api.py specifies the test file to run. You can modify this to run specific tests or directories within the tests folder.
- depends_on: Ensures that the markitdown service (the main application) is started before the tests run. This is crucial for integration tests that interact with the API endpoints. Note that in this form depends_on waits only for the container to start, not for the application to be ready to accept requests.
- profiles: Uses Docker Compose profiles to include this service only when running tests. This prevents the test container from starting when simply running docker compose up. To run the tests, you'd use sudo docker-compose --profile test run --rm test, where the --rm option removes the container after the tests have run.
The tests cover a wide range of scenarios:
- File Conversion Tests: Verify correct conversion of different file types. Read the test to convert a Word Document
- URL Conversion Tests: Test fetching and conversion of web pages, including specialized tests for Wikipedia URLs. Read the test to convert a BBC News article
- Error Handling Tests: Ensure robust handling of invalid input and edge cases. Read the test for error handling
- Rate Limiting Tests: Validate correct enforcement of rate limits. Read the test for rate limiting
The test setup uses fixtures defined in conftest.py for efficient test setup and teardown. Read the source for conftest.py.
Testing with GitHub Actions
GitHub Actions automates the execution of these tests on every push and pull request, maintaining continuous integration. Link to GitHub Actions workflow
Next Steps
The immediate next step is integrating MarkItLikeItsHot into Willow CMS, providing a user-friendly interface for converting uploads, links, and text directly into Markdown drafts for blog posts and pages. Further features will include:
- API key support to secure the service
- Exploring and exposing additional MarkItDown features
- Adding support for custom conversion options (via the currently unused options field)
- Implementing more sophisticated error handling
- Exploring asynchronous processing for improved performance with very large files
So, check it out. The code is on GitHub with a good readme to help you start using it for your own projects.