Merge branch 'main' into feature/fix-code-comments
gagb authored Dec 18, 2024
2 parents 3622143 + b029ae1 commit 09cb048
Showing 3 changed files with 74 additions and 62 deletions.
29 changes: 29 additions & 0 deletions .devcontainer/devcontainer.json
@@ -0,0 +1,29 @@
// For format details, see https://aka.ms/devcontainer.json. For config options, see the
// README at: https://github.com/devcontainers/templates/tree/main/src/docker-existing-dockerfile
{
"name": "Existing Dockerfile",
"build": {
// Sets the run context to one level up instead of the .devcontainer folder.
"context": "..",
// Update the 'dockerfile' property if you aren't using the standard 'Dockerfile' filename.
"dockerfile": "../Dockerfile"
},

// Features to add to the dev container. More info: https://containers.dev/features.
// "features": {},
"features": {
"ghcr.io/devcontainers-extra/features/hatch:2": {}
},

// Use 'forwardPorts' to make a list of ports inside the container available locally.
// "forwardPorts": [],

// Uncomment the next line to run commands after the container is created.
// "postCreateCommand": "cat /etc/os-release",

// Configure tool-specific properties.
// "customizations": {},

// Uncomment to connect as an existing user other than the container default. More info: https://aka.ms/dev-containers-non-root.
"remoteUser": "root"
}
6 changes: 4 additions & 2 deletions Dockerfile
@@ -1,9 +1,11 @@
FROM python:3.13-alpine
FROM python:3.13-slim-bullseye

USER root

# Runtime dependency
RUN apk add --no-cache ffmpeg
RUN apt-get update && apt-get install -y --no-install-recommends \
ffmpeg \
&& rm -rf /var/lib/apt/lists/*

RUN pip install markitdown

101 changes: 41 additions & 60 deletions README.md
@@ -2,65 +2,47 @@

[![PyPI](https://img.shields.io/pypi/v/markitdown.svg)](https://pypi.org/project/markitdown/)

The MarkItDown library is a utility tool for converting various files to Markdown (e.g., for indexing, text analysis, etc.)
MarkItDown is a utility for converting various files to Markdown (e.g., for indexing or text analysis).
It supports:
- PDF
- PowerPoint
- Word
- Excel
- Images (EXIF metadata and OCR)
- Audio (EXIF metadata and speech transcription)
- HTML
- Text-based formats (CSV, JSON, XML)
- ZIP files (iterates over contents)
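
As a rough illustration of what "iterates over contents" can mean for a ZIP archive, here is a minimal sketch using only Python's standard `zipfile` module. It is not MarkItDown's implementation; the helper name `iter_zip_members` is hypothetical, and a real converter would hand each member to the matching format handler instead of just yielding its bytes.

```python
import io
import zipfile


def iter_zip_members(zip_bytes):
    """Yield (name, data) for each regular file in an in-memory ZIP archive.

    Hypothetical helper for illustration only; not part of MarkItDown.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no file content
            yield info.filename, archive.read(info)
```

Each yielded pair could then be dispatched on the file extension, which is one plausible way a per-file conversion step might be organized.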

It presently supports:
To install MarkItDown, use pip: `pip install markitdown`. Alternatively, you can install it from the source: `pip install -e .`

- PDF (.pdf)
- PowerPoint (.pptx)
- Word (.docx)
- Excel (.xlsx)
- Images (EXIF metadata, and OCR)
- Audio (EXIF metadata, and speech transcription)
- HTML (special handling of Wikipedia, etc.)
- Various other text-based formats (csv, json, xml, etc.)
- ZIP (Iterates over contents and converts each file)
## Usage

# Installation
### Command-Line

You can install `markitdown` using pip:

```python
pip install markitdown
```bash
markitdown path-to-file.pdf > document.md
```

or from the source
You can also pipe content:

```sh
pip install -e .
```bash
cat path-to-file.pdf | markitdown
```

# Usage
The API is simple:
### Python API

Basic usage in Python:

```python
from markitdown import MarkItDown

markitdown = MarkItDown()
result = markitdown.convert("test.xlsx")
md = MarkItDown()
result = md.convert("test.xlsx")
print(result.text_content)
```
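
The same two calls (`convert` and `text_content`) can be wrapped to process a whole folder. The sketch below is an assumption-laden example, not part of the project: `convert_directory` and its parameters are hypothetical names, and the converter is passed in so that any object with a `convert(path)` method returning a result with `text_content` (such as a `MarkItDown` instance) can be used.

```python
from pathlib import Path


def convert_directory(converter, src_dir, out_dir, pattern="*.xlsx"):
    """Convert each file in src_dir matching pattern and write <stem>.md to out_dir.

    Hypothetical helper: `converter` only needs a convert(path) method whose
    result exposes a `text_content` string, as MarkItDown does in the README.
    """
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.glob(pattern)):
        if not path.is_file():
            continue  # skip directories that happen to match the pattern
        result = converter.convert(str(path))
        target = out / (path.stem + ".md")
        target.write_text(result.text_content, encoding="utf-8")
        written.append(target)
    return written
```

With the package installed, a call such as `convert_directory(MarkItDown(), "docs", "markdown")` would write one Markdown file per spreadsheet; the folder names here are placeholders.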

To use this as a command-line utility, install it and then run it like this:

```bash
markitdown path-to-file.pdf
```

This will output Markdown to standard output. You can save it like this:

```bash
markitdown path-to-file.pdf > document.md
```

You can pipe content to standard input by omitting the argument:

```bash
cat path-to-file.pdf | markitdown
```

You can also configure markitdown to use Large Language Models to describe images. To do so you must provide `llm_client` and `llm_model` parameters to MarkItDown object, according to your specific client.

To use Large Language Models for image descriptions, provide `llm_client` and `llm_model`:

```python
from markitdown import MarkItDown
@@ -72,7 +54,7 @@ result = md.convert("example.jpg")
print(result.text_content)
```

You can also use the project as Docker Image:
### Docker

```sh
docker build -t markitdown:latest .
@@ -93,28 +75,27 @@ This project has adopted the [Microsoft Open Source Code of Conduct](https://ope
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

### Running Tests

To run tests, install `hatch` using `pip` or other methods as described [here](https://hatch.pypa.io/dev/install).
### Running Tests and Checks

```sh
pip install hatch
hatch shell
hatch test
```
- Install `hatch` in your environment and run tests:
```sh
pip install hatch # Other ways of installing hatch: https://hatch.pypa.io/dev/install/
hatch shell
hatch test
```

### Running Pre-commit Checks
(Alternative) Use the Devcontainer which has all the dependencies installed:
```sh
# Reopen the project in Devcontainer and run:
hatch test
```

Please run the pre-commit checks before submitting a PR.

```sh
pre-commit run --all-files
```
- Run pre-commit checks before submitting a PR: `pre-commit run --all-files`

## Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos is subject to and must follow
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos is subject to and must follow
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos are subject to those third-party's policies.
