Merge branch 'main' into feature/fix-code-comments
Showing 3 changed files with 74 additions and 62 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
// For format details, see https://aka.ms/devcontainer.json. For config options, see the | ||
// README at: https://github.com/devcontainers/templates/tree/main/src/docker-existing-dockerfile | ||
{ | ||
"name": "Existing Dockerfile", | ||
"build": { | ||
// Sets the run context to one level up instead of the .devcontainer folder. | ||
"context": "..", | ||
// Update the 'dockerFile' property if you aren't using the standard 'Dockerfile' filename. | ||
"dockerfile": "../Dockerfile" | ||
}, | ||
|
||
// Features to add to the dev container. More info: https://containers.dev/features. | ||
// "features": {}, | ||
"features": { | ||
"ghcr.io/devcontainers-extra/features/hatch:2": {} | ||
}, | ||
|
||
// Use 'forwardPorts' to make a list of ports inside the container available locally. | ||
// "forwardPorts": [], | ||
|
||
// Uncomment the next line to run commands after the container is created. | ||
// "postCreateCommand": "cat /etc/os-release", | ||
|
||
// Configure tool-specific properties. | ||
// "customizations": {}, | ||
|
||
// Uncomment to connect as an existing user other than the container default. More info: https://aka.ms/dev-containers-non-root. | ||
"remoteUser": "root" | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,65 +2,47 @@ | |
|
||
[![PyPI](https://img.shields.io/pypi/v/markitdown.svg)](https://pypi.org/project/markitdown/) | ||
|
||
The MarkItDown library is a utility tool for converting various files to Markdown (e.g., for indexing, text analysis, etc.) | ||
MarkItDown is a utility for converting various files to Markdown (e.g., for indexing, text analysis, etc). | ||
It supports: | ||
- PowerPoint | ||
- Word | ||
- Excel | ||
- Images (EXIF metadata and OCR) | ||
- Audio (EXIF metadata and speech transcription) | ||
- HTML | ||
- Text-based formats (CSV, JSON, XML) | ||
- ZIP files (iterates over contents) | ||
|
||
It presently supports: | ||
To install MarkItDown, use pip: `pip install markitdown`. Alternatively, you can install it from the source: `pip install -e .` | ||
|
||
- PDF (.pdf) | ||
- PowerPoint (.pptx) | ||
- Word (.docx) | ||
- Excel (.xlsx) | ||
- Images (EXIF metadata, and OCR) | ||
- Audio (EXIF metadata, and speech transcription) | ||
- HTML (special handling of Wikipedia, etc.) | ||
- Various other text-based formats (csv, json, xml, etc.) | ||
- ZIP (Iterates over contents and converts each file) | ||
## Usage | ||
|
||
# Installation | ||
### Command-Line | ||
|
||
You can install `markitdown` using pip: | ||
|
||
```python | ||
pip install markitdown | ||
```bash | ||
markitdown path-to-file.pdf > document.md | ||
``` | ||
|
||
or from the source | ||
You can also pipe content: | ||
|
||
```sh | ||
pip install -e . | ||
```bash | ||
cat path-to-file.pdf | markitdown | ||
``` | ||
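Editor's sketch (not part of the commit): the piping form above can also be driven from Python by sending the file's bytes to the CLI's standard input. The helper name `pipe_to_markitdown` and its `command` parameter are hypothetical; the sketch assumes only that a `markitdown` executable is on `PATH` and that it reads from standard input when no argument is given, as the README describes.

```python
import subprocess
from typing import Sequence


def pipe_to_markitdown(data: bytes, command: Sequence[str] = ("markitdown",)) -> str:
    """Send raw file bytes to the command's stdin and return its stdout as text.

    `command` is a hypothetical parameter; it defaults to the `markitdown` CLI
    but can be swapped for any filter program when testing.
    """
    completed = subprocess.run(
        list(command),
        input=data,           # bytes written to the child's standard input
        capture_output=True,  # collect stdout and stderr instead of printing them
        check=True,           # raise CalledProcessError on a non-zero exit status
    )
    return completed.stdout.decode("utf-8")
```

With the CLI installed, `pipe_to_markitdown(open("path-to-file.pdf", "rb").read())` would play the role of `cat path-to-file.pdf | markitdown`.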
|
||
# Usage | ||
The API is simple: | ||
### Python API | ||
|
||
Basic usage in Python: | ||
|
||
```python | ||
from markitdown import MarkItDown | ||
|
||
markitdown = MarkItDown() | ||
result = markitdown.convert("test.xlsx") | ||
md = MarkItDown() | ||
result = md.convert("test.xlsx") | ||
print(result.text_content) | ||
``` | ||
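Editor's sketch (not part of the commit): the same `convert` call can be looped over several files. The helper `convert_many` and its `convert` parameter are hypothetical names introduced for illustration; the only library behavior assumed is what the README shows, namely that `MarkItDown().convert(path)` returns an object with a `text_content` attribute.

```python
from typing import Callable, Dict, Iterable, Optional


def convert_many(
    paths: Iterable[str],
    convert: Optional[Callable[[str], object]] = None,
) -> Dict[str, str]:
    """Convert each path and return a mapping of path to Markdown text."""
    if convert is None:
        # Imported lazily so the helper can also be exercised with a stub converter.
        from markitdown import MarkItDown

        convert = MarkItDown().convert

    results: Dict[str, str] = {}
    for path in paths:
        result = convert(str(path))
        # The README example reads the Markdown from `result.text_content`.
        results[str(path)] = result.text_content
    return results
```

Calling `convert_many(["test.xlsx"])` with the package installed would return a dictionary keyed by file path. Passing a custom `convert` callable makes the loop testable without the library.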
|
||
To use this as a command-line utility, install it and then run it like this: | ||
|
||
```bash | ||
markitdown path-to-file.pdf | ||
``` | ||
|
||
This will output Markdown to standard output. You can save it like this: | ||
|
||
```bash | ||
markitdown path-to-file.pdf > document.md | ||
``` | ||
|
||
You can pipe content to standard input by omitting the argument: | ||
|
||
```bash | ||
cat path-to-file.pdf | markitdown | ||
``` | ||
|
||
You can also configure markitdown to use Large Language Models to describe images. To do so you must provide `llm_client` and `llm_model` parameters to MarkItDown object, according to your specific client. | ||
|
||
To use Large Language Models for image descriptions, provide `llm_client` and `llm_model`: | ||
|
||
```python | ||
from markitdown import MarkItDown | ||
|
@@ -72,7 +54,7 @@ result = md.convert("example.jpg") | |
print(result.text_content) | ||
``` | ||
|
||
You can also use the project as Docker Image: | ||
### Docker | ||
|
||
```sh | ||
docker build -t markitdown:latest . | ||
|
@@ -93,28 +75,27 @@ This project has adopted the [Microsoft Open Source Code of Conduct](https://ope | |
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or | ||
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments. | ||
|
||
### Running Tests | ||
|
||
To run tests, install `hatch` using `pip` or other methods as described [here](https://hatch.pypa.io/dev/install). | ||
### Running Tests and Checks | ||
|
||
```sh | ||
pip install hatch | ||
hatch shell | ||
hatch test | ||
``` | ||
- Install `hatch` in your environment and run tests: | ||
```sh | ||
pip install hatch # Other ways of installing hatch: https://hatch.pypa.io/dev/install/ | ||
hatch shell | ||
hatch test | ||
``` | ||
|
||
### Running Pre-commit Checks | ||
(Alternative) Use the Devcontainer which has all the dependencies installed: | ||
```sh | ||
# Reopen the project in Devcontainer and run: | ||
hatch test | ||
``` | ||
|
||
Please run the pre-commit checks before submitting a PR. | ||
|
||
```sh | ||
pre-commit run --all-files | ||
``` | ||
- Run pre-commit checks before submitting a PR: `pre-commit run --all-files` | ||
|
||
## Trademarks | ||
|
||
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft | ||
trademarks or logos is subject to and must follow | ||
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft | ||
trademarks or logos is subject to and must follow | ||
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general). | ||
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. | ||
Any use of third-party trademarks or logos are subject to those third-party's policies. |