diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md
new file mode 100644
index 0000000..c32e5e1
--- /dev/null
+++ b/.github/CONTRIBUTING.md
@@ -0,0 +1,76 @@
+# Contributing to WPextract
+
+We welcome contributions to WPextract! Here are some helpful guidelines.
+
+## Get Started
+
+To install WPextract for development, you'll need [Poetry](https://python-poetry.org/) to automatically manage the virtual environment.
+
+Fork and clone the repo, then run `poetry install` (or `poetry install --with docs` if you'd like to build the documentation locally). This installs all the dependencies, plus the package itself in editable mode.
+
+It's best practice to open an issue before submitting a PR so that any necessary discussion can happen first.
+
+## Testing
+
+Tests for WPextract are written with pytest. Some tips for writing tests:
+
+* approximately follow the source module structure in the `tests` directory
+* use [pytest-datadir](https://pypi.org/project/pytest-datadir/) to read test data files from disk (a minimal sketch follows this list)
+* make sure to properly mock any parts of the code which make HTTP requests (see the tests of the `download` module for examples)
+
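+For example, a minimal sketch of a test using the `datadir` fixture (the test name, fixture file and assertion here are purely illustrative):
+
+```python
+from pathlib import Path
+
+from bs4 import BeautifulSoup
+
+from wpextract.parse.content import extract_content_data
+
+
+def test_my_new_behaviour(datadir: Path):
+    # `datadir` is a per-test copy of the data directory named after the test
+    # module, e.g. tests/parse/test_content/ for tests/parse/test_content.py
+    doc = BeautifulSoup((datadir / "my_fixture.html").read_text(), "lxml")
+
+    content = extract_content_data(doc, "https://example.org/home")
+
+    assert content is not None  # replace with real assertions on the returned series
+```
+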
+To run tests, use:
+
+```shell-session
+# Just run tests
+$ make testonly
+# Run tests and open coverage HTML
+$ make test
+```
+
+## Linting
+
+We use [Ruff](https://docs.astral.sh/ruff/) to lint WPextract. This happens in two stages, which can be easily run with Make tasks:
+
+```shell-session
+# Reformat code
+$ make format
+# Find problems, autofixing if possible
+$ make lint
+```
+
+Both library code and tests are linted (although the rules for tests are slightly less restrictive; see `pyproject.toml`).
+
+## Branch Management
+
+Generally your contribution should be made to the `dev` branch. We will then merge it into `main` only when it's time to release.
+
+The exception is documentation: changes should be applied directly to `main` if they correct the current documentation version (but should still go to `dev` if they relate to upcoming changes).
+
+## Documentation
+
+Documentation for WPextract is built with MkDocs and published on Read the Docs.
+
+To build the documentation locally (ensure the project was installed with the `--with docs` flag), run:
+
+```shell-session
+$ make docdev
+```
+
+When a PR is created, Read the Docs will build a preview version. If a link to it isn't left as a comment on the PR, check [the builds dashboard](https://readthedocs.org/projects/wpextract/builds/).
+
+Documentation is hosted at:
+
+- The [latest](https://wpextract.readthedocs.io/en/latest/) version is built from `main`
+- The [unstable next release](https://wpextract.readthedocs.io/en/dev/) is built from `dev`
+
+We use the `latest` version built from `main` as the public documentation, as this allows fixes to the live docs to be made without having to create a new release.
+
+The following parts of the documentation may require manual updates along with your changes:
+
+- the API reference documents a manually-selected set of classes: the two high-level functionality classes and any other classes (or types) required to use them.
+- the CLI usage docs are written manually; they generally mirror the help messages but are sometimes more detailed.
+- if changing the `LangPicker` base class, update the examples of how to write language pickers [here](https://wpextract.readthedocs.io/en/latest/advanced/multilingual/).
+
+## Releasing
+
+To make a new release, we merge `dev` into `main` and then tag the commit. This automatically triggers a workflow to publish to PyPI. After a while, the [conda-forge feedstock](https://github.com/conda-forge/wpextract-feedstock) will automatically receive a PR to update the version.
\ No newline at end of file
diff --git a/docs/advanced/multilingual.md b/docs/advanced/multilingual.md
index dd021d5..af901c2 100644
--- a/docs/advanced/multilingual.md
+++ b/docs/advanced/multilingual.md
@@ -1,6 +1,6 @@
# Multilingual Sites
-If sites publish in multiple languages and use a plugin to present a list of language versions, wpextract can parse this and add multilingual data in the output dataset.
+If sites publish in multiple languages and use a plugin to present a list of language versions, wpextract can parse this and add links between translated versions in the output dataset.
## Extraction Process
diff --git a/docs/changelog.md b/docs/changelog.md
index 8e94b93..5c03422 100644
--- a/docs/changelog.md
+++ b/docs/changelog.md
@@ -1,5 +1,20 @@
# Changelog
+## 1.0.3 (2024-08-06)
+
+**Changes**
+
+- Added missing `wpextract.__version__` attribute (!36)
+- Added `<table>`s to the elements to be ignored when extracting article text (!40)
+
+**Fixes**
+
+- Fixed incorrect behaviour extracting article text where only the first element to ignore (e.g. `figcaption`) would be ignored (!40)
+
+**Documentation**
+
+- Added proper references to the documentation of the [`langcodes`](https://github.com/georgkrause/langcodes) library (!38)
+
## 1.0.2 (2024-07-12)
- Fixed not explicitly declaring dependency on `urllib3` (!32)
diff --git a/docs/intro/install.md b/docs/intro/install.md
index 4d796b2..0d88157 100644
--- a/docs/intro/install.md
+++ b/docs/intro/install.md
@@ -73,12 +73,8 @@ or through importing as a library:
```pycon
>>> import wpextract
->>> help(wpextract)
-Help on package wpextract:
-
-NAME
- wpextract
-# etc...
+>>> wpextract.__version__
+'1.0.0'
```
## For Development
diff --git a/docs/intro/start.md b/docs/intro/start.md
index 7fd06af..f5c1dd7 100644
--- a/docs/intro/start.md
+++ b/docs/intro/start.md
@@ -25,9 +25,7 @@ $ pipx install wpextract
WPextract works in two steps:
1. The **downloader** uses the WordPress REST API to obtain all content on the site, which is stored as a single, long file
-2. The **extractor** converts this into a usable dataset by enriching the downloaded content. This includes extracting text, images, resolving links to posts/pages, and finding translated versions[^lang]
-
-[^lang]: {-} See the specific guide for more on multilingual extraction.
+2. The **extractor** converts this into a usable dataset by enriching the downloaded content. This includes extracting text and images, resolving links to posts/pages, and finding translated versions ([more on translation extraction](../advanced/multilingual.md)).
We call these two stages using two CLI commands ([`wpextract download`](../usage/download.md#command-usage) and [`wpextract extract`](../usage/extract.md#command-usage)). Alternatively, WPExtract can be integrated into a project by [using it as a library](../advanced/library.md).
diff --git a/pyproject.toml b/pyproject.toml
index b3a41e0..cb2e7c0 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name="wpextract"
-version="1.0.2"
+version="1.0.3"
description="Create datasets from WordPress sites"
homepage="https://wpextract.readthedocs.io/"
documentation="https://wpextract.readthedocs.io/"
@@ -92,12 +92,6 @@ ignore = [
"D103", # Ignore method docstring errors in tests
"PD901", # Allow `df` variable name in tests
]
-#"src/wpextract/download/*" = [
-# "D415",
-# "D103",
-# "D101",
-# "D107"
-#]
[tool.ruff.lint.pydocstyle]
convention = "google"
diff --git a/src/wpextract/__init__.py b/src/wpextract/__init__.py
index 114cd96..19c1d87 100644
--- a/src/wpextract/__init__.py
+++ b/src/wpextract/__init__.py
@@ -1,3 +1,7 @@
+from importlib.metadata import version
+
from wpextract.downloader import WPDownloader as WPDownloader
from .extract import WPExtractor as WPExtractor
+
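+# Single-source the version from the installed package metadata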
+__version__ = version("wpextract")
diff --git a/src/wpextract/parse/content.py b/src/wpextract/parse/content.py
index 5689c3f..1eb4348 100644
--- a/src/wpextract/parse/content.py
+++ b/src/wpextract/parse/content.py
@@ -10,7 +10,7 @@
from wpextract.extractors.media import get_caption
from wpextract.util.str import squash_whitespace
-EXCLUDED_CONTENT_TAGS = {"figcaption"}
+EXCLUDED_CONTENT_TAGS = {"figcaption", "table"}
NEWLINE_TAGS = {"br", "p"}
@@ -136,7 +136,7 @@ def extract_content_data(doc: BeautifulSoup, self_link: str) -> pd.Series:
"""Extract the links, embeds, images and text content of the document.
Args:
- doc: A parsed document.
+ doc: A parsed document body.
self_link: The URL of the page.
Returns:
@@ -147,12 +147,12 @@ def extract_content_data(doc: BeautifulSoup, self_link: str) -> pd.Series:
images = extract_images(doc, self_link)
doc_c = copy.copy(doc)
- for child in doc_c.descendants:
- if type(child) == NavigableString:
+ for child in list(doc_c.descendants):
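+        # Skip text nodes and elements already removed by an earlier decompose() call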
+ if child.decomposed or type(child) == NavigableString:
continue
if child.name in EXCLUDED_CONTENT_TAGS:
- child.extract()
+ child.decompose()
content_text = squash_whitespace(_get_text(doc_c))
diff --git a/src/wpextract/parse/translations/_resolver.py b/src/wpextract/parse/translations/_resolver.py
index 72825f0..66b2b62 100644
--- a/src/wpextract/parse/translations/_resolver.py
+++ b/src/wpextract/parse/translations/_resolver.py
@@ -14,5 +14,9 @@ class TranslationLink(ResolvableLink):
@property
def language(self) -> Language:
- """Parsed and normalized language. Populated automatically post-init."""
+ """Parsed and normalized language. Populated automatically post-init.
+
+ See Also:
+ [`langcodes` documentation](https://github.com/georgkrause/langcodes?tab=readme-ov-file#language-objects)
+ """
return Language.get(self.lang, normalize=True)
diff --git a/tests/parse/test_content.py b/tests/parse/test_content.py
index 113ab02..2e4d370 100644
--- a/tests/parse/test_content.py
+++ b/tests/parse/test_content.py
@@ -82,7 +82,6 @@ def test_extract_image_without_src(datadir: Path):
def test_extract_content(datadir: Path):
doc = BeautifulSoup((datadir / "content_extraction.html").read_text(), "lxml")
-
content_series = extract_content_data(doc, "https://example.org/home")
text = content_series[0]
diff --git a/tests/parse/test_content/content_extraction.html b/tests/parse/test_content/content_extraction.html
index b64fd3c..6a0457d 100644
--- a/tests/parse/test_content/content_extraction.html
+++ b/tests/parse/test_content/content_extraction.html
@@ -1,8 +1,18 @@
+