diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md
new file mode 100644
index 0000000..c32e5e1
--- /dev/null
+++ b/.github/CONTRIBUTING.md
@@ -0,0 +1,76 @@
+# Contributing to WPextract
+
+We welcome contributions to WPextract! Here are some helpful guidelines.
+
+## Get Started
+
+To install WPextract for development, you'll need [Poetry](https://python-poetry.org/) to automatically manage the virtual environment.
+
+Fork and clone the repo, then run `poetry install` (or `poetry install --with docs` if you'd like to build the documentation locally). This will install all the dependencies and the package itself as an editable install.
+
+It's best practice to make an issue before a PR so any necessary discussions can be had.
+
+## Testing
+
+Tests for WPextract are written with pytest. Some tips for writing tests:
+
+* approximately follow the module structure in the tests directory
+* use [pytest-datadir](https://pypi.org/project/pytest-datadir/) to handle disk usage
+* make sure to properly mock parts of the code which make HTTP requests (see tests of the `download` module for help)
+
+To run tests, use:
+
+```shell-session
+# Just run tests
+$ make testonly
+# Run tests and open coverage HTML
+$ make test
+```
+
+## Linting
+
+We use [Ruff](https://docs.astral.sh/ruff/) to lint WPextract. This happens in two stages, which can be easily run with Make tasks:
+
+```shell-session
+# Reformat code
+$ make format
+# Find problems, autofixing if possible
+$ make lint
+```
+
+Both library code and tests are linted (although tests are slightly less restrictive, see `pyproject.toml`).
+
+## Branch Management
+
+Generally your contribution should be made to the `dev` branch. We will then merge it into `main` only when it's time to release.
+
+The exception to this is for documentation, where changes should be applied directly to `main` if they are corrections of the current documentation version (but still `dev` if they relate to upcoming changes).
+
+## Documentation
+
+Documentation for WPextract is built with MkDocs and Read the Docs.
+
+To build documentation locally (ensuring that the project was installed with the `--with docs` flag), run:
+
+```shell-session
+$ make docdev
+```
+
+When a PR is created, Read the Docs will build a preview version. If this isn't left as a comment on the PR, check [the dashboard here](https://readthedocs.org/projects/wpextract/builds/).
+
+Documentation is hosted at:
+
+- The [latest](https://wpextract.readthedocs.io/en/latest/) version is built from `main`
+- The [unstable next release](https://wpextract.readthedocs.io/en/dev/) is built from `dev`
+
+We use the `latest` version built from `main` as the public documentation, as this allows fixes to the live docs to be made without having to create a new release.
+
+The following parts of the documentation may require manual updates along with your changes:
+
+- the API reference documents a manually-selected set of classes, which cover the two high-level functionality classes and any necessary classes (or types) required to use them.
+- the CLI usage docs are manually written, generally copying the help messages but sometimes more detailed.
+- if changing the LangPicker base class, the examples of how to write language pickers [here](https://wpextract.readthedocs.io/en/latest/advanced/multilingual/).
+
+## Releasing
+
+To make a new release, we merge `dev` to `main` then tag the commit. This automatically triggers a workflow to publish to PyPI. After a while, the [conda-forge feedstock](https://github.com/conda-forge/wpextract-feedstock) will automatically receive a PR to update the version.
\ No newline at end of file
diff --git a/docs/advanced/multilingual.md b/docs/advanced/multilingual.md
index dd021d5..af901c2 100644
--- a/docs/advanced/multilingual.md
+++ b/docs/advanced/multilingual.md
@@ -1,6 +1,6 @@
 # Multilingual Sites
 
-If sites publish in multiple languages and use a plugin to present a list of language versions, wpextract can parse this and add multilingual data in the output dataset.
+If sites publish in multiple languages and use a plugin to present a list of language versions, wpextract can parse this and add links between translated versions in the output dataset.
 
 ## Extraction Process
 
diff --git a/docs/changelog.md b/docs/changelog.md
index 8e94b93..5c03422 100644
--- a/docs/changelog.md
+++ b/docs/changelog.md
@@ -1,5 +1,20 @@
 # Changelog
 
+## 1.0.3 (2024-08-06)
+
+**Changes**
+
+- Added missing `wpextract.__version__` attribute (!36)
+- Added `<table>`s to the elements to be ignored when extracting article text (!40)
+
+**Fixes**
+
+- Fixed incorrect behaviour extracting article text where only the first element to ignore (e.g. `figcaption`) would be ignored (!40)
+
+**Documentation**
+
+- Added proper references to the documentation of the [`langcodes`](https://github.com/georgkrause/langcodes) library (!38)
+
 ## 1.0.2 (2024-07-12)
 
 - Fixed not explicitly declaring dependency on `urllib3` (!32)
diff --git a/docs/intro/install.md b/docs/intro/install.md
index 4d796b2..0d88157 100644
--- a/docs/intro/install.md
+++ b/docs/intro/install.md
@@ -73,12 +73,8 @@ or through importing as a library:
 
 ```pycon
 >>> import wpextract
->>> help(wpextract)
-Help on package wpextract:
-
-NAME
-    wpextract
-# etc...
+>>> wpextract.__version__
+'1.0.0'
 ```
 
 ## For Development
diff --git a/docs/intro/start.md b/docs/intro/start.md
index 7fd06af..f5c1dd7 100644
--- a/docs/intro/start.md
+++ b/docs/intro/start.md
@@ -25,9 +25,7 @@ $ pipx install wpextract
 WPextract works in two steps:
 
 1.
The **downloader** uses the WordPress REST API to obtain all content on the site, which is stored as a single, long file
-2. The **extractor** converts this into a usable dataset by enriching the downloaded content. This includes extracting text, images, resolving links to posts/pages, and finding translated versions[^lang]
-
-[^lang]: {-} See the specific guide for more on multilingual extraction.
+2. The **extractor** converts this into a usable dataset by enriching the downloaded content. This includes extracting text, images, resolving links to posts/pages, and finding translated versions ([more on translation extraction](../advanced/multilingual.md))
 
 We call these two stages using two CLI commands ([`wpextract download`](../usage/download.md#command-usage) and [`wpextract extract`](../usage/extract.md#command-usage)). Alternatively, WPExtract can be integrated into a project by [using it as a library](../advanced/library.md).
 
diff --git a/pyproject.toml b/pyproject.toml
index b3a41e0..cb2e7c0 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -1,6 +1,6 @@
 [tool.poetry]
 name="wpextract"
-version="1.0.2"
+version="1.0.3"
 description="Create datasets from WordPress sites"
 homepage="https://wpextract.readthedocs.io/"
 documentation="https://wpextract.readthedocs.io/"
@@ -92,12 +92,6 @@ ignore = [
     "D103", # Ignore method docstring errors in tests
     "PD901", # Allow `df` variable name in tests
 ]
-#"src/wpextract/download/*" = [
-#    "D415",
-#    "D103",
-#    "D101",
-#    "D107"
-#]
 
 [tool.ruff.lint.pydocstyle]
 convention = "google"
diff --git a/src/wpextract/__init__.py b/src/wpextract/__init__.py
index 114cd96..19c1d87 100644
--- a/src/wpextract/__init__.py
+++ b/src/wpextract/__init__.py
@@ -1,3 +1,7 @@
+from importlib.metadata import version
+
 from wpextract.downloader import WPDownloader as WPDownloader
 
 from .extract import WPExtractor as WPExtractor
+
+__version__ = version("wpextract")
diff --git a/src/wpextract/parse/content.py b/src/wpextract/parse/content.py
index
5689c3f..1eb4348 100644
--- a/src/wpextract/parse/content.py
+++ b/src/wpextract/parse/content.py
@@ -10,7 +10,7 @@ from wpextract.extractors.media import get_caption
 from wpextract.util.str import squash_whitespace
 
 
-EXCLUDED_CONTENT_TAGS = {"figcaption"}
+EXCLUDED_CONTENT_TAGS = {"figcaption", "table"}
 
 NEWLINE_TAGS = {"br", "p"}
 
@@ -136,7 +136,7 @@ def extract_content_data(doc: BeautifulSoup, self_link: str) -> pd.Series:
     """Extract the links, embeds, images and text content of the document.
 
     Args:
-        doc: A parsed document.
+        doc: A parsed document body.
         self_link: The URL of the page.
 
     Returns:
@@ -147,12 +147,12 @@ def extract_content_data(doc: BeautifulSoup, self_link: str) -> pd.Series:
     images = extract_images(doc, self_link)
 
     doc_c = copy.copy(doc)
-    for child in doc_c.descendants:
-        if type(child) == NavigableString:
+    for child in list(doc_c.descendants):
+        if child.decomposed or type(child) == NavigableString:
             continue
 
         if child.name in EXCLUDED_CONTENT_TAGS:
-            child.extract()
+            child.decompose()
 
     content_text = squash_whitespace(_get_text(doc_c))
 
diff --git a/src/wpextract/parse/translations/_resolver.py b/src/wpextract/parse/translations/_resolver.py
index 72825f0..66b2b62 100644
--- a/src/wpextract/parse/translations/_resolver.py
+++ b/src/wpextract/parse/translations/_resolver.py
@@ -14,5 +14,9 @@ class TranslationLink(ResolvableLink):
 
     @property
     def language(self) -> Language:
-        """Parsed and normalized language. Populated automatically post-init."""
+        """Parsed and normalized language. Populated automatically post-init.
+
+        See Also:
+            [`langcodes` documentation](https://github.com/georgkrause/langcodes?tab=readme-ov-file#language-objects)
+        """
         return Language.get(self.lang, normalize=True)
diff --git a/tests/parse/test_content.py b/tests/parse/test_content.py
index 113ab02..2e4d370 100644
--- a/tests/parse/test_content.py
+++ b/tests/parse/test_content.py
@@ -82,7 +82,6 @@ def test_extract_image_without_src(datadir: Path):
 
 def test_extract_content(datadir: Path):
     doc = BeautifulSoup((datadir / "content_extraction.html").read_text(), "lxml")
-
     content_series = extract_content_data(doc, "https://example.org/home")
 
     text = content_series[0]
diff --git a/tests/parse/test_content/content_extraction.html b/tests/parse/test_content/content_extraction.html
index b64fd3c..6a0457d 100644
--- a/tests/parse/test_content/content_extraction.html
+++ b/tests/parse/test_content/content_extraction.html
@@ -1,8 +1,18 @@
+

The first paragraph.

Some alt text
A caption
+
+ Some alt text +
A second caption
+
+
+ + + +
This is inside a table cell

The second paragraph.

The third paragraph.

Not in a paragraph.
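
Note on the `__version__` change in `src/wpextract/__init__.py`: `importlib.metadata.version` raises `PackageNotFoundError` when no installed distribution matches the name, which could happen if the package is imported from a source checkout without being installed. The sketch below shows one possible way to guard against that. The helper name `get_version` and the fallback string `"0+unknown"` are assumptions made for illustration and are not part of this diff.

```python
from importlib.metadata import PackageNotFoundError, version


def get_version(dist_name: str, default: str = "0+unknown") -> str:
    """Return the installed version of ``dist_name``, or ``default`` if not installed.

    Hypothetical helper: the diff above calls ``version("wpextract")`` directly.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # No installed distribution metadata matches this name,
        # e.g. the code is imported from an uninstalled source tree.
        return default


if __name__ == "__main__":
    # Prints the installed version, or the fallback if wpextract is absent.
    print(get_version("wpextract"))
```

Whether a fallback is desirable is a design choice: failing loudly on import, as the diff does, makes a broken install obvious, while a fallback keeps the package importable in unusual environments.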