Release 1.0.3 (#42)
* Add package version attribute (#36)

* Add version attribute to package

* Revert "Hotfix: remove usage of __version__ in docs (#35)"

This reverts commit 641375a.

* add contributing guidelines (#37)

* Add ref to langcodes docs (#38)

* add manual ref to Language class

* fix footnote in start

* make opening to multilingual docs clearer

* Fix element exclusion in text extraction (#40)

* Prepare 1.0.3 release (#41)

* prepare 1.0.3

* fix changelog sections
freddyheppell authored Aug 6, 2024
1 parent 641375a commit aa8f449
Showing 11 changed files with 120 additions and 24 deletions.
76 changes: 76 additions & 0 deletions .github/CONTRIBUTING.md
@@ -0,0 +1,76 @@
# Contributing to WPextract

We welcome contributions to WPextract! Here are some helpful guidelines.

## Get Started

To install WPextract for development, you'll need [Poetry](https://python-poetry.org/) to automatically manage the virtual environment.

Fork and clone the repo, then run `poetry install` (or `poetry install --with docs` if you'd like to build the documentation locally). This will install all the dependencies and the package itself as an editable install.

It's best practice to open an issue before submitting a PR, so that any necessary discussion can happen first.

## Testing

Tests for WPextract are written with pytest. Some tips for writing tests:

* approximately follow the module structure in the tests directory
* use [pytest-datadir](https://pypi.org/project/pytest-datadir/) to manage test data files on disk
* make sure to properly mock parts of the code which make HTTP requests (see tests of the `download` module for help)

To run tests, use:

```shell-session
# Just run tests
$ make testonly
# Run tests and open coverage HTML
$ make test
```

## Linting

We use [Ruff](https://docs.astral.sh/ruff/) to lint WPextract. This happens in two stages, which can be easily run with Make tasks:

```shell-session
# Reformat code
$ make format
# Find problems, autofixing if possible
$ make lint
```

Both library code and tests are linted, although the rules for tests are slightly less restrictive (see `pyproject.toml`).

## Branch Management

Generally your contribution should be made to the `dev` branch. We will then merge it into `main` only when it's time to release.

The exception to this is for documentation, where changes should be applied directly to `main` if they are corrections of the current documentation version (but still `dev` if they relate to upcoming changes).

## Documentation

Documentation for WPextract is built with MkDocs and Read the Docs.

To build documentation locally (ensuring that the project was installed with the `--with docs` flag), run:

```shell-session
$ make docdev
```

When a PR is created, Read the Docs will build a preview version. If a link to the preview isn't posted as a comment on the PR, check [the dashboard here](https://readthedocs.org/projects/wpextract/builds/).

Documentation is hosted at:

- The [latest](https://wpextract.readthedocs.io/en/latest/) version is built from `main`
- The [unstable next release](https://wpextract.readthedocs.io/en/dev/) is built from `dev`

We use the `latest` version built from `main` as the public documentation, as this allows fixes to the live docs to be made without having to create a new release.

The following parts of the documentation may require manual updates along with your changes:

- the API reference documents a manually-selected set of classes, which cover the two high-level functionality classes and any necessary classes (or types) required to use them.
- the CLI usage docs are manually written, generally copying the help messages but sometimes more detailed.
- if you change the `LangPicker` base class, update the examples of how to write language pickers [here](https://wpextract.readthedocs.io/en/latest/advanced/multilingual/).

## Releasing

To make a new release, we merge `dev` to `main`, then tag the commit. This automatically triggers a workflow to publish to PyPI. After a while, the [conda-forge feedstock](https://github.com/conda-forge/wpextract-feedstock) will automatically receive a PR to update the version.
2 changes: 1 addition & 1 deletion docs/advanced/multilingual.md
@@ -1,6 +1,6 @@
# Multilingual Sites

-If sites publish in multiple languages and use a plugin to present a list of language versions, wpextract can parse this and add multilingual data in the output dataset.
+If sites publish in multiple languages and use a plugin to present a list of language versions, wpextract can parse this and add links between translated versions in the output dataset.

## Extraction Process

15 changes: 15 additions & 0 deletions docs/changelog.md
@@ -1,5 +1,20 @@
# Changelog

+## 1.0.3 (2024-08-06)
+
+**Changes**
+
+- Added missing `wpextract.__version__` attribute (!36)
+- Added `<table>`s to the elements to be ignored when extracting article text (!40)
+
+**Fixes**
+
+- Fixed incorrect behaviour extracting article text where only the first element to ignore (e.g. `figcaption`) would be ignored (!40)
+
+**Documentation**
+
+- Added proper references to the documentation of the [`langcodes`](https://github.com/georgkrause/langcodes) library (!38)
+
## 1.0.2 (2024-07-12)

- Fixed not explicitly declaring dependency on `urllib3` (!32)
8 changes: 2 additions & 6 deletions docs/intro/install.md
@@ -73,12 +73,8 @@ or through importing as a library:

```pycon
>>> import wpextract
->>> help(wpextract)
-Help on package wpextract:
-
-NAME
-wpextract
-# etc...
+>>> wpextract.__version__
+'1.0.0'
```

## For Development
4 changes: 1 addition & 3 deletions docs/intro/start.md
@@ -25,9 +25,7 @@ $ pipx install wpextract
WPextract works in two steps:

1. The **downloader** uses the WordPress REST API to obtain all content on the site, which is stored as a single, long file
-2. The **extractor** converts this into a usable dataset by enriching the downloaded content. This includes extracting text, images, resolving links to posts/pages, and finding translated versions[^lang]
-
-[^lang]: {-} See the specific guide for more on multilingual extraction.
+2. The **extractor** converts this into a usable dataset by enriching the downloaded content. This includes extracting text, images, resolving links to posts/pages, and finding translated versions ([more on translation extraction](../advanced/multilingual.md))

We call these two stages using two CLI commands ([`wpextract download`](../usage/download.md#command-usage) and [`wpextract extract`](../usage/extract.md#command-usage)). Alternatively, WPextract can be integrated into a project by [using it as a library](../advanced/library.md).

8 changes: 1 addition & 7 deletions pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name="wpextract"
-version="1.0.2"
+version="1.0.3"
description="Create datasets from WordPress sites"
homepage="https://wpextract.readthedocs.io/"
documentation="https://wpextract.readthedocs.io/"
@@ -92,12 +92,6 @@ ignore = [
"D103", # Ignore method docstring errors in tests
"PD901", # Allow `df` variable name in tests
]
-#"src/wpextract/download/*" = [
-# "D415",
-# "D103",
-# "D101",
-# "D107"
-#]

[tool.ruff.lint.pydocstyle]
convention = "google"
4 changes: 4 additions & 0 deletions src/wpextract/__init__.py
@@ -1,3 +1,7 @@
+from importlib.metadata import version
+
 from wpextract.downloader import WPDownloader as WPDownloader

 from .extract import WPExtractor as WPExtractor
+
+__version__ = version("wpextract")
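
The `__version__` line relies on `importlib.metadata.version`, which reads the version from the installed distribution's metadata and raises `PackageNotFoundError` when no such distribution is installed. The following is a minimal sketch of that behaviour; the helper `installed_version` and its `default` fallback are hypothetical illustrations and are not part of this commit.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str, default: str = "unknown") -> str:
    """Return the installed version of a distribution, or `default` if absent.

    Hypothetical helper for illustration only; WPextract itself simply
    calls version("wpextract") at import time.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # No distribution metadata was found, for example when the code
        # runs from a source checkout that was never installed.
        return default


if __name__ == "__main__":
    # Prints the installed version, or "unknown" if wpextract is absent.
    print(installed_version("wpextract"))
```

Because the commit calls `version("wpextract")` without such a fallback, importing the package presumably requires it to be installed (an editable install via `poetry install` counts).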
10 changes: 5 additions & 5 deletions src/wpextract/parse/content.py
@@ -10,7 +10,7 @@
from wpextract.extractors.media import get_caption
from wpextract.util.str import squash_whitespace

-EXCLUDED_CONTENT_TAGS = {"figcaption"}
+EXCLUDED_CONTENT_TAGS = {"figcaption", "table"}
NEWLINE_TAGS = {"br", "p"}


@@ -136,7 +136,7 @@ def extract_content_data(doc: BeautifulSoup, self_link: str) -> pd.Series:
     """Extract the links, embeds, images and text content of the document.

     Args:
-        doc: A parsed document.
+        doc: A parsed document body.
         self_link: The URL of the page.

     Returns:
@@ -147,12 +147,12 @@ def extract_content_data(doc: BeautifulSoup, self_link: str) -> pd.Series:
     images = extract_images(doc, self_link)

     doc_c = copy.copy(doc)
-    for child in doc_c.descendants:
-        if type(child) == NavigableString:
+    for child in list(doc_c.descendants):
+        if child.decomposed or type(child) == NavigableString:
             continue

         if child.name in EXCLUDED_CONTENT_TAGS:
-            child.extract()
+            child.decompose()

     content_text = squash_whitespace(_get_text(doc_c))
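
The switch from iterating `doc_c.descendants` directly to iterating `list(doc_c.descendants)` follows a general rule: removing nodes from a linked structure while lazily traversing it can cut the traversal short, whereas traversing a snapshot taken up front cannot. The sketch below is a stand-alone model of that rule and is not BeautifulSoup's implementation. `Node`, `build`, `walk`, `detach`, `strip_lazy` and `strip_snapshot` are hypothetical names, and it is an assumption that the original bug arose in a comparable way.

```python
from typing import Iterator, List, Optional

EXCLUDED = {"figcaption", "table"}


class Node:
    """Hypothetical stand-in for a parsed element (not the bs4 API)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.prev: Optional["Node"] = None
        self.next: Optional["Node"] = None


def build(tags: List[str]) -> Node:
    """Link one node per tag name into a chain and return the first node."""
    nodes = [Node(tag) for tag in tags]
    for left, right in zip(nodes, nodes[1:]):
        left.next, right.prev = right, left
    return nodes[0]


def walk(head: Node) -> Iterator[Node]:
    """Lazily follow `next` pointers, reading each one only when resumed."""
    current: Optional[Node] = head
    while current is not None:
        yield current
        current = current.next


def detach(node: Node) -> None:
    """Unlink `node` from its neighbours and clear its own pointers."""
    if node.prev is not None:
        node.prev.next = node.next
    if node.next is not None:
        node.next.prev = node.prev
    node.prev = node.next = None


def strip_lazy(head: Node) -> Node:
    """Remove excluded nodes while walking the live chain."""
    for node in walk(head):
        if node.name in EXCLUDED:
            detach(node)  # clears node.next, so walk() ends right here
    return head


def strip_snapshot(head: Node) -> Node:
    """Remove excluded nodes while walking a snapshot taken up front."""
    for node in list(walk(head)):
        if node.name in EXCLUDED:
            detach(node)
    return head


def names(head: Node) -> List[str]:
    """Collect the tag names still reachable from `head`."""
    return [node.name for node in walk(head)]


if __name__ == "__main__":
    # The first node is never excluded here, so `head` stays valid.
    sample = ["p", "figcaption", "p", "figcaption", "table", "p"]
    print(names(strip_lazy(build(sample))))
    # ['p', 'p', 'figcaption', 'table', 'p']
    print(names(strip_snapshot(build(sample))))
    # ['p', 'p', 'p']
```

In this model the lazy version removes only the first excluded node, which mirrors the symptom described in the changelog entry for this release.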

6 changes: 5 additions & 1 deletion src/wpextract/parse/translations/_resolver.py
@@ -14,5 +14,9 @@ class TranslationLink(ResolvableLink):

     @property
     def language(self) -> Language:
-        """Parsed and normalized language. Populated automatically post-init."""
+        """Parsed and normalized language. Populated automatically post-init.
+
+        See Also:
+            [`langcodes` documentation](https://github.com/georgkrause/langcodes?tab=readme-ov-file#language-objects)
+        """
         return Language.get(self.lang, normalize=True)
1 change: 0 additions & 1 deletion tests/parse/test_content.py
@@ -82,7 +82,6 @@ def test_extract_image_without_src(datadir: Path):

 def test_extract_content(datadir: Path):
     doc = BeautifulSoup((datadir / "content_extraction.html").read_text(), "lxml")
-
     content_series = extract_content_data(doc, "https://example.org/home")
     text = content_series[0]

10 changes: 10 additions & 0 deletions tests/parse/test_content/content_extraction.html
@@ -1,8 +1,18 @@
 <!-- Note: this isn't a complete HTML doc because it's designed to run on just the content -->
 <p>The first paragraph.</p>
 <figure>
 <img src="/example-image.png" alt="Some alt text" />
 <figcaption>A caption</figcaption>
 </figure>
+<figure>
+<img src="/example-image.png" alt="Some alt text" />
+<figcaption>A second caption</figcaption>
+</figure>
+<table>
+<tr>
+<td>This is inside a table cell</td>
+</tr>
+</table>
+<!-- a comment, I should be ignored -->
<p>The second paragraph.</p><p>The third paragraph.</p>
Not in a paragraph.