Release 1.0.3 (#42)

* Add package version attribute (#36)
* Add version attribute to package
* Revert "Hotfix: remove usage of __version__ in docs (#35)"
  This reverts commit 641375a.
* add contributing guidelines (#37)
* Add ref to langcodes docs (#38)
* add manual ref to Language class
* fix footnote in start
* make opening to multilingual docs clearer
* Fix element exclusion in text extraction (#40)
* Prepare 1.0.3 release (#41)
* prepare 1.0.3
* fix changelog sections
GateNLP · Aug 6, 2024 · aa8f449
1 parent 641375a
commit aa8f449

Showing 11 changed files with 120 additions and 24 deletions.
diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md
@@ -0,0 +1,76 @@
+# Contributing to WPextract
+
+We welcome contributions to WPextract! Here are some helpful guidelines.
+
+## Get Started
+
+To install WPextract for development, you'll need [Poetry](https://python-poetry.org/) to automatically manage the virtual environment.
+
+Fork and clone the repo, then run `poetry install` (or `poetry install --with docs` if you'd like to build the documentation locally). This will install all the dependencies and the package itself as an editable install.
+
+It's best practice to open an issue before a PR so that any necessary discussion can happen first.
+
+## Testing
+
+Tests for WPextract are written with pytest. Some tips for writing tests:
+
+* approximately follow the module structure in the tests directory
+* use [pytest-datadir](https://pypi.org/project/pytest-datadir/) to manage test data files on disk
+* make sure to properly mock parts of the code that make HTTP requests (see tests of the `download` module for help)
+
+To run tests, use:
+
+```shell-session
+# Just run tests
+$ make testonly 
+# Run tests and open coverage HTML
+$ make test
+```
+
+## Linting
+
+We use [Ruff](https://docs.astral.sh/ruff/) to lint WPextract. This happens in two stages, which can be easily run with Make tasks:
+
+```shell-session
+# Reformat code
+$ make format
+# Find problems, autofixing if possible
+$ make lint
+``` 
+
+Both library code and tests are linted, although the rules for tests are slightly less strict (see `pyproject.toml`).
+
+## Branch Management
+
+Generally your contribution should be made to the `dev` branch. We will then merge it into `main` only when it's time to release.
+
+The exception to this is for documentation, where changes should be applied directly to `main` if they are corrections of the current documentation version (but still `dev` if they relate to upcoming changes).
+
+## Documentation
+
+Documentation for WPextract is built with MkDocs and Read the Docs.
+
+To build documentation locally (ensuring that the project was installed with the `--with docs` flag), run:
+
+```shell-session
+$ make docdev
+```
+
+When a PR is created, Read the Docs will build a preview version. If a link to the preview isn't posted as a comment on the PR, check [the builds dashboard](https://readthedocs.org/projects/wpextract/builds/).
+
+Documentation is hosted at:
+
+- The [latest](https://wpextract.readthedocs.io/en/latest/) version is built from `main`
+- The [unstable next release](https://wpextract.readthedocs.io/en/dev/) is built from `dev`
+
+We use the `latest` version built from `main` as the public documentation, as this allows fixes to the live docs to be made without having to create a new release.  
+
+The following parts of the documentation may require manual updates along with your changes:
+
+- the API reference documents a manually selected set of classes: the two high-level functionality classes and any classes (or types) required to use them.
+- the CLI usage docs are manually written, generally mirroring the help messages but sometimes going into more detail.
+- the examples of how to write language pickers in the [multilingual guide](https://wpextract.readthedocs.io/en/latest/advanced/multilingual/), if you change the `LangPicker` base class.
+
+## Releasing
+
+To make a new release, we merge `dev` into `main`, then tag the commit. This automatically triggers a workflow to publish to PyPI. After a while, the [conda-forge feedstock](https://github.com/conda-forge/wpextract-feedstock) will automatically receive a PR to update the version.
diff --git a/docs/advanced/multilingual.md b/docs/advanced/multilingual.md
@@ -1,6 +1,6 @@
 # Multilingual Sites
 
-If sites publish in multiple languages and use a plugin to present a list of language versions, wpextract can parse this and add multilingual data in the output dataset.
+If sites publish in multiple languages and use a plugin to present a list of language versions, wpextract can parse this and add links between translated versions in the output dataset.
 
 ## Extraction Process
 

diff --git a/docs/changelog.md b/docs/changelog.md
@@ -1,5 +1,20 @@
 # Changelog
 
+## 1.0.3 (2024-08-06)
+
+**Changes**
+
+- Added missing `wpextract.__version__` attribute (!36)
+- Added `<table>`s to the elements to be ignored when extracting article text (!40)
+
+**Fixes**
+
+- Fixed incorrect behaviour extracting article text where only the first element to ignore (e.g. `figcaption`) would be ignored (!40)
+
+**Documentation**
+
+- Added proper references to the documentation of the [`langcodes`](https://github.com/georgkrause/langcodes) library (!38)
+
 ## 1.0.2 (2024-07-12)
 
 - Fixed not explicitly declaring dependency on `urllib3` (!32)

diff --git a/docs/intro/install.md b/docs/intro/install.md
@@ -73,12 +73,8 @@ or through importing as a library:
 
 ```pycon
 >>> import wpextract
->>> help(wpextract)
-Help on package wpextract:
-
-NAME
-    wpextract
-# etc...
+>>> wpextract.__version__
+'1.0.0'
 ```
 
 ## For Development

diff --git a/docs/intro/start.md b/docs/intro/start.md
@@ -25,9 +25,7 @@ $ pipx install wpextract
 WPextract works in two steps:
 
 1. The **downloader** uses the WordPress REST API to obtain all content on the site, which is stored as a single, long file
-2. The **extractor** converts this into a usable dataset by enriching the downloaded content. This includes extracting text, images, resolving links to posts/pages, and finding translated versions[^lang]
-
-[^lang]: {-} See the specific guide for more on multilingual extraction.
+2. The **extractor** converts this into a usable dataset by enriching the downloaded content. This includes extracting text, images, resolving links to posts/pages, and finding translated versions ([more on translation extraction](../advanced/multilingual.md))
 
 We call these two stages using two CLI commands ([`wpextract download`](../usage/download.md#command-usage) and [`wpextract extract`](../usage/extract.md#command-usage)). Alternatively, WPextract can be integrated into a project by [using it as a library](../advanced/library.md).
 

diff --git a/pyproject.toml b/pyproject.toml
@@ -1,6 +1,6 @@
 [tool.poetry]
 name="wpextract"
-version="1.0.2"
+version="1.0.3"
 description="Create datasets from WordPress sites"
 homepage="https://wpextract.readthedocs.io/"
 documentation="https://wpextract.readthedocs.io/"
@@ -92,12 +92,6 @@ ignore = [
     "D103", # Ignore method docstring errors in tests
     "PD901", # Allow `df` variable name in tests
 ]
-#"src/wpextract/download/*" = [
-#    "D415",
-#    "D103",
-#    "D101",
-#    "D107"
-#]
 
 [tool.ruff.lint.pydocstyle]
 convention = "google"

diff --git a/src/wpextract/__init__.py b/src/wpextract/__init__.py
@@ -1,3 +1,7 @@
+from importlib.metadata import version
+
 from wpextract.downloader import WPDownloader as WPDownloader
 
 from .extract import WPExtractor as WPExtractor
+
+__version__ = version("wpextract")
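
The `__version__` line above reads the version from the installed distribution's metadata rather than hard-coding it. A minimal sketch of the same pattern, with a fallback for the case where the distribution is not installed (for example, running from a bare source checkout); the `get_version` helper and the `"0+unknown"` placeholder are illustrative assumptions, not part of WPextract:

```python
from importlib.metadata import PackageNotFoundError, version


def get_version(dist_name: str, fallback: str = "0+unknown") -> str:
    """Return the installed version of `dist_name`, or `fallback` if it is not installed."""
    try:
        # Looks up the version recorded in the installed distribution's metadata.
        return version(dist_name)
    except PackageNotFoundError:
        # Raised when no installed distribution with this name can be found.
        return fallback


if __name__ == "__main__":
    print(get_version("wpextract"))
```

Because the value comes from the installed metadata, it should follow the `version` field in `pyproject.toml` only after the package has been installed again.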
diff --git a/src/wpextract/parse/content.py b/src/wpextract/parse/content.py
@@ -10,7 +10,7 @@
 from wpextract.extractors.media import get_caption
 from wpextract.util.str import squash_whitespace
 
-EXCLUDED_CONTENT_TAGS = {"figcaption"}
+EXCLUDED_CONTENT_TAGS = {"figcaption", "table"}
 NEWLINE_TAGS = {"br", "p"}
 
 
@@ -136,7 +136,7 @@ def extract_content_data(doc: BeautifulSoup, self_link: str) -> pd.Series:
     """Extract the links, embeds, images and text content of the document.
 
     Args:
-        doc: A parsed document.
+        doc: A parsed document body.
         self_link: The URL of the page.
 
     Returns:
@@ -147,12 +147,12 @@ def extract_content_data(doc: BeautifulSoup, self_link: str) -> pd.Series:
     images = extract_images(doc, self_link)
 
     doc_c = copy.copy(doc)
-    for child in doc_c.descendants:
-        if type(child) == NavigableString:
+    for child in list(doc_c.descendants):
+        if child.decomposed or type(child) == NavigableString:
             continue
 
         if child.name in EXCLUDED_CONTENT_TAGS:
-            child.extract()
+            child.decompose()
 
     content_text = squash_whitespace(_get_text(doc_c))
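
The loop change in this hunk avoids removing nodes from the tree while lazily iterating over `descendants`. A minimal sketch of that pattern on a standalone snippet, assuming a Beautiful Soup 4 release that provides the `decomposed` property and using the built-in `html.parser`; the `strip_excluded` helper and the sample HTML are illustrative, not part of WPextract:

```python
import copy

from bs4 import BeautifulSoup, Tag

EXCLUDED_CONTENT_TAGS = {"figcaption", "table"}


def strip_excluded(doc: BeautifulSoup) -> BeautifulSoup:
    """Return a copy of `doc` with every excluded element removed."""
    doc_c = copy.copy(doc)  # work on a copy so the caller's tree is untouched
    # Snapshot the descendants first: removing nodes while walking the lazy
    # generator can end the iteration early.
    for child in list(doc_c.descendants):
        # Skip text nodes, and anything already destroyed along with a parent.
        if not isinstance(child, Tag) or child.decomposed:
            continue
        if child.name in EXCLUDED_CONTENT_TAGS:
            child.decompose()
    return doc_c


if __name__ == "__main__":
    html = (
        "<p>One.</p>"
        "<figure><figcaption>Cap 1</figcaption></figure>"
        "<figure><figcaption>Cap 2</figcaption></figure>"
        "<table><tr><td>cell</td></tr></table>"
        "<p>Two.</p>"
    )
    print(strip_excluded(BeautifulSoup(html, "html.parser")).get_text())
```

The `decomposed` check matters because the snapshot still contains the children of an element that was destroyed earlier in the loop.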
 

diff --git a/src/wpextract/parse/translations/_resolver.py b/src/wpextract/parse/translations/_resolver.py
@@ -14,5 +14,9 @@ class TranslationLink(ResolvableLink):
 
     @property
     def language(self) -> Language:
-        """Parsed and normalized language. Populated automatically post-init."""
+        """Parsed and normalized language. Populated automatically post-init.
+
+        See Also:
+            [`langcodes` documentation](https://github.com/georgkrause/langcodes?tab=readme-ov-file#language-objects)
+        """
         return Language.get(self.lang, normalize=True)
diff --git a/tests/parse/test_content.py b/tests/parse/test_content.py
@@ -82,7 +82,6 @@ def test_extract_image_without_src(datadir: Path):
 
 def test_extract_content(datadir: Path):
     doc = BeautifulSoup((datadir / "content_extraction.html").read_text(), "lxml")
-
     content_series = extract_content_data(doc, "https://example.org/home")
     text = content_series[0]
 

diff --git a/tests/parse/test_content/content_extraction.html b/tests/parse/test_content/content_extraction.html
@@ -1,8 +1,18 @@
+<!-- Note: this isn't a complete HTML doc because it's designed to run on just the content -->
 <p>The first paragraph.</p>
 <figure>
   <img src="/example-image.png" alt="Some alt text" />
   <figcaption>A caption</figcaption>
 </figure>
+<figure>
+  <img src="/example-image.png" alt="Some alt text" />
+  <figcaption>A second caption</figcaption>
+</figure>
+<table>
+  <tr>
+    <td>This is inside a table cell</td>
+  </tr>
+</table>
 <!-- a comment, I should be ignored -->
 <p>The second paragraph.</p><p>The third paragraph.</p>
 Not in a paragraph.