Enhance allergen scraping #485

Merged: 5 commits, Jan 26, 2024
49 changes: 34 additions & 15 deletions api-resto-02.md
@@ -6,7 +6,7 @@ The resto API provides information about the student restaurants of Ghent Univer

These data are scraped from https://www.ugent.be/student/nl/meer-dan-studeren/resto.

The menu data is property of Ghent University. We don't guarantee the correctness or completeness of the data.
The menu data is property of Ghent University. We do not guarantee the correctness or completeness of the data.

## Versioning and status

@@ -15,7 +15,7 @@ This document describes the current version of the API, version 2.0.
| Version | Endpoint | Status |
|------------------------|---------------------------------------|---------|
| [1.0](api-resto-01.md) | https://hydra.ugent.be/api/1.0/resto/ | retired |
| 2.4 (this) | https://hydra.ugent.be/api/2.0/resto/ | current |
| 2.5 (this) | https://hydra.ugent.be/api/2.0/resto/ | current |

## Data dump

@@ -34,6 +34,7 @@ need all available data, it is probably easier and faster to download or clone t
- At some point in 2021 or early 2022, the zeus.ugent.be/hydra endpoint stopped working. We could fix it, but we assume
most clients have migrated or are able to.
- _October 2022_ - Allergen information was added.
- _January 2024_ - Allergen information has been added to vegetables, with the field `vegetables2`.

## Technical description

@@ -187,6 +188,15 @@ Returns the menu for each available day in the future, including today. Sample o
"Bloemkool",
"Prinsessengroenten"
],
"vegetables2": [
{
"kind": "vegan",
"name": "Bloemkool",
"allergens": [
"Bloemkool"
]
}
],
"message": "Alle studenten krijgen op vertoon van Hydra 150% korting."
}
]
@@ -256,13 +266,14 @@ A sample endpoint is `/menu/nl/2017/5/18.json`. Sample output is:

A menu object consists of:

| Field | Description |
|--------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `date` | The date of the menu. The date's format follows ISO 8601:2004's extended format (`YYYY-MM-DD`). |
| `open` | If set to `true`, the resto is open, otherwise not. If set to `false`. <br><br>Note that this is no guarantee: some days (like the weekends) are simply not present in the output. |
| `vegetables` | A list of available vegetables. |
| `meals` | A list of meal objects (see below). |
| `message` | Optional field containing a message to be displayed. Used for exceptional closures or changes in the menu. For example, if `open` is `false`, the message could be an explanation for the closure. |
| Field | Description |
|---------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `date` | The date of the menu. The date's format follows ISO 8601:2004's extended format (`YYYY-MM-DD`). |
| `open` | `true` if the resto is open, `false` if it is closed. <br><br>Note that this is no guarantee: some days (like the weekends) are simply not present in the output. |
| `vegetables` | A list of available vegetables. |
| `vegetables2` | A list of available vegetables in object form, with the kind and allergen information present, see below. |
| `meals` | A list of meal objects (see below). |
| `message` | Optional field containing a message to be displayed. Used for exceptional closures or changes in the menu. For example, if `open` is `false`, the message could be an explanation for the closure. |

A meal object consists of:

@@ -274,19 +285,27 @@ A meal object consists of:
| `type` | The meal type. Is currently `main` or `side`, but applications must be able to handle changes to the possible values. |
| `allergens` | List of allergens, matched on a best-effort basis from the [allergen information](#allergen-information). |

A vegetable object consists of:

| Field | Description |
|-------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `kind` | The kind of the vegetable. A subset of the meal kind, currently `meat`, `vegetarian`, or `vegan`. Applications must be able to handle changes to the possible values. |
| `name` | The name of the vegetable. |
| `allergens` | List of allergens, matched on a best-effort basis from the [allergen information](#allergen-information). |

> **Warning**
> The allergen information, like all other information in the API, is available on a best-efforts basis.
> The allergen information, like all other information in the API, is available on a best-effort basis.
> Particularly, this information IS NOT FIT to replace the legally mandated information about allergens.
> When showing these data to users, please inform them of this and link to the web page.

How an application handles changes to possible values (indicated above where this is applicable), is not specified.
How an application handles changes to possible values (indicated above where this is applicable) is not specified.
The application might simply ignore new values.
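
As a rough client-side sketch of the tables above, the snippet below reads a menu object, prefers `vegetables2` when it is present, falls back to the plain `vegetables` list otherwise, and tolerates `kind` values it does not recognise. The helper name `vegetable_summaries`, the `"unknown"` placeholder, and the fallback behaviour are illustrative assumptions, not part of the API.

```python
import json

# Kinds documented at the time of writing; the API may add more.
KNOWN_KINDS = {"meat", "vegetarian", "vegan"}


def vegetable_summaries(menu: dict) -> list[dict]:
    """Return vegetables as dicts, tolerating old and new response shapes.

    Hypothetical helper: falls back to the plain `vegetables` list of names
    when `vegetables2` is absent, and marks unrecognised kinds as "unknown".
    """
    if "vegetables2" in menu:
        entries = menu["vegetables2"]
    else:
        entries = [{"name": name} for name in menu.get("vegetables", [])]

    result = []
    for entry in entries:
        kind = entry.get("kind")
        result.append({
            "name": entry["name"],
            "kind": kind if kind in KNOWN_KINDS else "unknown",
            "allergens": list(entry.get("allergens", [])),
        })
    return result


# Made-up sample data in the shape described by the tables above.
sample = json.loads("""
{
  "open": true,
  "vegetables": ["Bloemkool", "Prinsessengroenten"],
  "vegetables2": [
    {"kind": "vegan", "name": "Bloemkool", "allergens": []}
  ]
}
""")
print(vegetable_summaries(sample))
```

Mapping unrecognised kinds to a placeholder is only one option; a client could equally drop such entries or show the raw value.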

### Regular sandwiches

**Endpoint**: `GET /sandwiches/static.json`

Lists available regular sandwiches, their price and their ingredients. Sample output:
Lists available regular sandwiches, their price, and their ingredients. Sample output:

```json
[
@@ -327,9 +346,9 @@ as [Weekly sandwiches yearly](#weekly-sandwiches-yearly).
- _year_ -- Which year you want the sandwiches of. Values must be a positive integer. Currently, the earliest available
year is 2019 (but this might change in the future). ISO format: `YYYY`.

Starting in academic year 2020-2021, this is listed as "groentespread".
Starting in academic year 2020–2021, this is listed as "groentespread".

Lists all sandwiches which were or are available in the specified year. Sample output:
Lists all sandwiches that were or are available in the specified year. Sample output:

```json
[
@@ -388,7 +407,7 @@ Since that webpage is made manually, it is very possible that the names used her
menu.

> **Warning**
> This parser, as all other information in the API, is available on a best-efforts basis.
> This parser, as all other information in the API, is available on a best-effort basis.
> Particularly, this information IS NOT FIT to replace the legally mandated information about allergens.
> When showing these data to users, please inform them of this and link to the web page.

13 changes: 10 additions & 3 deletions server/scraper/resto/allergens.py
@@ -1,5 +1,6 @@
#!/usr/bin/env python3
import argparse
import itertools
import os
import sys
from typing import Union
@@ -16,7 +17,10 @@
URL = "https://www.ugent.be/student/nl/meer-dan-studeren/resto/allergenen"
SKIPPED_ELEMENTS = [
"vegetarisch",
"vegan"
"vegan",
"veggie",
"msc",
"asc"
]


@@ -33,18 +37,21 @@ def parse_section_item(section_item: str) -> Union[dict[str, list[str]], None]:
        item_name = "Soep van de dag"
        item_allergen_list = section_item
    else:
        item_name, item_allergen_list = section_item.split(":", maxsplit=1)
        item_name, item_allergen_list = section_item.rsplit(":", maxsplit=1)

    # Sometimes a section will have extra info before the item list,
    # this should not be parsed
    if item_allergen_list == "":
        return None

    item_allergens = list(map(lambda a: a.strip(), item_allergen_list.split(",")))
    # Split items with "-"
    item_allergens = list(itertools.chain.from_iterable(item.split("-") for item in item_allergens))
    item_allergens = [x.strip().strip(".") for x in item_allergens]

    # Exclude last item, it is not an allergen but a diet name
    # eg. 'Vegetarian' or 'Vegan'
    return {item_name.lower(): sorted({x.strip(".") for x in item_allergens if x.strip(".") not in SKIPPED_ELEMENTS})}
    return {item_name.lower(): sorted({x for x in item_allergens if x not in SKIPPED_ELEMENTS})}


def make_sections(
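
The parsing change in this file can be tried in isolation. The following is a minimal standalone sketch, not the scraper's actual function: `parse_item_line` is a hypothetical name, the sample line in the usage note is made up, and unlike the hunk above it also returns `None` for lines without a colon and drops empty fragments.

```python
import itertools
from typing import Optional

# Labels that appear in the allergen lists but are not allergens.
SKIPPED_ELEMENTS = ["vegetarisch", "vegan", "veggie", "msc", "asc"]


def parse_item_line(line: str) -> Optional[dict[str, list[str]]]:
    """Parse 'name: a, b - c.' into {name: sorted unique allergens}."""
    # Assumption: lines without a colon carry no allergen list.
    if ":" not in line:
        return None
    # rsplit splits on the LAST colon, so colons inside the name survive.
    name, raw = line.rsplit(":", maxsplit=1)
    if raw.strip() == "":
        return None
    # Allergens are separated by "," and sometimes by "-".
    parts = raw.split(",")
    parts = itertools.chain.from_iterable(p.split("-") for p in parts)
    # Trim whitespace first, then a trailing full stop.
    cleaned = {p.strip().strip(".") for p in parts}
    allergens = sorted(a for a in cleaned if a and a not in SKIPPED_ELEMENTS)
    return {name.lower(): allergens}
```

For example, `parse_item_line("Wraps: kip: gluten, melk - soja, veggie.")` keeps `wraps: kip` as the item name and skips the diet label `veggie`.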
25 changes: 23 additions & 2 deletions server/scraper/resto/menu.py
@@ -166,7 +166,8 @@
"warme maaltijden: vlees",
"groenten bij warme maaltijden",
"zetmeel",
"soep"
"soep",
"groenten bij warme maaltijden"
]


@@ -295,6 +296,11 @@ def find_allergens_for_food(allergens: Dict[str, str], food: str) -> list[str]:
    found = []
    for part in food_parts:
        found += allergens.get(part, [])
    # Also do the reverse search if we didn't find any allergens.
    if not found:
        # Use a separate loop variable so the `allergens` mapping is not shadowed.
        for allergen_food, food_allergens in allergens.items():
            if allergen_food in food:
                found += food_allergens
    return found
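
The two-step lookup in this hunk (exact match on parts of the food name, then a substring search as a fallback) might be sketched on its own as follows. How the real code builds `food_parts` is not visible in the diff, so the whitespace split below is an assumption, as are the function name `find_allergens` and the sample table.

```python
def find_allergens(allergens: dict[str, list[str]], food: str) -> list[str]:
    """Look up allergens for a food name, with a substring fallback."""
    food = food.lower()
    # Assumption: the real code derives `food_parts` differently; here we try
    # the whole name followed by each whitespace-separated word, deduplicated.
    food_parts = list(dict.fromkeys([food, *food.split()]))

    found: list[str] = []
    # Forward search: exact key lookup for each part of the food name.
    for part in food_parts:
        found += allergens.get(part, [])
    # Reverse search: only if nothing matched, look for known food names
    # that occur as a substring of the full food name.
    if not found:
        for known_food, known_allergens in allergens.items():
            if known_food in food:
                found += known_allergens
    return found


# Made-up allergen table for illustration only.
table = {"spaghetti bolognese": ["gluten", "selder"], "bloemkool": []}
print(find_allergens(table, "Spaghetti bolognese met kaas"))
```

Here no single part of the name is a key in the table, so the result comes from the reverse search.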


@@ -309,6 +315,7 @@ def get_day_menu(which, url, allergens: Dict[str, str]):
# system)
day_menu = pq(url=url)
vegetables = []
vegetables2 = []
meats = []
soups = []

@@ -394,14 +401,28 @@ def get_day_menu(which, url, allergens: Dict[str, str]):
    meats.append(dict(price=price, name=name, kind=kind, hot=hot_cold, allergens=food_allergens))
elif HEADING_TO_TYPE[last_heading] == 'vegetables':
    vegetables.append(meal)
    if ":" in meal:
        # Split on the first colon only, so names containing ":" do not raise.
        kind, name = (part.strip() for part in meal.split(":", maxsplit=1))
        if kind not in ('vegan', 'vegetarian'):
            kind = 'meat'
    else:
        kind = 'meat'
        name = meal
    vegetable_allergens = find_allergens_for_food(allergens, name)
    vegetable = {
        'name': name,
        'kind': kind,
        'allergens': vegetable_allergens
    }
    vegetables2.append(vegetable)
else:
    raise ValueError(f"Oops, HEADING_TO_TYPE contains unknown value for {last_heading}.")

# sometimes the closed indicator has a different layout.
if not vegetables and not soups and not meats:
return dict(open=False)

r = dict(open=True, vegetables=vegetables, soup=soups, meat=meats)
r = dict(open=True, vegetables=vegetables, vegetables2=vegetables2, soup=soups, meat=meats)
return r
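
For testing outside the scraper, the vegetable handling above could be pulled into a small helper. This is a sketch under assumptions: `parse_vegetable` is a hypothetical name, the plain dictionary lookup stands in for `find_allergens_for_food`, and the allergen value in the usage note is invented.

```python
def parse_vegetable(meal: str, allergens: dict[str, list[str]]) -> dict:
    """Turn a raw vegetable entry into a `vegetables2`-style object."""
    kind, name = "meat", meal.strip()
    if ":" in meal:
        # An optional "kind: name" prefix; split on the first colon only.
        prefix, rest = meal.split(":", maxsplit=1)
        name = rest.strip()
        if prefix.strip() in ("vegan", "vegetarian"):
            kind = prefix.strip()
    return {
        "name": name,
        "kind": kind,
        # Stand-in for find_allergens_for_food: exact lowercase lookup.
        "allergens": list(allergens.get(name.lower(), [])),
    }
```

With a made-up table, `parse_vegetable("vegan: Bloemkool", {"bloemkool": ["selder"]})` yields a `vegan` entry named `Bloemkool`, while an entry without a prefix defaults to kind `meat`, mirroring the hunk above.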

