Prototype extracting structured data from soil science papers to build a benchmark dataset for soil system models (e.g., carbon cycles).
This prototype extracts measurement data and metadata from soil science papers. Example input and output are detailed below.
{
  "title": "Carbon persistence of soils with long-term biosolids amendments in California agroecosystems",
  "authors": "Yocelyn B. Villa, Eddie Phosalath, Rebecca Ryals",
  "year": 2024,
  "doi": "10.1002/saj2.20632",
  "locations": [
    {
      "name": "Sacramento",
      "address": "Sacramento County, CA",
      "latitude": 38.335083,
      "longitude": -121.168472,
      "measurements": [
        {
          "measurement_value": 22.0,
          "measurement_name": "TOC in FLF",
          "measurement_year": "2024",
          "measurement_depth": "0-10 cm",
          "measurement_unit": "Mg C/ha",
          "group": "treatment"
        },
        {
          "measurement_value": 7.84,
          "measurement_name": "TOC in FLF",
          "measurement_year": "2024",
          "measurement_depth": "0-10 cm",
          "measurement_unit": "Mg C/ha",
          "group": "control"
        },
        ... more
      ]
    },
    ... more
  ]
}
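Since the goal is a benchmark dataset, the nested output usually has to become a flat table at some point. The following is a minimal sketch of that step, not part of the prototype itself: `flatten_paper` is a hypothetical helper, and the input is a trimmed copy of the example above with the elided entries left out.

```python
import json


def flatten_paper(paper: dict) -> list[dict]:
    """Return one flat row per measurement, repeating paper and site fields."""
    rows = []
    for location in paper.get("locations", []):
        for measurement in location.get("measurements", []):
            row = {
                "doi": paper.get("doi"),
                "year": paper.get("year"),
                "site": location.get("name"),
                "latitude": location.get("latitude"),
                "longitude": location.get("longitude"),
            }
            # Measurement keys do not collide with the paper/site keys above.
            row.update(measurement)
            rows.append(row)
    return rows


# Trimmed copy of the example output (only the two measurements shown).
example = json.loads("""
{
  "title": "Carbon persistence of soils with long-term biosolids amendments in California agroecosystems",
  "year": 2024,
  "doi": "10.1002/saj2.20632",
  "locations": [
    {
      "name": "Sacramento",
      "latitude": 38.335083,
      "longitude": -121.168472,
      "measurements": [
        {"measurement_value": 22.0, "measurement_name": "TOC in FLF",
         "measurement_year": "2024", "measurement_depth": "0-10 cm",
         "measurement_unit": "Mg C/ha", "group": "treatment"},
        {"measurement_value": 7.84, "measurement_name": "TOC in FLF",
         "measurement_year": "2024", "measurement_depth": "0-10 cm",
         "measurement_unit": "Mg C/ha", "group": "control"}
      ]
    }
  ]
}
""")

rows = flatten_paper(example)
for row in rows:
    print(row["site"], row["group"], row["measurement_value"], row["measurement_unit"])
```

Each row carries the DOI and coordinates, so rows from different papers can be concatenated without losing provenance.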
For more examples, visit these links (continuously updated).

Observations so far:
- Converting PDFs to Markdown works well.
- Extracting metadata like study name, authors, site location, and data collection year is easy.
- A universal structure is hard to create and may need iterative refinement.
- Models can confuse similar but distinct measurements.
This prototype has 2 steps:
- Convert PDFs to markdown.
- Extract structured data.
We convert PDFs to Markdown using llama_parse, which performs well for table extraction.
|Site|Treatment|Depth|Carbon (Mg C·ha−1)|Carbon (Mg C·ha−1)|Carbon (Mg C·ha−1)|Nitrogen (Mg N·ha−1)|Nitrogen (Mg N·ha−1)|Nitrogen (Mg N·ha−1)|
|---|---|---|---|---|---|---|---|---|
|Sacramento|Biosolids|0–10|22.00 ± 5.86|2.29 ± 0.29|1.38 ± 0.38|2.47 ± 0.47|0.37 ± 0.09|0.16 ± 0.03|
|Sacramento|Control|0–10|7.84 ± 0.22|8.86 ± 1.44|3.84 ± 0.21|0.80 ± 0.19|0.70 ± 0.10|0.53 ± 0.04|
|Solano|Biosolids|0–10|21.61 ± 1.18|8.95 ± 7.35|4.11 ± 0.61|3.29 ± 0.40|0.72 ± 0.59|0.62 ± 0.08|
|Solano|Control|0–10|8.45 ± 3.77|1.85 ± 0.31|5.53 ± 0.30|0.73 ± 0.33|0.18 ± 0.04|0.77 ± 0.05|
|Merced|Biosolids|0–10|1.55 ± 0.76|3.56 ± 1.13|7.74 ± 2.89|0.19 ± 0.13|0.40 ± 0.23|0.95 ± 0.15|
|Merced|Control|0–10|10.18 ± 8.71|4.87 ± 0.53|4.83 ± 0.39|1.80 ± 1.66|0.88 ± 0.21|0.66 ± 0.03|
This preprocessing step is crucial, as parsing directly impacts downstream extraction performance.
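Because extraction quality depends on the parsed tables, a quick structural check of the Markdown can catch parsing problems early. The sketch below is an illustration only and not part of the prototype: `parse_markdown_table` and `split_mean_sd` are hypothetical helpers, and the table is a trimmed version of the one above with a single carbon and a single nitrogen column.

```python
def parse_markdown_table(markdown: str) -> list[list[str]]:
    """Split a pipe-delimited Markdown table into rows of stripped cells."""
    rows = []
    for line in markdown.strip().splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Skip the separator row, whose cells contain only dashes and colons.
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    return rows


def split_mean_sd(cell: str) -> tuple[float, float]:
    """Split a 'mean ± sd' cell into two floats."""
    mean, sd = cell.split("±")
    return float(mean), float(sd)


# Trimmed version of the parsed table (first carbon and nitrogen columns only).
table = """
|Site|Treatment|Depth|Carbon (Mg C·ha−1)|Nitrogen (Mg N·ha−1)|
|---|---|---|---|---|
|Sacramento|Biosolids|0–10|22.00 ± 5.86|2.47 ± 0.47|
|Sacramento|Control|0–10|7.84 ± 0.22|0.80 ± 0.19|
"""

header, *body = parse_markdown_table(table)

# Rows whose cell count differs from the header usually signal a parsing issue.
ragged = [row for row in body if len(row) != len(header)]
print("ragged rows:", ragged)

for row in body:
    mean, sd = split_mean_sd(row[3])
    print(row[0], row[1], mean, sd)
```

A non-empty `ragged` list is a cheap signal that a table should be re-parsed or inspected by hand before extraction.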
We first define a flexible output structure to capture all necessary information, accommodating variations in measurements and data collection sites across studies.
A simple Paper is defined as follows:
from pydantic import BaseModel  # BaseModel is assumed to come from Pydantic


class Paper(BaseModel):
    title: str
    authors: str
    year: int
    doi: str
    locations: list[Location]
A paper has a title, authors, a year, a DOI, and a list of locations.
A location is defined as follows:
class Location(BaseModel):
    name: str
    address: str
    latitude: float
    longitude: float
    measurements: list[TopSoilOrganicCarbon | TopSoilOrganicCarbonChange]
A location has a name, an address, a latitude, a longitude, and a list of measurements.
There are two types of measurements: raw measurements and change measurements.
class TopSoilOrganicCarbon(BaseModel):
    """Topsoil organic carbon weight measurements."""

    measurement_name: str
    measurement_value: float
    measurement_group: Group
    measurement_year: str
    measurement_depth: str
    measurement_unit: str


class TopSoilOrganicCarbonChange(BaseModel):
    """Change in topsoil organic carbon weight measurements."""

    measurement_name: str
    measurement_change_value: float
    measurement_relative_to: str
    measurement_year: str
    measurement_depth: str
    measurement_unit: str
Both measurement types include measurement_name, measurement_year, measurement_depth, and measurement_unit. The TopSoilOrganicCarbon measurement has measurement_value and measurement_group (indicating treatment or control group). The TopSoilOrganicCarbonChange measurement includes measurement_change_value and measurement_relative_to (indicating the reference point for the change).
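The field split described above can also be checked on plain dictionaries, for example when inspecting saved output without loading the models. This is a rough sketch under that assumption: `measurement_kind` is a hypothetical helper, and the values in the change record are made-up placeholders, not data from any paper.

```python
SHARED = {"measurement_name", "measurement_year", "measurement_depth", "measurement_unit"}
RAW_ONLY = {"measurement_value", "measurement_group"}
CHANGE_ONLY = {"measurement_change_value", "measurement_relative_to"}


def measurement_kind(record: dict) -> str:
    """Return 'raw' or 'change' depending on which field set the record has."""
    keys = set(record)
    if SHARED | CHANGE_ONLY <= keys:
        return "change"
    if SHARED | RAW_ONLY <= keys:
        return "raw"
    raise ValueError(f"unrecognized measurement fields: {sorted(keys)}")


raw_record = {
    "measurement_name": "TOC in FLF",
    "measurement_value": 22.0,
    "measurement_group": "treatment",
    "measurement_year": "2024",
    "measurement_depth": "0-10 cm",
    "measurement_unit": "Mg C/ha",
}

# Placeholder values for illustration only; not taken from a real paper.
change_record = {
    "measurement_name": "TOC change",
    "measurement_change_value": 1.0,
    "measurement_relative_to": "baseline",
    "measurement_year": "2024",
    "measurement_depth": "0-10 cm",
    "measurement_unit": "Mg C/ha",
}

print(measurement_kind(raw_record))
print(measurement_kind(change_record))
```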
Group has two options: treatment or control.
from enum import StrEnum, auto  # StrEnum requires Python 3.11+


class Group(StrEnum):
    TREATMENT = auto()
    CONTROL = auto()
Providing a structure and a simple prompt allows an LLM to find the relevant table in the PDF and return structured data:
Prompt
Extract research paper information. Get all topsoil (< 20 cm deep) organic carbon (TOC) measurements in the units of mass fractions/concentrations (e.g., g/kg, mg/g, mg/ha...) from the paper. Also, extract changes in TOC over time due to land cover and land use change or long-term treatments if they are available.
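Since the prompt restricts results to topsoil, the returned `measurement_depth` strings can be checked after extraction. The following is a minimal sketch, not part of the prototype: `parse_depth_cm` and `is_topsoil` are hypothetical helpers, depth strings are assumed to look like "0-10 cm", and treating a layer that ends at exactly 20 cm as topsoil is an assumption.

```python
import re

# Matches "0-10 cm", "0–10" (en dash), or "2.5 - 7.5 cm".
DEPTH_PATTERN = re.compile(
    r"\s*(\d+(?:\.\d+)?)\s*[-–]\s*(\d+(?:\.\d+)?)\s*(?:cm)?\s*"
)


def parse_depth_cm(depth: str) -> tuple[float, float]:
    """Parse a depth range string into (top, bottom) in centimeters."""
    match = DEPTH_PATTERN.fullmatch(depth)
    if match is None:
        raise ValueError(f"unrecognized depth format: {depth!r}")
    return float(match.group(1)), float(match.group(2))


def is_topsoil(depth: str, max_depth_cm: float = 20.0) -> bool:
    """Return True if the layer ends at or above max_depth_cm (assumed cutoff)."""
    _, bottom = parse_depth_cm(depth)
    return bottom <= max_depth_cm


for depth in ["0-10 cm", "0–10", "20-40 cm"]:
    print(depth, is_topsoil(depth))
```

Measurements that fail this check can be flagged for review instead of being dropped silently.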
Code to run the pipeline: run.py.