Added NLP-Based Section Matching and Data Extraction Logic Proof of Concept #125

SiriChandanaGarimella · 2024-12-17T07:00:36Z

Fixes #99

What was changed?

Implemented the section matching logic in the ORCALogProcessor class.
Added enhanced NLP capabilities using:

- Text normalization (lowercasing and lemmatization with spaCy).
- Fuzzy matching for closely related terms (e.g., "gibbs free energies" → "gibbs free energy").

Updated _nlp_match_section and added a helper function, _lemmatize_term.
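For reviewers who want a quick mental model, the sketch below shows how lemmatization and fuzzy matching might combine. It is not the code in orca_log_processor.py: the names `lemmatize_term`, `match_section`, `TOY_LEMMAS`, and `SECTION_PATTERNS` are hypothetical, and a small lookup table stands in for spaCy so the example runs without a model download.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for spaCy lemmatization. With spaCy this would be
# roughly: " ".join(token.lemma_ for token in nlp(term.lower()))
TOY_LEMMAS = {"energies": "energy", "cycles": "cycle"}

# Illustrative section names only; the real self.section_patterns may differ.
SECTION_PATTERNS = ["GIBBS FREE ENERGY", "INNER ENERGY", "ENTHALPY"]


def lemmatize_term(term):
    """Lowercase a term and lemmatize it word by word."""
    return " ".join(TOY_LEMMAS.get(word, word) for word in term.lower().split())


def similarity(a, b):
    """Similarity ratio in [0, 1] as computed by difflib.SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()


def match_section(search_term, section_patterns, threshold=0.9):
    """Return the best-matching section name, or None if below the threshold."""
    needle = lemmatize_term(search_term)
    best_name, best_score = None, 0.0
    for name in section_patterns:
        score = similarity(needle, lemmatize_term(name))
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


# Raw strings, compared without lemmatization.
raw_score = similarity("gibbs free energies", "gibbs free energy")
matched = match_section("GIBBS FREE ENERGIES", SECTION_PATTERNS)
unmatched = match_section("DIPOLE MOMENT", SECTION_PATTERNS)
```

On the raw strings, `raw_score` comes out at about 0.89, just under a 0.90 cutoff. That is one reason to lemmatize before comparing: after "energies" becomes "energy", the two terms are identical.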

Why was it changed?
The current rule-based system needs to be replaced with a more robust solution that maintains accuracy, scales efficiently, and reduces maintenance. Based on the analysis in Issue #98, this PR implements an NLP-based system for extracting data from ORCA log files as a proof of concept.

How was it changed?
Files added: orca_log_processor.py - Contains the implementation of the ORCALogProcessor class.

Key Functionalities Added:
This PR uses Natural Language Processing (NLP) to improve data extraction from text files, particularly when section names vary (e.g., "INNER ENERGIES" vs "INNER ENERGY"). Here’s what is done:

Normalization of Text:
- Text is converted to lowercase to ensure case insensitivity during matching.
- Words are lemmatized using spaCy, which reduces words to their base form (e.g., "energies" → "energy").
Fuzzy Matching:
- The similarity between terms is calculated using SequenceMatcher from Python's difflib module. Terms with a similarity ratio of at least 90% are treated as a match, so close variants such as "gibbs free energies" and "gibbs free energy" can be paired once they are normalized.
Multi-Word Matching:
- Multi-word terms like "gibbs free energies" are split and lemmatized word-by-word for robust comparison.
Section Matching:
- The extracted and normalized search term is compared against known section patterns (self.section_patterns) to find the correct section.
Cycle Boundaries Detection:
- Automatically detects start and end lines of GEOMETRY OPTIMIZATION CYCLES using regex.
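The cycle boundary detection above could look something like the following sketch. The header pattern, the function name `find_cycle_boundaries`, and the rule that a cycle ends on the line before the next header (or at end of file) are all assumptions made for illustration. The sample lines are placeholders, not real ORCA output.

```python
import re

# Assumed header shape; real ORCA logs may pad or decorate this line differently.
CYCLE_HEADER = re.compile(r"GEOMETRY OPTIMIZATION CYCLE\s+(\d+)")


def find_cycle_boundaries(lines):
    """Map cycle number -> (start_line, end_line), 0-based and inclusive."""
    starts = []
    for idx, line in enumerate(lines):
        header = CYCLE_HEADER.search(line)
        if header:
            starts.append((int(header.group(1)), idx))

    boundaries = {}
    for pos, (cycle, start) in enumerate(starts):
        # Assumption: a cycle runs until the line before the next header,
        # or to the last line of the file for the final cycle.
        if pos + 1 < len(starts):
            end = starts[pos + 1][1] - 1
        else:
            end = len(lines) - 1
        boundaries[cycle] = (start, end)
    return boundaries


# Placeholder lines standing in for a log file.
sample_log = [
    "*   GEOMETRY OPTIMIZATION CYCLE   1   *",
    "placeholder body line for cycle 1",
    "*   GEOMETRY OPTIMIZATION CYCLE   2   *",
    "placeholder body line for cycle 2",
    "another placeholder line",
]
cycles = find_cycle_boundaries(sample_log)
```

Returning a dictionary keyed by cycle number would let a caller select sections from specific cycles without rescanning the file.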

Future Improvements for Data Extraction
As this is a Proof of Concept (POC), further improvements can include:

Advanced NLP Models: Use more advanced NLP models like BERT or GPT to identify and extract sections based on semantic meaning, not just exact/fuzzy matches.
Keyword Extraction: Automatically extract keywords and patterns from text files instead of hardcoding section names.
Regular Expression Enhancements: Use regex to detect section headers more efficiently.
User Input Flexibility: Allow users to specify custom similarity thresholds or patterns dynamically.
Error Handling and Logs: Provide detailed logs for unmatched terms and offer suggestions for valid inputs.
Visualization: Visualize matched sections or highlight matched keywords for better readability.
Integration with Machine Learning: Train a classifier to automatically detect and group sections from new datasets.
Performance: Improve processing speed for very large log files.
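As one possible starting point for the "Error Handling and Logs" item above, unmatched terms could be answered with suggestions from `difflib.get_close_matches`, which is part of the standard library. The function name `suggest_sections`, the section list, and the 0.6 cutoff are illustrative assumptions, not part of this PR.

```python
from difflib import get_close_matches

# Illustrative section names; a real list would come from the processor.
KNOWN_SECTIONS = ["gibbs free energy", "inner energy", "enthalpy", "entropy"]


def suggest_sections(search_term, known_sections, cutoff=0.6, limit=3):
    """Return up to `limit` known sections that resemble the search term."""
    return get_close_matches(
        search_term.lower(), known_sections, n=limit, cutoff=cutoff
    )


suggestions = suggest_sections("GIBS FREE ENERGIES", KNOWN_SECTIONS)
no_suggestions = suggest_sections("xyz", KNOWN_SECTIONS)

if suggestions:
    print("No exact match. Did you mean: " + ", ".join(suggestions) + "?")
```

A lower cutoff than the 90% matching threshold is deliberate here: suggestions are only hints for the user, so they can afford to be looser than an automatic match.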

Screenshot
Input:
search_term: GIBBS FREE ENERGIES
lines_specified: FIRST 30
sections: 1, 2, 3

Preview Document Output:

SiriChandanaGarimella added 3 commits December 17, 2024 00:21

Implemented NLP POC

1221366

fixed failing test case

4b2dd50

Added coverage and pytest to requirements.txt

ffedc38

SiriChandanaGarimella requested a review from kungfuchicken December 17, 2024 07:00
