Added NLP-Based Section Matching and Data Extraction Logic Proof of Concept #125

Open
wants to merge 3 commits into
base: nlp_enhancements
from


SiriChandanaGarimella
Collaborator

@SiriChandanaGarimella SiriChandanaGarimella commented Dec 17, 2024

Fixes #99

What was changed?

  1. Implemented the section matching logic in the ORCALogProcessor class.
  2. Added enhanced NLP capabilities:
    • Text normalization (lowercasing and lemmatization with spaCy).
    • Fuzzy matching for near-identical terms (e.g., "gibbs free energies" → "gibbs free energy").
  3. Updated _nlp_match_section and added a helper function, _lemmatize_term.
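
For reviewers who want to try the idea in isolation, here is a minimal standalone sketch of normalization plus fuzzy matching. It is not the code in this PR: `lemmatize_term`, `match_section`, and `naive_lemma` are hypothetical names, the list-of-names shape for the section patterns is an assumption, and `naive_lemma` is a deliberately tiny stand-in for spaCy so the snippet runs with the standard library only.

```python
from difflib import SequenceMatcher


def naive_lemma(word):
    """Stand-in for spaCy lemmatization (assumption): only maps '-ies' to '-y'."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    return word


def lemmatize_term(term, lemmatizer=naive_lemma):
    """Lowercase a term and lemmatize it word by word."""
    return " ".join(lemmatizer(word) for word in term.lower().split())


def match_section(search_term, section_names, threshold=0.9, lemmatizer=naive_lemma):
    """Return the best-matching section name, or None if nothing reaches the threshold.

    section_names is assumed to be a plain list of header strings.
    """
    query = lemmatize_term(search_term, lemmatizer)
    best_name, best_score = None, 0.0
    for name in section_names:
        candidate = lemmatize_term(name, lemmatizer)
        score = SequenceMatcher(None, query, candidate).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


sections = ["INNER ENERGY", "GIBBS FREE ENERGY", "ENTROPY"]
print(match_section("GIBBS FREE ENERGIES", sections))
print(match_section("DIPOLE MOMENT", sections))
```

A spaCy-backed callable can be passed as `lemmatizer` in place of `naive_lemma`; the matching logic does not depend on which lemmatizer is used.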

Why was it changed?
The current rule-based system needs to be replaced with a more robust solution that maintains accuracy, scales easily, and reduces maintenance. Based on the analysis in Issue #98, this PR implements an NLP-based system for extracting data from ORCA log files as a proof of concept.

How was it changed?
Files added: orca_log_processor.py - Contains the implementation of the ORCALogProcessor class.

Key Functionalities Added:
This PR uses Natural Language Processing (NLP) to improve data extraction from text files, particularly when section names vary (e.g., "INNER ENERGIES" vs "INNER ENERGY"). Here is what it does:

  1. Normalization of Text:
    • Text is converted to lowercase to ensure case insensitivity during matching.
    • Words are lemmatized using spaCy, which reduces words to their base form (e.g., "energies" → "energy").
  2. Fuzzy Matching:
    • The similarity between terms is calculated using SequenceMatcher from Python's difflib module. Near-identical terms are matched using a similarity threshold of 90% (e.g., "gibbs free energies" vs "gibbs free energy").
  3. Multi-Word Matching:
    • Multi-word terms like "gibbs free energies" are split and lemmatized word-by-word for robust comparison.
  4. Section Matching:
    • The extracted and normalized search term is compared against known section patterns (self.section_patterns) to find the correct section.
  5. Cycle Boundaries Detection:
    • Automatically detects start and end lines of GEOMETRY OPTIMIZATION CYCLES using regex.
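
The cycle boundary detection in step 5 could look roughly like the sketch below. This is an illustration under stated assumptions, not the PR's implementation: the header pattern assumes each cycle is announced by a line containing "GEOMETRY OPTIMIZATION CYCLE" followed by its number, a cycle is assumed to end on the line before the next header (or at the end of the file), and `find_cycle_boundaries` is a hypothetical name.

```python
import re

# Assumed header format: "... GEOMETRY OPTIMIZATION CYCLE   <number> ..."
CYCLE_HEADER = re.compile(r"GEOMETRY OPTIMIZATION CYCLE\s+(\d+)")


def find_cycle_boundaries(lines):
    """Map each cycle number to (start_index, end_index), both 0-based and inclusive."""
    starts = []
    for index, line in enumerate(lines):
        match = CYCLE_HEADER.search(line)
        if match:
            starts.append((int(match.group(1)), index))

    boundaries = {}
    for position, (cycle, start) in enumerate(starts):
        if position + 1 < len(starts):
            # Assumption: a cycle ends just before the next cycle header.
            end = starts[position + 1][1] - 1
        else:
            # Assumption: the last cycle runs to the end of the file.
            end = len(lines) - 1
        boundaries[cycle] = (start, end)
    return boundaries


sample = [
    "some preamble",
    "*   GEOMETRY OPTIMIZATION CYCLE   1   *",
    "energy line",
    "gradient line",
    "*   GEOMETRY OPTIMIZATION CYCLE   2   *",
    "energy line",
]
print(find_cycle_boundaries(sample))
```

Returning a dictionary keyed by cycle number makes it straightforward to restrict a section search to the cycles a user asks for.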

Future Improvements for Data Extraction
As this is a Proof of Concept (POC), further improvements can include:

  1. Advanced NLP Models: Use more advanced NLP models like BERT or GPT to identify and extract sections based on semantic meaning, not just exact/fuzzy matches.
  2. Keyword Extraction: Automatically extract keywords and patterns from text files instead of hardcoding section names.
  3. Regular Expression Enhancements: Use regex to detect section headers more efficiently.
  4. User Input Flexibility: Allow users to specify custom similarity thresholds or patterns dynamically.
  5. Error Handling and Logs: Provide detailed logs for unmatched terms and offer suggestions for valid inputs.
  6. Visualization: Visualize matched sections or highlight matched keywords for better readability.
  7. Integration with Machine Learning: Train a classifier to automatically detect and group sections from new datasets.
  8. Performance: Improve processing speed for very large log files.
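
As one possible starting point for items 4 and 5, the sketch below takes a user-supplied similarity cutoff and logs suggestions when a search term has no exact match. `suggest_sections`, the logger name, and the list-of-names input are all hypothetical; only `difflib.get_close_matches` and the `logging` module are standard library calls.

```python
import logging
from difflib import get_close_matches

logger = logging.getLogger("orca_log_processor_sketch")  # hypothetical logger name


def suggest_sections(search_term, known_sections, cutoff=0.6, max_suggestions=3):
    """Return known section names similar to search_term, best match first.

    cutoff is the user-configurable similarity threshold (0.0 to 1.0).
    """
    # Compare in lowercase, but report the original spelling of each section.
    by_lowercase = {name.lower(): name for name in known_sections}
    hits = get_close_matches(
        search_term.lower(), list(by_lowercase), n=max_suggestions, cutoff=cutoff
    )
    suggestions = [by_lowercase[hit] for hit in hits]
    if suggestions:
        logger.warning("No exact match for %r; did you mean %s?", search_term, suggestions)
    else:
        logger.warning("No section resembling %r was found.", search_term)
    return suggestions


sections = ["INNER ENERGY", "GIBBS FREE ENERGY", "ENTROPY"]
print(suggest_sections("GIBBS ENERGY", sections, cutoff=0.8))
```

A stricter cutoff yields fewer, closer suggestions, so exposing it as a parameter lets users trade recall for precision.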

Screenshot
Input:
search_term: GIBBS FREE ENERGIES
lines_specified: FIRST 30
sections: 1, 2, 3

Preview Document Output:
image
