
Performance Refactoring #128

Merged
merged 23 commits into main from dev
Jul 6, 2024

Conversation

bpepple
Member

@bpepple bpepple commented Jul 4, 2024

Some refactoring to improve performance.

bpepple added 21 commits July 1, 2024 13:42
- Reduce I/O Operations: Instead of opening each image inside the inner loop, we can open all images at once and process them.
- Optimize Looping: Use list comprehensions where possible to make the code more efficient and Pythonic.
- Error Handling: Move error handling outside the inner loop to avoid repeated checks.
- Used the apply method to directly create DuplicateIssue objects for each matching row, eliminating the need for an explicit loop.
- Used list comprehension to collect the results, which is more efficient in terms of performance.
- Used List Comprehension for Filtering: Instead of using a for-loop, use list comprehension to filter out the items where pages are successfully removed.
- Batch Processing: Process the removal and feedback in a single loop to reduce the overhead of multiple function calls.
- Used query to filter the DataFrame and retrieve the row directly, reducing the number of operations on the DataFrame.
- Extract the row directly from the result of the query to avoid an additional indexing step.
- Optimize Duplicate Detection: Use pd.Series.duplicated directly on the DataFrame to avoid creating intermediate Series.
- Avoid Redundant DataFrame Creation: Directly filter the DataFrame without creating an intermediate variable for hashes.
- Use set for Uniqueness: Replace groupby with a set to directly obtain unique hash values, which is more efficient and does not require sorting.
- Convert to List: Convert the set back to a list to match the expected return type.
- Use with statement for file operations: This ensures that resources are properly managed and can slightly improve performance by reducing the overhead of manual resource management.
- Optimize exception handling: Move the try-except block to cover only the Image.open call to minimize the scope of exception handling.
- Combine the regular expressions into a single pattern to reduce the number of re.sub calls.
- Use a single re.sub call with a combined pattern to perform the replacements in one pass.
- Combine the regular expressions into a single substitution to reduce the number of passes over the input string.
- Use a more efficient pattern that captures both hyphens and underscores in one go.
- Combine duplicate space removal steps: Remove the redundant second call to remove duplicate spaces by combining it with the first call.
- Use regex for multiple replacements: Use a single regex substitution to handle multiple cleanup tasks in one pass.
- Combine Redundant Checks: Combine checks for md.cover_date and md.series to avoid redundant evaluations.
- Optimize replace_token Calls: Group replace_token calls to minimize the number of times the method is invoked.
- Simplify Month Name Calculation: Simplify the logic for calculating the month name.
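
The commit notes above are terse, so the sketches that follow illustrate how several of the described changes could look in practice. They are hedged approximations: any identifier, column name, or helper that does not appear in the notes themselves (the notes only name DuplicateIssue, replace_token, md.cover_date, and md.series) is an assumption for illustration, not the project's actual code.

A minimal sketch of the "reduce I/O operations" commit: every page image is opened and hashed exactly once up front, comprehensions do the grouping, and error handling sits at the outer loop rather than being repeated inside the comparison step.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

from PIL import Image


def group_pages_by_hash(page_paths: list[Path]) -> dict[str, list[Path]]:
    """Group page images that share a pixel hash (assumed duplicate pages)."""
    hashes: dict[Path, str] = {}
    for path in page_paths:
        try:
            # Each image is opened exactly once; the digest is reused for all
            # later comparisons, so no I/O happens in the inner comparison loop.
            with Image.open(path) as img:
                hashes[path] = hashlib.md5(img.tobytes()).hexdigest()
        except OSError:
            # Error handling lives at the outer loop, not inside the comparisons.
            continue

    groups: defaultdict[str, list[Path]] = defaultdict(list)
    for path, digest in hashes.items():
        groups[digest].append(path)
    # A dict comprehension keeps only hashes shared by more than one page.
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```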
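
A hedged sketch of the DataFrame.apply commit: apply() builds one DuplicateIssue per matching row and tolist() collects the results, replacing an explicit append loop. The DuplicateIssue fields and the "path"/"index" column names are assumptions.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class DuplicateIssue:
    path: str
    pages_index: list[int]


def to_duplicate_issues(matches: pd.DataFrame) -> list[DuplicateIssue]:
    if matches.empty:
        # apply() on an empty frame returns an empty DataFrame, so bail out early.
        return []
    # One row -> one DuplicateIssue, with no explicit for-loop.
    return matches.apply(
        lambda row: DuplicateIssue(row["path"], [row["index"]]), axis=1
    ).tolist()
```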
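
A sketch of the "filter with a comprehension, report in one pass" commit: a single comprehension keeps the issues whose pages were actually removed, and one loop handles the user feedback. remove_page here is a stand-in for the real removal routine.

```python
from dataclasses import dataclass


@dataclass
class DuplicateIssue:
    path: str
    pages_index: list[int]


def remove_page(issue: DuplicateIssue) -> bool:
    """Stand-in for the real page-removal call; returns True on success."""
    return bool(issue.pages_index)


def remove_duplicate_pages(issues: list[DuplicateIssue]) -> list[DuplicateIssue]:
    # The comprehension replaces a for-loop that appended successes to a list...
    removed = [issue for issue in issues if remove_page(issue)]
    # ...and a single batch loop reports the outcome instead of interleaving
    # removal, checks, and printing across several passes.
    for issue in removed:
        print(f"Removed duplicate pages from {issue.path}")
    return removed
```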
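
A sketch of the DataFrame.query() commit: the frame is filtered and the matching row pulled out in one step, instead of building a boolean mask, indexing, and then selecting. The "page_hash" column is an assumption.

```python
import pandas as pd


def get_page_row(df: pd.DataFrame, target: str) -> pd.Series | None:
    # query() filters the frame directly; iloc[0] then extracts the row without
    # an additional intermediate indexing step.
    result = df.query("page_hash == @target")
    return result.iloc[0] if not result.empty else None
```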
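
A sketch of the duplicate-hash detection commit: duplicated() is applied to the DataFrame column in place, a set replaces groupby for uniqueness, and the result is converted back to a list to match the expected return type. The "page_hash" column is again an assumption.

```python
import pandas as pd


def distinct_duplicate_hashes(df: pd.DataFrame) -> list[str]:
    # keep=False marks every occurrence of a repeated hash; the mask is used
    # directly, with no intermediate Series variable.
    dupes = df[df["page_hash"].duplicated(keep=False)]
    # set() gives uniqueness without the sorting/grouping overhead of groupby,
    # then list() restores the expected return type.
    return list(set(dupes["page_hash"]))
```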
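
A sketch of the context-manager / narrow-except commit: the try block guards only Image.open, and the with statement guarantees the file handle is released. The hashing step is illustrative.

```python
import hashlib
from pathlib import Path

from PIL import Image


def image_hash(path: Path) -> str | None:
    try:
        img = Image.open(path)  # the only call the except clause is meant to cover
    except OSError:
        return None
    with img:
        # The context manager closes the underlying file even if hashing fails.
        return hashlib.md5(img.tobytes()).hexdigest()
```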
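
A sketch of the combined-regex commits: a single compiled pattern handles both hyphens and underscores, so one re.sub pass replaces what used to be two substitutions. The cleanup helper is a plausible filename normalizer, not the project's exact code.

```python
import re

# One pattern covers both separators; a run of either collapses to one space.
_SEPARATORS = re.compile(r"[-_]+")


def clean_separators(name: str) -> str:
    return _SEPARATORS.sub(" ", name)


print(clean_separators("amazing_spider-man_001"))  # -> "amazing spider man 001"
```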
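
A sketch of the space-cleanup commit: a single substitution collapses duplicate whitespace, removing the need for a second duplicate-space pass. The helper name is an assumption.

```python
import re

_WHITESPACE_RUN = re.compile(r"\s+")


def squash_spaces(text: str) -> str:
    # One pass collapses every whitespace run to a single space; strip() then
    # tidies the ends, so no second duplicate-space call is needed.
    return _WHITESPACE_RUN.sub(" ", text).strip()


print(squash_spaces("  Amazing   Spider-Man  001 "))  # -> "Amazing Spider-Man 001"
```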
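
Finally, a hedged sketch of the template-rendering commit: the md.cover_date/md.series condition is evaluated once, the replace_token calls are grouped into a single loop, and calendar.month_name supplies the month name directly. The metadata fields beyond cover_date and series, the token names, and replace_token's signature are assumptions.

```python
import calendar
from dataclasses import dataclass
from datetime import date


@dataclass
class Metadata:
    series: str | None = None
    cover_date: date | None = None


def replace_token(text: str, token: str, value: str | None) -> str:
    return text.replace(token, value or "")


def render_name(template: str, md: Metadata) -> str:
    # The combined check is evaluated once instead of re-testing cover_date and
    # series at each token.
    has_date = md.series is not None and md.cover_date is not None

    # calendar.month_name indexes straight by month number; no lookup table.
    month_name = calendar.month_name[md.cover_date.month] if has_date else ""
    year = str(md.cover_date.year) if has_date else ""

    # Grouped replace_token calls: a single tidy pass over all tokens.
    for token, value in (
        ("%series%", md.series),
        ("%year%", year),
        ("%month_name%", month_name),
    ):
        template = replace_token(template, token, value)
    return template


print(render_name("%series% (%year%) - %month_name%", Metadata("Blackhawk", date(1950, 3, 1))))
# -> "Blackhawk (1950) - March"
```
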
@bpepple bpepple self-assigned this Jul 4, 2024
@bpepple bpepple added the chore Miscellaneous drudgery label Jul 4, 2024
bpepple added 2 commits July 6, 2024 10:56
- Should give better information to the user on slow IO operations.
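
A hedged sketch of what "better information on slow I/O" could look like: a simple per-file progress line printed while the pages are processed, so long archive scans are not silent. The reporting style and helper name are assumptions.

```python
from pathlib import Path


def process_pages_with_feedback(paths: list[Path]) -> None:
    total = len(paths)
    for count, path in enumerate(paths, start=1):
        # Overwrite one status line per file so the user can see progress.
        print(f"[{count}/{total}] processing {path.name} ...", end="\r", flush=True)
        # ... slow I/O work (opening, hashing, writing the archive) goes here ...
    print()  # drop to a fresh line once the batch is done
```
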
@bpepple bpepple merged commit 6836fed into main Jul 6, 2024
11 checks passed
@bpepple bpepple deleted the dev branch July 6, 2024 15:34