
Performance Refactoring #128

Merged
merged 23 commits into main from dev
Jul 6, 2024

Conversation

bpepple
Member

@bpepple bpepple commented Jul 4, 2024

Some refactoring to improve performance.

bpepple added 21 commits July 1, 2024 13:42
- Reduce I/O Operations: Instead of opening each image inside the inner loop, we can open all images at once and process them.
- Optimize Looping: Use list comprehensions where possible to make the code more efficient and Pythonic.
- Error Handling: Move error handling outside the inner loop to avoid repeated checks.
- Used the apply method to directly create DuplicateIssue objects for each matching row, eliminating the need for an explicit loop.
- Used list comprehension to collect the results, which is more efficient in terms of performance.
- Used List Comprehension for Filtering: Instead of using a for-loop, use list comprehension to filter out the items where pages are successfully removed.
- Batch Processing: Process the removal and feedback in a single loop to reduce the overhead of multiple function calls.
- Used query to filter the DataFrame and retrieve the row directly, reducing the number of operations on the DataFrame.
- Extract the row directly from the result of the query to avoid an additional indexing step.
- Optimize Duplicate Detection: Use pd.Series.duplicated directly on the DataFrame to avoid creating intermediate Series.
- Avoid Redundant DataFrame Creation: Directly filter the DataFrame without creating an intermediate variable for hashes.
- Use set for Uniqueness: Replace groupby with a set to directly obtain unique hash values, which is more efficient and does not require sorting.
- Convert to List: Convert the set back to a list to match the expected return type.
- Use with statement for file operations: This ensures that resources are properly managed and can slightly improve performance by reducing the overhead of manual resource management.
- Optimize exception handling: Move the try-except block to cover only the Image.open call to minimize the scope of exception handling.
- Combine the regular expressions into a single pattern to reduce the number of re.sub calls.
- Use a single re.sub call with a combined pattern to perform the replacements in one pass.
- Combine the regular expressions into a single substitution to reduce the number of passes over the input string.
- Use a more efficient pattern that captures both hyphens and underscores in one go.
- Combine duplicate space removal steps: Remove the redundant second call to remove duplicate spaces by combining it with the first call.
- Use regex for multiple replacements: Use a single regex substitution to handle multiple cleanup tasks in one pass.
- Combine Redundant Checks: Combine checks for md.cover_date and md.series to avoid redundant evaluations.
- Optimize replace_token Calls: Group replace_token calls to minimize the number of times the method is invoked.
- Simplify Month Name Calculation: Simplify the logic for calculating the month name.
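
The commit notes above are terse, so the sketches that follow illustrate how several of the described changes could look in practice. They are hedged approximations: any identifier, column name, or helper that does not appear in the notes themselves (the notes only name DuplicateIssue, replace_token, md.cover_date, and md.series) is an assumption for illustration, not the project's actual code.

A minimal sketch of the "reduce I/O operations" commit: every page image is opened and hashed exactly once up front, comprehensions do the grouping, and error handling sits at the outer loop rather than being repeated inside the comparison step.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

from PIL import Image


def group_pages_by_hash(page_paths: list[Path]) -> dict[str, list[Path]]:
    """Group page images that share a pixel hash (assumed duplicate pages)."""
    hashes: dict[Path, str] = {}
    for path in page_paths:
        try:
            # Each image is opened exactly once; the digest is reused for all
            # later comparisons, so no I/O happens in the inner comparison loop.
            with Image.open(path) as img:
                hashes[path] = hashlib.md5(img.tobytes()).hexdigest()
        except OSError:
            # Error handling lives at the outer loop, not inside the comparisons.
            continue

    groups: defaultdict[str, list[Path]] = defaultdict(list)
    for path, digest in hashes.items():
        groups[digest].append(path)
    # A dict comprehension keeps only hashes shared by more than one page.
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```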
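
A hedged sketch of the DataFrame.apply commit: apply() builds one DuplicateIssue per matching row and tolist() collects the results, replacing an explicit append loop. The DuplicateIssue fields and the "path"/"index" column names are assumptions.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class DuplicateIssue:
    path: str
    pages_index: list[int]


def to_duplicate_issues(matches: pd.DataFrame) -> list[DuplicateIssue]:
    if matches.empty:
        # apply() on an empty frame returns an empty DataFrame, so bail out early.
        return []
    # One row -> one DuplicateIssue, with no explicit for-loop.
    return matches.apply(
        lambda row: DuplicateIssue(row["path"], [row["index"]]), axis=1
    ).tolist()
```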
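
A sketch of the "filter with a comprehension, report in one pass" commit: a single comprehension keeps the issues whose pages were actually removed, and one loop handles the user feedback. remove_page here is a stand-in for the real removal routine.

```python
from dataclasses import dataclass


@dataclass
class DuplicateIssue:
    path: str
    pages_index: list[int]


def remove_page(issue: DuplicateIssue) -> bool:
    """Stand-in for the real page-removal call; returns True on success."""
    return bool(issue.pages_index)


def remove_duplicate_pages(issues: list[DuplicateIssue]) -> list[DuplicateIssue]:
    # The comprehension replaces a for-loop that appended successes to a list...
    removed = [issue for issue in issues if remove_page(issue)]
    # ...and a single batch loop reports the outcome instead of interleaving
    # removal, checks, and printing across several passes.
    for issue in removed:
        print(f"Removed duplicate pages from {issue.path}")
    return removed
```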
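
A sketch of the DataFrame.query() commit: the frame is filtered and the matching row pulled out in one step, instead of building a boolean mask, indexing, and then selecting. The "page_hash" column is an assumption.

```python
import pandas as pd


def get_page_row(df: pd.DataFrame, target: str) -> pd.Series | None:
    # query() filters the frame directly; iloc[0] then extracts the row without
    # an additional intermediate indexing step.
    result = df.query("page_hash == @target")
    return result.iloc[0] if not result.empty else None
```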
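
A sketch of the duplicate-hash detection commit: duplicated() is applied to the DataFrame column in place, a set replaces groupby for uniqueness, and the result is converted back to a list to match the expected return type. The "page_hash" column is again an assumption.

```python
import pandas as pd


def distinct_duplicate_hashes(df: pd.DataFrame) -> list[str]:
    # keep=False marks every occurrence of a repeated hash; the mask is used
    # directly, with no intermediate Series variable.
    dupes = df[df["page_hash"].duplicated(keep=False)]
    # set() gives uniqueness without the sorting/grouping overhead of groupby,
    # then list() restores the expected return type.
    return list(set(dupes["page_hash"]))
```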
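
A sketch of the context-manager / narrow-except commit: the try block guards only Image.open, and the with statement guarantees the file handle is released. The hashing step is illustrative.

```python
import hashlib
from pathlib import Path

from PIL import Image


def image_hash(path: Path) -> str | None:
    try:
        img = Image.open(path)  # the only call the except clause is meant to cover
    except OSError:
        return None
    with img:
        # The context manager closes the underlying file even if hashing fails.
        return hashlib.md5(img.tobytes()).hexdigest()
```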
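
A sketch of the combined-regex commits: a single compiled pattern handles both hyphens and underscores, so one re.sub pass replaces what used to be two substitutions. The cleanup helper is a plausible filename normalizer, not the project's exact code.

```python
import re

# One pattern covers both separators; a run of either collapses to one space.
_SEPARATORS = re.compile(r"[-_]+")


def clean_separators(name: str) -> str:
    return _SEPARATORS.sub(" ", name)


print(clean_separators("amazing_spider-man_001"))  # -> "amazing spider man 001"
```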
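
A sketch of the space-cleanup commit: a single substitution collapses duplicate whitespace, removing the need for a second duplicate-space pass. The helper name is an assumption.

```python
import re

_WHITESPACE_RUN = re.compile(r"\s+")


def squash_spaces(text: str) -> str:
    # One pass collapses every whitespace run to a single space; strip() then
    # tidies the ends, so no second duplicate-space call is needed.
    return _WHITESPACE_RUN.sub(" ", text).strip()


print(squash_spaces("  Amazing   Spider-Man  001 "))  # -> "Amazing Spider-Man 001"
```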
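
Finally, a hedged sketch of the template-rendering commit: the md.cover_date/md.series condition is evaluated once, the replace_token calls are grouped into a single loop, and calendar.month_name supplies the month name directly. The metadata fields beyond cover_date and series, the token names, and replace_token's signature are assumptions.

```python
import calendar
from dataclasses import dataclass
from datetime import date


@dataclass
class Metadata:
    series: str | None = None
    cover_date: date | None = None


def replace_token(text: str, token: str, value: str | None) -> str:
    return text.replace(token, value or "")


def render_name(template: str, md: Metadata) -> str:
    # The combined check is evaluated once instead of re-testing cover_date and
    # series at each token.
    has_date = md.series is not None and md.cover_date is not None

    # calendar.month_name indexes straight by month number; no lookup table.
    month_name = calendar.month_name[md.cover_date.month] if has_date else ""
    year = str(md.cover_date.year) if has_date else ""

    # Grouped replace_token calls: a single tidy pass over all tokens.
    for token, value in (
        ("%series%", md.series),
        ("%year%", year),
        ("%month_name%", month_name),
    ):
        template = replace_token(template, token, value)
    return template


print(render_name("%series% (%year%) - %month_name%", Metadata("Blackhawk", date(1950, 3, 1))))
# -> "Blackhawk (1950) - March"
```
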
@bpepple bpepple self-assigned this Jul 4, 2024
@bpepple bpepple added the chore Miscellaneous drudgery label Jul 4, 2024
bpepple added 2 commits July 6, 2024 10:56
- Should give better information to the user on slow IO operations.
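
A hedged sketch of what "better information on slow I/O" could look like: a simple per-file progress line printed while the pages are processed, so long archive scans are not silent. The reporting style and helper name are assumptions.

```python
from pathlib import Path


def process_pages_with_feedback(paths: list[Path]) -> None:
    total = len(paths)
    for count, path in enumerate(paths, start=1):
        # Overwrite one status line per file so the user can see progress.
        print(f"[{count}/{total}] processing {path.name} ...", end="\r", flush=True)
        # ... slow I/O work (opening, hashing, writing the archive) goes here ...
    print()  # drop to a fresh line once the batch is done
```
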
@bpepple bpepple merged commit 6836fed into main Jul 6, 2024
11 checks passed
@bpepple bpepple deleted the dev branch July 6, 2024 15:34