Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Micro-optimizations to speed up preprocessing by ~2x #214

Merged
merged 1 commit into from
Nov 18, 2023

Conversation

ankurdave
Copy link
Contributor

On my 12,000-transaction ledger, beancount_import.reconcile.LoadedReconciler._preprocess_entries() currently takes 11.84 seconds. I measured and profiled it as follows:

modified   beancount_import/reconcile.py
@@ -442,2 +442,5 @@ class LoadedReconciler(object):
     def _preprocess_entries(self):
+        import cProfile
+        pr = cProfile.Profile()
+        pr.enable()
         posting_db = self.posting_db
@@ -453,2 +456,4 @@ class LoadedReconciler(object):
                 self.balance_entries[key] = entry.amount.number
+        pr.disable()
+        pr.print_stats(sort='time')

This PR contains the following micro-optimizations to matching.py that brought the time down to 5.98 seconds, a 2x speedup:

  1. Avoid duplicate calls to _entry_and_posting_ids_key() in add_posting(), remove_posting(), and _get_matches().
  2. Hoist the call to _search_posting() out of the fuzzy date loop in add_posting() and remove_posting().

@Zburatorul
Copy link
Collaborator

Impressive work @ankurdave. I've never had the stomach for understand the matching logic.

@Zburatorul Zburatorul closed this Nov 18, 2023
@Zburatorul
Copy link
Collaborator

Also, please paste the profiler call table for posterity :)

@Zburatorul Zburatorul reopened this Nov 18, 2023
@Zburatorul
Copy link
Collaborator

Wrong button. Meant to merge, not close.

@Zburatorul Zburatorul merged commit 3bcd35b into jbms:master Nov 18, 2023
19 checks passed
@ankurdave
Copy link
Contributor Author

Before:

         20664200 function calls in 11.972 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1017600    2.767    0.000    4.268    0.000 matching.py:215(_entry_and_posting_ids_key)
    84800    1.272    0.000    9.529    0.000 matching.py:283(add_posting)
   932800    1.229    0.000    6.268    0.000 matching.py:272(_search_posting)
  3924816    1.004    0.000    1.368    0.000 matching.py:216(<genexpr>)
  1017600    0.895    0.000    0.895    0.000 matching.py:256(get_fuzzy_date_range)
  5069708    0.622    0.000    0.622    0.000 {built-in method builtins.id}
   476638    0.493    0.000    0.997    0.000 matching.py:871(posting_set_id)
  1130506    0.432    0.000    0.673    0.000 matching.py:196(get_posting_weight)
  1017600    0.397    0.000    1.292    0.000 matching.py:261(fuzz_date_currency_key)
  1607694    0.380    0.000    0.503    0.000 matching.py:872(<genexpr>)
    57773    0.357    0.000    1.633    0.000 matching.py:884(add_subset)
     8944    0.277    0.000    0.277    0.000 {built-in method builtins.max}
  1263068    0.267    0.000    0.267    0.000 {built-in method __new__ of type object at 0x10e96ffa8}
    12181    0.134    0.000    2.426    0.000 __init__.py:731(wrapper)
  1018444    0.119    0.000    0.119    0.000 {method 'append' of 'list' objects}
    91051    0.089    0.000    0.164    0.000 amount.py:42(__new__)
    96981    0.086    0.000    2.653    0.000 matching.py:930(get_matchable_postings)
     9968    0.078    0.000    0.078    0.000 {method 'items' of 'collections.OrderedDict' objects}
    12181    0.077    0.000   12.370    0.001 matching.py:375(add_transaction)
    57773    0.075    0.000    0.129    0.000 {built-in method builtins.sum}
    84800    0.074    0.000    0.202    0.000 matching.py:265(_date_currency_key)
    34304    0.066    0.000    0.178    0.000 booking_full.py:680(convert_costspec_to_cost)
     9968    0.062    0.000    1.784    0.000 matching.py:856(_get_uncleared_aggregate_posting_candidates)
   273197    0.054    0.000    0.054    0.000 matching.py:885(<genexpr>)
    84800    0.048    0.000    0.075    0.000 matching.py:191(is_unknown_account)
    12181    0.045    0.000    2.513    0.000 matching.py:812(get_aggregate_posting_candidates)
    84800    0.041    0.000    0.082    0.000 matching.py:211(_date_key)
    84800    0.034    0.000    0.054    0.000 <string>:1(<lambda>)
    84800    0.034    0.000    0.041    0.000 posting_date.py:5(get_posting_date)
    34304    0.033    0.000    0.085    0.000 __init__.py:452(_replace)
     9968    0.033    0.000    0.433    0.000 __init__.py:72(__setitem__)
   258325    0.033    0.000    0.033    0.000 {built-in method builtins.isinstance}
    12181    0.027    0.000    0.092    0.000 matching.py:927(<listcomp>)
    84793    0.027    0.000    0.027    0.000 {method 'startswith' of 'str' objects}
    21125    0.023    0.000    0.054    0.000 __init__.py:179(__getitem__)
    21125    0.023    0.000    0.028    0.000 __init__.py:66(__getitem__)
    34304    0.020    0.000    0.052    0.000 __init__.py:442(_make)
     8944    0.019    0.000    0.312    0.000 __init__.py:610(most_common)
    24714    0.019    0.000    0.029    0.000 reconcile.py:460(is_posting_cleared)
    97051    0.017    0.000    0.017    0.000 {method 'get' of 'dict' objects}
   106943    0.015    0.000    0.015    0.000 {built-in method builtins.len}
     9968    0.015    0.000    0.449    0.000 __init__.py:185(__setitem__)
     8944    0.015    0.000    0.396    0.000 __init__.py:193(popitem)
    12181    0.015    0.000    0.024    0.000 matching.py:856(<lambda>)
     8944    0.015    0.000    0.069    0.000 __init__.py:111(pop)
     8944    0.012    0.000    0.291    0.000 heapq.py:523(nlargest)
    12181    0.010    0.000    0.111    0.000 matching.py:965(get_matchable_postings_from_transaction)
     8944    0.010    0.000    0.012    0.000 __init__.py:88(__delitem__)
     8944    0.010    0.000    0.032    0.000 __init__.py:189(__delitem__)
    40287    0.010    0.000    0.010    0.000 matching.py:949(<genexpr>)
     8944    0.010    0.000    0.010    0.000 __init__.py:728(__delitem__)
    12181    0.008    0.000    0.101    0.000 matching.py:926(get_weighted_postings)
    26017    0.007    0.000    0.009    0.000 matching.py:856(<genexpr>)
    20101    0.006    0.000    0.006    0.000 __init__.py:93(__contains__)
     9968    0.006    0.000    0.006    0.000 __init__.py:96(__missing__)
    13836    0.006    0.000    0.006    0.000 {method 'setdefault' of 'collections.OrderedDict' objects}
    11583    0.005    0.000    0.005    0.000 ofx.py:1528(is_posting_cleared)
     9968    0.002    0.000    0.002    0.000 __init__.py:35(__setitem__)
     8944    0.002    0.000    0.002    0.000 {built-in method builtins.iter}
     8944    0.002    0.000    0.002    0.000 {method 'items' of 'dict' objects}
     9968    0.002    0.000    0.002    0.000 __init__.py:138(getsizeof)
     6328    0.002    0.000    0.002    0.000 {built-in method builtins.abs}
     9968    0.001    0.000    0.001    0.000 __init__.py:601(__missing__)
     8944    0.001    0.000    0.001    0.000 __init__.py:38(pop)
      521    0.001    0.000    0.001    0.000 {method 'extend' of 'list' objects}
      294    0.000    0.000    0.000    0.000 {built-in method builtins.hash}
      300    0.000    0.000    0.001    0.000 {method 'add' of 'set' objects}
      342    0.000    0.000    0.000    0.000 {built-in method builtins.min}
      379    0.000    0.000    0.000    0.000 venmo_json_source.py:268(is_posting_cleared)
      857    0.000    0.000    0.000    0.000 costco_receipt_source.py:237(is_posting_cleared)
       20    0.000    0.000    0.000    0.000 matching.py:874(partition)
      117    0.000    0.000    0.000    0.000 adp_payroll_source.py:195(is_posting_cleared)
      294    0.000    0.000    0.000    0.000 amount.py:101(__hash__)
      115    0.000    0.000    0.000    0.000 cashapp_csv_source.py:209(is_posting_cleared)
      269    0.000    0.000    0.000    0.000 matching.py:907(<lambda>)
       37    0.000    0.000    0.000    0.000 workday_payroll_source.py:211(is_posting_cleared)
       14    0.000    0.000    0.000    0.000 amount.py:149(mul)
       44    0.000    0.000    0.000    0.000 boa_mortgage_csv_source.py:175(is_posting_cleared)
        3    0.000    0.000    0.000    0.000 {method 'setdefault' of 'dict' objects}
        8    0.000    0.000    0.000    0.000 amount.py:83(__eq__)
        1    0.000    0.000    0.000    0.000 journal_editor.py:296(all_entries)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

After:

         9725404 function calls in 5.913 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    84800    0.897    0.000    3.127    0.000 matching.py:283(add_posting)
  1017600    0.766    0.000    0.766    0.000 matching.py:256(get_fuzzy_date_range)
   476638    0.670    0.000    1.164    0.000 matching.py:871(posting_set_id)
  1607694    0.372    0.000    0.494    0.000 matching.py:872(<genexpr>)
  1017600    0.369    0.000    1.134    0.000 matching.py:261(fuzz_date_currency_key)
    57773    0.347    0.000    1.784    0.000 matching.py:884(add_subset)
    84800    0.347    0.000    0.482    0.000 matching.py:215(_entry_and_posting_ids_key)
     8944    0.254    0.000    0.254    0.000 {built-in method builtins.max}
  1471960    0.169    0.000    0.169    0.000 {built-in method builtins.id}
    21125    0.166    0.000    0.193    0.000 __init__.py:179(__getitem__)
  1018444    0.112    0.000    0.112    0.000 {method 'append' of 'list' objects}
   282506    0.109    0.000    0.185    0.000 matching.py:196(get_posting_weight)
    84800    0.103    0.000    0.198    0.000 matching.py:272(_search_posting)
   327068    0.090    0.000    0.123    0.000 matching.py:216(<genexpr>)
   343628    0.079    0.000    0.079    0.000 {built-in method __new__ of type object at 0x1008c6fa8}
    96981    0.077    0.000    2.703    0.000 matching.py:930(get_matchable_postings)
    57773    0.072    0.000    0.125    0.000 {built-in method builtins.sum}
    84800    0.070    0.000    0.193    0.000 matching.py:265(_date_currency_key)
    12181    0.065    0.000    5.999    0.000 matching.py:376(add_transaction)
    67231    0.064    0.000    0.117    0.000 amount.py:42(__new__)
     9968    0.057    0.000    1.853    0.000 matching.py:856(_get_uncleared_aggregate_posting_candidates)
   273197    0.054    0.000    0.054    0.000 matching.py:885(<genexpr>)
    84800    0.042    0.000    0.064    0.000 matching.py:191(is_unknown_account)
    12181    0.040    0.000    2.576    0.000 matching.py:812(get_aggregate_posting_candidates)
    84800    0.038    0.000    0.079    0.000 matching.py:211(_date_key)
    12181    0.033    0.000    2.497    0.000 __init__.py:731(wrapper)
    84800    0.033    0.000    0.040    0.000 posting_date.py:5(get_posting_date)
    84800    0.033    0.000    0.050    0.000 <string>:1(<lambda>)
     9968    0.030    0.000    0.393    0.000 __init__.py:72(__setitem__)
    12181    0.025    0.000    0.086    0.000 matching.py:927(<listcomp>)
   186855    0.023    0.000    0.023    0.000 {built-in method builtins.isinstance}
    84793    0.022    0.000    0.022    0.000 {method 'startswith' of 'str' objects}
    10494    0.021    0.000    0.058    0.000 booking_full.py:680(convert_costspec_to_cost)
    21125    0.019    0.000    0.025    0.000 __init__.py:66(__getitem__)
    24714    0.018    0.000    0.028    0.000 reconcile.py:460(is_posting_cleared)
    12181    0.017    0.000    0.027    0.000 matching.py:856(<lambda>)
     8944    0.017    0.000    0.286    0.000 __init__.py:610(most_common)
    97051    0.017    0.000    0.017    0.000 {method 'get' of 'dict' objects}
     9968    0.014    0.000    0.408    0.000 __init__.py:185(__setitem__)
     8944    0.013    0.000    0.061    0.000 __init__.py:111(pop)
     8944    0.012    0.000    0.359    0.000 __init__.py:193(popitem)
     8944    0.012    0.000    0.268    0.000 heapq.py:523(nlargest)
    10494    0.012    0.000    0.029    0.000 __init__.py:452(_replace)
    83133    0.011    0.000    0.011    0.000 {built-in method builtins.len}
    12181    0.010    0.000    0.103    0.000 matching.py:965(get_matchable_postings_from_transaction)
    40287    0.009    0.000    0.009    0.000 matching.py:949(<genexpr>)
     8944    0.009    0.000    0.010    0.000 __init__.py:88(__delitem__)
     8944    0.009    0.000    0.009    0.000 __init__.py:728(__delitem__)
     8944    0.008    0.000    0.027    0.000 __init__.py:189(__delitem__)
    26017    0.008    0.000    0.010    0.000 matching.py:856(<genexpr>)
    12181    0.007    0.000    0.093    0.000 matching.py:926(get_weighted_postings)
    10494    0.007    0.000    0.017    0.000 __init__.py:442(_make)
     9968    0.006    0.000    0.006    0.000 __init__.py:96(__missing__)
    20101    0.006    0.000    0.006    0.000 __init__.py:93(__contains__)
    13836    0.005    0.000    0.005    0.000 {method 'setdefault' of 'collections.OrderedDict' objects}
    11583    0.005    0.000    0.005    0.000 ofx.py:1528(is_posting_cleared)
     9968    0.002    0.000    0.002    0.000 __init__.py:35(__setitem__)
     9968    0.002    0.000    0.002    0.000 {method 'items' of 'collections.OrderedDict' objects}
     8944    0.002    0.000    0.002    0.000 {built-in method builtins.iter}
     9968    0.002    0.000    0.002    0.000 __init__.py:138(getsizeof)
     8944    0.002    0.000    0.002    0.000 {method 'items' of 'dict' objects}
     8944    0.002    0.000    0.002    0.000 __init__.py:38(pop)
     9968    0.001    0.000    0.001    0.000 __init__.py:601(__missing__)
      521    0.001    0.000    0.001    0.000 {method 'extend' of 'list' objects}
     1808    0.000    0.000    0.000    0.000 {built-in method builtins.abs}
      294    0.000    0.000    0.000    0.000 {built-in method builtins.hash}
      300    0.000    0.000    0.001    0.000 {method 'add' of 'set' objects}
      342    0.000    0.000    0.000    0.000 {built-in method builtins.min}
      857    0.000    0.000    0.000    0.000 costco_receipt_source.py:237(is_posting_cleared)
      379    0.000    0.000    0.000    0.000 venmo_json_source.py:268(is_posting_cleared)
      294    0.000    0.000    0.000    0.000 amount.py:101(__hash__)
       20    0.000    0.000    0.000    0.000 matching.py:874(partition)
      117    0.000    0.000    0.000    0.000 adp_payroll_source.py:195(is_posting_cleared)
      269    0.000    0.000    0.000    0.000 matching.py:907(<lambda>)
      115    0.000    0.000    0.000    0.000 cashapp_csv_source.py:209(is_posting_cleared)
       37    0.000    0.000    0.000    0.000 workday_payroll_source.py:211(is_posting_cleared)
        4    0.000    0.000    0.000    0.000 amount.py:149(mul)
       44    0.000    0.000    0.000    0.000 boa_mortgage_csv_source.py:175(is_posting_cleared)
        3    0.000    0.000    0.000    0.000 {method 'setdefault' of 'dict' objects}
        8    0.000    0.000    0.000    0.000 amount.py:83(__eq__)
        1    0.000    0.000    0.000    0.000 journal_editor.py:296(all_entries)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants