I've been playing around with some parameters of the TF-IDF agent.
I've found that if we stop using a threshold (cosine similarity >= 0.30) to filter the match results, the accuracy improves by up to 3 points. However, filtering helps reduce the compute time, since the results are sorted at the end of the search. See the piece of code I am talking about in atarashi/agents/tfidf.py (especially lines 126 and 133):
```python
print("time taken is " + str(time.time() - startTime) + " sec")
return matches
```
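For context, the matching step under discussion can be sketched roughly as below. This is a hypothetical, simplified version (the license corpus, function name, and structure are my own illustration, not the agent's actual code), showing how a similarity threshold trades accuracy for speed: a lower threshold keeps more candidates to sort.

```python
# Hypothetical sketch of TF-IDF matching with a similarity threshold.
# Corpus, names, and structure are illustrative, not atarashi's real code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

licenses = {
    "MIT": "permission is hereby granted free of charge to any person",
    "GPL": "this program is free software you can redistribute it",
    "BSD": "redistribution and use in source and binary forms",
}

def match_license(query, threshold=0.30):
    names = list(licenses)
    vectorizer = TfidfVectorizer()
    # Fit on the license corpus plus the query so they share one vocabulary.
    matrix = vectorizer.fit_transform([licenses[n] for n in names] + [query])
    sims = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    # Keep only matches at or above the threshold (0.0 keeps everything).
    matches = [(n, s) for n, s in zip(names, sims) if s >= threshold]
    # Sorting happens at the end, so a lower threshold means more
    # candidates survive to this step, which costs extra time.
    matches.sort(key=lambda m: m[1], reverse=True)
    return matches

print(match_license("free software redistribute", threshold=0.0))
```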
Using the evaluation.py script, I've carried out some experiments:
| # | Algorithm | Time elapsed (s) | Accuracy |
|---|-----------|------------------|----------|
| 1 | tfidf (CosineSim) (thr=0.30) | 30.19 | 59.0% |
| 2 | tfidf (CosineSim) (thr=0.17) | 35.29 | 61.0% |
| 3 | tfidf (CosineSim) (thr=0.16, max_df=0.10) | 27.34 | 62.0% |
| 4 | tfidf (CosineSim) (thr=0.16) | 36.42 | 62.0% |
| 5 | tfidf (CosineSim) (thr=0.15) | 38.45 | 62.0% |
| 6 | tfidf (CosineSim) (thr=0.10) | 39.91 | 62.0% |
| 7 | tfidf (CosineSim) (thr=0.00) | 61.49 | 62.0% |
| 8 | Ngram (CosineSim) | - | 57.0% |
| 9 | Ngram (BigramCosineSim) | - | 56.0% |
| 10 | Ngram (DiceSim) | - | 55.0% |
| 11 | wordFrequencySimilarity | - | 23.0% |
| 12 | DLD | - | 17.0% |
| 13 | tfidf (ScoreSim) | - | 13.0% |
Row 1 shows the performance (speed and accuracy) of the current configuration of the TF-IDF agent, using CosineSim as the similarity measure.
Row 7 shows that we can reach an accuracy of 62.0% just by removing the threshold (cosine similarity >= 0.00). However, removing the threshold alone makes the agent 2x slower, so I continued tuning the threshold, keeping the largest value that still produces 62.0% accuracy, which is 0.16, shown in row 4.
To further decrease the execution time while keeping the accuracy gain, I tuned some parameters of the TfidfVectorizer. Setting max_df to 0.10 (default is 1.0) keeps the accuracy at 62.0% but makes the agent 1.1x faster, as shown in row 3.
Why does decreasing the max_df value increase the speed? Because the vectorizer ignores all terms that appear in more than the max_df fraction of the documents (see docs), i.e., it drops the most frequent terms, so each document vector has fewer nonzero entries, making the cosine similarity cheaper to compute.
Why does decreasing the max_df value keep the accuracy high? My explanation is that terms appearing in most licenses do not help the algorithm distinguish between licenses; the rare terms are what set licenses apart, so they are enough for the algorithm to do a good job.
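To illustrate the effect (a toy corpus of my own, not the agent's real license data), max_df prunes every term whose document frequency exceeds that fraction, which shrinks the vocabulary and therefore the vectors:

```python
# Toy demonstration of how max_df prunes common terms from the vocabulary.
# The corpus below is illustrative, not atarashi's license dataset.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the software is provided as is without warranty",
    "the software is free and provided without warranty",
    "the program is distributed in the hope that it will be useful",
    "redistribution of the software is permitted under conditions",
]

full = TfidfVectorizer().fit(docs)
# max_df=0.5 ignores any term occurring in more than 50% of the documents
# (here "the", "is", "software"), so each vector has fewer nonzero entries.
pruned = TfidfVectorizer(max_df=0.5).fit(docs)

print(len(full.vocabulary_), len(pruned.vocabulary_))
```

The rarer, more discriminative terms (e.g. "warranty", "redistribution") survive the pruning, which is consistent with the accuracy staying at 62.0%.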
I will be opening a PR for you to reproduce the results in row 3 and merge the changes if you consider them relevant.
Important notes:
I've left out the timings for the other algorithms because I ran those experiments in a different environment, so a time comparison wouldn't be fair.
All the results differ from the last report I could find. I don't fully understand why some of them are so different; probably due to changes in the test files or in the algorithms. In any case, 62.0% is the new best result in both reports.
My findings may help improve other agents that use thresholds, such as Ngram.
This new state-of-atarashi performance 😅 may also raise the bar for future agent implementations, since it would be the new baseline.
That's a very detailed evaluation @xavierfigueroav . Thank you for providing the info.
Maybe, if you can provide a good overview of the baseline, we can put it on our wiki and use it to compare with different solutions (as you mentioned).