I've been playing around with some parameters of the TF-IDF agent.
I've found that if we stop using a threshold (cosine similarity >= 0.30) to filter the match results, the accuracy improves by up to 3 points. However, filtering helps reduce the compute time, since the results are sorted at the end of the search. See the piece of code I am talking about in atarashi/agents/tfidf.py (especially lines 126 and 133):
```python
print("time taken is " + str(time.time() - startTime) + " sec")
return matches
```
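For context, the matching step under discussion can be sketched roughly as below. This is a hypothetical, simplified version (the license corpus, function name, and structure are my own illustration, not the agent's actual code), showing how a similarity threshold trades accuracy for speed: a lower threshold keeps more candidates to sort.

```python
# Hypothetical sketch of TF-IDF matching with a similarity threshold.
# Corpus, names, and structure are illustrative, not atarashi's real code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

licenses = {
    "MIT": "permission is hereby granted free of charge to any person",
    "GPL": "this program is free software you can redistribute it",
    "BSD": "redistribution and use in source and binary forms",
}

def match_license(query, threshold=0.30):
    names = list(licenses)
    vectorizer = TfidfVectorizer()
    # Fit on the license corpus plus the query so they share one vocabulary.
    matrix = vectorizer.fit_transform([licenses[n] for n in names] + [query])
    sims = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    # Keep only matches at or above the threshold (0.0 keeps everything).
    matches = [(n, s) for n, s in zip(names, sims) if s >= threshold]
    # Sorting happens at the end, so a lower threshold means more
    # candidates survive to this step, which costs extra time.
    matches.sort(key=lambda m: m[1], reverse=True)
    return matches

print(match_license("free software redistribute", threshold=0.0))
```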
Using the evaluation.py script, I've carried out some experiments:
| # | Algorithm | Time elapsed (s) | Accuracy |
|---|-----------|------------------|----------|
| 1 | tfidf (CosineSim) (thr=0.30) | 30.19 | 59.0% |
| 2 | tfidf (CosineSim) (thr=0.17) | 35.29 | 61.0% |
| 3 | tfidf (CosineSim) (thr=0.16, max_df=0.10) | 27.34 | 62.0% |
| 4 | tfidf (CosineSim) (thr=0.16) | 36.42 | 62.0% |
| 5 | tfidf (CosineSim) (thr=0.15) | 38.45 | 62.0% |
| 6 | tfidf (CosineSim) (thr=0.10) | 39.91 | 62.0% |
| 7 | tfidf (CosineSim) (thr=0.00) | 61.49 | 62.0% |
| 8 | Ngram (CosineSim) | - | 57.0% |
| 9 | Ngram (BigramCosineSim) | - | 56.0% |
| 10 | Ngram (DiceSim) | - | 55.0% |
| 11 | wordFrequencySimilarity | - | 23.0% |
| 12 | DLD | - | 17.0% |
| 13 | tfidf (ScoreSim) | - | 13.0% |
Row 1 shows the performance (speed and accuracy) of the current configuration of the TF-IDF agent, using CosineSim as the similarity measure.
Row 7 shows that we can reach an accuracy of 62.0% just by removing the threshold (cosine similarity >= 0.00). However, removing the threshold alone makes the agent 2x slower, so I continued tuning the threshold, keeping the largest value that still produces 62.0% accuracy, which is 0.16, shown in row 4.
To further decrease the execution time while keeping the accuracy gain, I tuned some parameters of the TfidfVectorizer. Setting max_df to 0.10 (default is 1.0) keeps the accuracy at 62.0% but makes the agent 1.1x faster, as shown in row 3.
Why does decreasing the max_df value increase the speed? Because the vectorizer ignores all terms that appear in more than the max_df fraction of the documents (see docs), i.e., it drops the most frequent terms, so each document vector has fewer nonzero entries, making the cosine similarity cheaper to compute.
Why does decreasing the max_df value keep the accuracy high? My explanation is that terms appearing in most licenses do not help the algorithm distinguish between licenses; the rare terms are what set licenses apart, so they are enough for the algorithm to do a good job.
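To illustrate the effect (a toy corpus of my own, not the agent's real license data), max_df prunes every term whose document frequency exceeds that fraction, which shrinks the vocabulary and therefore the vectors:

```python
# Toy demonstration of how max_df prunes common terms from the vocabulary.
# The corpus below is illustrative, not atarashi's license dataset.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the software is provided as is without warranty",
    "the software is free and provided without warranty",
    "the program is distributed in the hope that it will be useful",
    "redistribution of the software is permitted under conditions",
]

full = TfidfVectorizer().fit(docs)
# max_df=0.5 ignores any term occurring in more than 50% of the documents
# (here "the", "is", "software"), so each vector has fewer nonzero entries.
pruned = TfidfVectorizer(max_df=0.5).fit(docs)

print(len(full.vocabulary_), len(pruned.vocabulary_))
```

The rarer, more discriminative terms (e.g. "warranty", "redistribution") survive the pruning, which is consistent with the accuracy staying at 62.0%.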
I will be opening a PR for you to reproduce the results in row 3 and merge the changes if you consider them relevant.
Important notes:
I've left out the timings for the other algorithms because I ran those experiments in a different environment, so a time comparison wouldn't be fair.
All the results differ from the last report I could find. I don't fully understand why some of them are so different; probably due to changes in the test files or in the algorithms. In any case, 62.0% is the new best result in both reports.
My findings may help improve other agents that use thresholds, such as Ngram.
This new state-of-atarashi performance 😅 may also raise the bar for future agent implementations, since it would be the new baseline.
That's a very detailed evaluation @xavierfigueroav . Thank you for providing the info.
Maybe, if you can provide a good overview of the baseline, we can put it on our wiki and use it to compare with different solutions (as you mentioned).