Written for NTU CE4034/CZ4034 Information Retrieval. Tabulates tf-idf results based on typical tutorial/examination question formats.
When I made this commit, we've already submitted the Past Year Paper solutions, nonetheless, if you visit this space, you're in luck because there's another tool in this space on top of the TF-IDF calculator. I also made an Excel sheet that performs Levenshtein Calculation with Insertion, Deletion & Substitution cost modifiable. Refer to AY2021 Semester 2 Examination Solutions for explanation on Levenshtein equation.
This repository takes in terms, documents, its term frequencies, and a query, to produce the tf-idf values of each document, as well as the Cosine Similarity & Euclidean Distance to the query based on the SMART notation given. The SMART features implemented in this script is limited to those that are covered in CX4034 . They are:
Term Frequency | Doc Frequency | Normalisation |
---|---|---|
n: natural | n: natural | n: natural |
l: logarithmic | t: inversed doc freq | c: cosine normalisation |
a: augmented |
Feel free to implement your own features should you require to.
Before you begin, ensure that you have pandas & numpy installed on your python environment. If you do not, execute pip install -r requirements.txt
. There is two ways for you to run the script; 1) Directly through .py
file, or 2) Through Jupyter Notebook.
- Input your documents & term frequencies in
tfidf-input.xlsx
. Please do not stray from the existing format. - Execute
python tfidfgen.py
, or opentfidfgen.ipynb
on Jupyter Notebook and run the cells in order. - You will be prompted to input the SMART notation (i.e. ntc.atc).
- You will be prompted to input the query. Note that if query contain terms that does not exist within
tfidf-input.xlsx
, a WARNING will be printed. - Cosine Similarity & Euclidean Distance will be printed, and the tf-idf values will be exported to
tfidf-results.xlsx
There are example values tfidf-input.xlsx
and tfidf-results.xlsx
, based on AY1819 Semester 2 Examination Question.
Example Input
Example Output
- I understand the code is not the cleanest, neither is it optimised, it was a pet project written in 1 day so please be forgiving.
- For Term Frequency (Logarithmic), some PYP uses
1+log(tf)
while others uselog(tf+1)
. You may switch between either by uncommenting and commenting the appropriate lines inif tfnotation =='l':
case insidetfCalc()
function.