Show related publications #69

Open
porduna opened this issue Apr 16, 2014 · 6 comments
@porduna
Collaborator

porduna commented Apr 16, 2014

It would be interesting to list 3-5 related publications, based on the tags.

A simple implementation might not be too difficult or inefficient. Basically, in 2 queries (assuming that we already have the list of tag_ids of the current publication):

from collections import defaultdict

# Query 1: group tag ids by publication identifier
tags_by_publication_id = defaultdict(set)
for publication_id, tag_id in PublicationTag.objects.values_list("publication_id", "tag_id"):
    tags_by_publication_id[publication_id].add(tag_id)

current_tag_ids = set(THE_TAG_IDS_OF_THE_CURRENT_PUBLICATION)

sorted_publications = []  # (publication_id, value)
for publication_id, set_of_tags in tags_by_publication_id.items():
    # Note: this loop also scores the current publication itself
    same_tags = set_of_tags.intersection(current_tag_ids)
    # This could be something more advanced:
    # One tag that appears twice might (or might not) have a higher value
    # than one that appears 10 times
    value = len(same_tags)
    sorted_publications.append((publication_id, value))

if sorted_publications:
    # Highest number of shared tags first
    sorted_publications.sort(key=lambda pair: pair[1], reverse=True)

    target_publication_ids = [publication_id for publication_id, value in sorted_publications[:5]]

    # Query 2: retrieve data from target_publication_ids
    # (id__in does not preserve the ranking order)
    publications = Publication.objects.filter(id__in=target_publication_ids)
else:
    publications = []
@porduna
Collaborator Author

porduna commented Apr 16, 2014

Does anybody think the algorithm should be something more complex (e.g., counting also authors, or assigning different values to the different tags)?
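
One way to assign different values to the different tags would be to weight each shared tag by how rare it is, so that sharing an uncommon tag counts more than sharing one attached to most publications. A minimal sketch in plain Python, assuming a `tags_by_publication_id` mapping like the one above; the function name `weighted_tag_score` and the log-based weight are illustrative assumptions, not something decided in this issue:

```python
import math
from collections import Counter


def weighted_tag_score(current_tag_ids, tags_by_publication_id):
    """Score each publication by the summed rarity weight of the tags it shares.

    Hypothetical helper: the weight log(N / count) is an assumption
    (an inverse-frequency style weight), not the project's method.
    """
    total = len(tags_by_publication_id)
    # How many publications carry each tag
    tag_counts = Counter(
        tag_id for tags in tags_by_publication_id.values() for tag_id in tags
    )
    current = set(current_tag_ids)
    scores = {}
    for publication_id, tags in tags_by_publication_id.items():
        shared = tags & current
        # A tag carried by every publication gets weight log(1) = 0
        scores[publication_id] = sum(
            math.log(float(total) / tag_counts[tag_id]) for tag_id in shared
        )
    return scores


# Toy data: tag 1 is on every publication, tag 2 is rarer
example = {10: {1, 2}, 11: {1}, 12: {1, 2, 3}, 13: {1}}
scores = weighted_tag_score({1, 2}, example)
```

With this toy data, publications 10 and 12 (which share the rarer tag 2) score higher than 11 and 13, which only share the tag that everything has.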

@aitoralmeida
Collaborator

We can put the authors and tags in a set and compute the Jaccard index (http://en.wikipedia.org/wiki/Jaccard_index) between the papers. It's also easy to implement.

@porduna
Collaborator Author

porduna commented Apr 16, 2014

If I understand it, the difference is that instead of doing:

     same_tags = set_of_tags.intersection(current_tag_ids)
     # This could be something more advanced:
     # One tag that appears twice might (or might not) have a higher value
     # than one that appears 10 times
     value = len(same_tags)

We do:

     intersection = set_of_tags.intersection(current_tag_ids)
     union = set_of_tags.union(current_tag_ids)
     value = float(len(intersection)) / len(union) if union else 0.0

Is that right?

@aitoralmeida
Collaborator

Yes, but to also take the authors into account, we can add their IDs to the set (used_set = author_ids + tag_ids).
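
A minimal sketch of that idea in plain Python. Since an author ID and a tag ID could have the same numeric value, the sketch labels each ID with its kind before merging them into one set; that labelling, and the names `features` and `jaccard`, are assumptions for illustration:

```python
def features(author_ids, tag_ids):
    """Merge author and tag ids into one set, keeping the two kinds apart."""
    return {("author", a) for a in author_ids} | {("tag", t) for t in tag_ids}


def jaccard(set_a, set_b):
    """Jaccard index: size of the intersection divided by size of the union.

    Defined here as 0.0 when both sets are empty (an assumption).
    """
    union = set_a | set_b
    if not union:
        return 0.0
    return float(len(set_a & set_b)) / len(union)


# Author 1 and tag 1 stay distinct thanks to the kind label
current = features(author_ids=[1, 2], tag_ids=[1, 5])
other = features(author_ids=[2], tag_ids=[5, 7])
value = jaccard(current, other)
```

Here the two publications share one author and one tag out of five distinct items in total.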

@porduna
Collaborator Author

porduna commented Apr 16, 2014

I'm thinking that maybe I'll implement several options and expose them as query parameters (e.g., publications/<publication_slug>/?related=withauthors&related_method=jaccard), and not even show related papers by default (only when those parameters are given). Then we can evaluate all this with real publications and see which options work better, in order to tune it.
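
A rough sketch of how those query options could select the scoring method, in plain Python so it stays independent of the view code. The parameter names `related` and `related_method` come from the example URL above; the `overlap` method name, the function names, the dict layout of a publication, and the choice to return nothing when neither parameter is present are all assumptions:

```python
def overlap(set_a, set_b):
    """Number of shared items."""
    return len(set_a & set_b)


def jaccard(set_a, set_b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = set_a | set_b
    return float(len(set_a & set_b)) / len(union) if union else 0.0


METHODS = {"overlap": overlap, "jaccard": jaccard}


def related_ids(params, current, candidates, limit=5):
    """Rank candidate publication ids according to the query parameters.

    params: dict-like of query parameters.
    current: dict with "tags" and "authors" sets.
    candidates: dict mapping publication id to such a dict.
    """
    if "related" not in params and "related_method" not in params:
        return []  # nothing is shown unless explicitly requested
    with_authors = params.get("related") == "withauthors"
    score = METHODS.get(params.get("related_method", "overlap"), overlap)

    def features(publication):
        items = {("tag", t) for t in publication["tags"]}
        if with_authors:
            items |= {("author", a) for a in publication["authors"]}
        return items

    current_features = features(current)
    scored = [
        (score(current_features, features(publication)), publication_id)
        for publication_id, publication in candidates.items()
    ]
    # Drop publications with nothing in common, best score first
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [publication_id for value, publication_id in scored[:limit]]


# Toy data for trying the options side by side
current = {"tags": {1, 2}, "authors": {7}}
candidates = {
    10: {"tags": {1, 2, 3, 4, 5}, "authors": set()},
    11: {"tags": {1}, "authors": {7}},
    12: {"tags": {9}, "authors": set()},
}
by_overlap = related_ids({"related_method": "overlap"}, current, candidates)
by_jaccard = related_ids({"related_method": "jaccard"}, current, candidates)
```

On this toy data the two methods rank publications 10 and 11 in opposite orders, which is the kind of difference the evaluation would need to look at.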

@aitoralmeida
Collaborator

We can do something similar to what we have done for the related persons in #78.
