Show related publications #69

porduna · 2014-04-16T07:33:48Z

It would be interesting to list 3-5 related publications, based on the tags.

A simple implementation might not be too difficult or inefficient. Basically, in 2 queries (assuming that we already have the list of tag_ids of the current publication):

from collections import defaultdict

# Query 1: group tag ids by publication id
tags_by_publication_id = defaultdict(set)
for publication_id, tag_id in PublicationTag.objects.values_list("publication_id", "tag_id"):
    tags_by_publication_id[publication_id].add(tag_id)

current_tag_ids = set(THE_TAG_IDS_OF_THE_CURRENT_PUBLICATION)

# Note: tags_by_publication_id also contains the current publication itself,
# which should be skipped once its id is at hand.
sorted_publications = []  # (publication_id, value)
for publication_id, set_of_tags in tags_by_publication_id.items():
    same_tags = set_of_tags.intersection(current_tag_ids)
    # This could be something more advanced:
    # a tag that appears twice might (or might not) have a higher value
    # than one that appears 10 times
    value = len(same_tags)
    sorted_publications.append((publication_id, value))

if sorted_publications:
    # Publications sharing the most tags come first
    sorted_publications.sort(key=lambda pair: pair[1], reverse=True)

    target_publication_ids = [publication_id for publication_id, value in sorted_publications[:5]]

    # Query 2: retrieve data from target_publication_ids
    # (filter() does not keep the ranking order, so re-sort if order matters)
    publications = Publication.objects.filter(id__in=target_publication_ids)
else:
    publications = []

porduna · 2014-04-16T07:34:26Z

Does anybody think the algorithm should be something more complex (e.g., counting also authors, or assigning different values to the different tags)?
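One way to read "assigning different values to the different tags" is to let rare tags count more than common ones. The sketch below is only an illustration of that idea, not something decided in this thread: the function name `weighted_overlap` and the log-based weight are assumptions, and it takes the same `tags_by_publication_id` mapping built above.

```python
import math
from collections import Counter


def weighted_overlap(current_tag_ids, tags_by_publication_id):
    """Score each publication by the summed rarity of the tags it shares.

    Hypothetical helper: the weighting formula is an assumption.
    """
    total = len(tags_by_publication_id)
    # Count how many publications use each tag
    usage = Counter(tag_id
                    for tag_ids in tags_by_publication_id.values()
                    for tag_id in tag_ids)
    scores = {}
    for publication_id, tag_ids in tags_by_publication_id.items():
        shared = tag_ids & current_tag_ids
        # A tag used by few publications contributes more than a common one
        scores[publication_id] = sum(
            math.log(1.0 + float(total) / usage[tag_id]) for tag_id in shared)
    return scores
```

With this weighting, two publications that each share a single tag with the current one are no longer tied: the one sharing the rarer tag scores higher.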

aitoralmeida · 2014-04-16T07:41:22Z

We can put the authors and tags in a set and compute the Jaccard index (http://en.wikipedia.org/wiki/Jaccard_index) between the papers. It's also easy to implement.

porduna · 2014-04-16T08:08:21Z

If I understand it, the difference is that instead of doing:

     same_tags = set_of_tags.intersection(current_tag_ids)
     # This could be something more advanced:
     # One tag that appears twice might (or might not) have a higher value
     # than one that appears 10 times
     value = len(same_tags)

We do:

     intersection = set_of_tags.intersection(current_tag_ids)
     union = set_of_tags.union(current_tag_ids)
     value = float(len(intersection)) / len(union)

Is that right?

aitoralmeida · 2014-04-16T09:13:19Z

Yes, but to also take the authors into account we can add their IDs to the set (used_set = author_ids + tag_ids).
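A minimal sketch of that combined set, assuming tag and author ids are plain integers. Since a tag and an author could then share the same id, each id is prefixed with its kind here. The helper names `jaccard` and `feature_set` and the sample ids are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index |a & b| / |a | b|; taken as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return float(len(a & b)) / len(union)


def feature_set(tag_ids, author_ids):
    # Prefix each id with its kind so tag 3 and author 3 stay distinct
    return {("tag", t) for t in tag_ids} | {("author", a) for a in author_ids}


# Made-up ids, only to show the calls
current = feature_set([1, 2, 3], [7])
other = feature_set([2, 3, 4], [7, 8])
value = jaccard(current, other)  # 3 shared items out of 6 distinct ones
```

Unlike the plain count of shared tags, this value is normalised to the range 0 to 1, so a publication with many tags does not win just by having more chances to match.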

porduna · 2014-04-16T10:05:32Z

I'm thinking that maybe I'll implement several methods and expose them through query parameters (e.g., publications/<publication_slug>/?related=withauthors&related_method=jaccard), without showing related papers by default (only when those parameters are given). Then we can evaluate them with real publications and see which options work better before tuning it.
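The parameter handling could look roughly like the sketch below. Only `related=withauthors` and `related_method=jaccard` come from the example URL. The `overlap` method name, the `METHODS` table and the `related_options` function are assumptions, and `params` stands for any dict-like object such as Django's `request.GET`.

```python
def overlap(a, b):
    # Number of shared items, as in the first proposal
    return len(a & b)


def jaccard(a, b):
    union = a | b
    return float(len(a & b)) / len(union) if union else 0.0


# Hypothetical registry of the scoring methods under evaluation
METHODS = {"overlap": overlap, "jaccard": jaccard}


def related_options(params):
    """Return (score_function, use_authors), or None if no known method is requested."""
    method = params.get("related_method")
    if method not in METHODS:
        # No related papers are shown unless a method is asked for explicitly
        return None
    use_authors = params.get("related") == "withauthors"
    return METHODS[method], use_authors
```

Returning None when the parameter is missing or unknown keeps the related list hidden from ordinary visitors while the methods are being compared.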

aitoralmeida · 2014-06-20T16:31:21Z

We can do something similar to what we have done for the related persons in #78.

aitoralmeida added statistics/visualization labels Jun 20, 2014

aitoralmeida added low and removed high labels Jan 22, 2015
