Week 05 and 06

AlvaroJoseLopes edited this page Aug 9, 2023 · 2 revisions

TL;DR

During weeks 5 and 6, I wrapped up the data enriching step for LastFM and started coding the framework. I was able to:

  • Implement the enriching step of LastFM
  • Implement the methods needed to convert the standard .csv files to a heterogeneous network, using NetworkX.

Enriching LastFM

This time I used a different approach to find the most useful properties for LastFM: finding the most common DBpedia properties among the item resources (musical artists and bands).

After choosing the most important properties, a SPARQL query was used to retrieve them for each item in the dataset.

Finding the (probably) most useful properties

A query counting property occurrences was built to find the most common properties among the artists/bands. It returns each property, the type of the resource it was found on, and the number of occurrences of that property across the specified resources. Because the same resource can be classified under more than one type, only the types "MusicalArtist" and "Band" from the DBpedia ontology are considered.

The query, limited to two example resources, is given below:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?property ?type (COUNT(?resource) AS ?count)
WHERE {
    VALUES ?type { dbo:MusicalArtist dbo:Band }
    {
        <http://dbpedia.org/resource/The_Jam> ?property ?value .
        <http://dbpedia.org/resource/The_Jam> rdf:type ?type .
        BIND(<http://dbpedia.org/resource/The_Jam> AS ?resource)
    }
    UNION
    {
        <http://dbpedia.org/resource/Edgar_Froese> ?property ?value .
        <http://dbpedia.org/resource/Edgar_Froese> rdf:type ?type .
        BIND(<http://dbpedia.org/resource/Edgar_Froese> AS ?resource)
    }
    # UNION
    # {
    #     same pattern for each remaining resource...
    # }
    # UNION
    # ...
}
GROUP BY ?type ?property

The complete script to retrieve this information can be found in this gist.
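
Since the query repeats the same pattern once per resource, it can be generated from a list of URIs. The snippet below is only a minimal sketch of that idea, not the script from the gist: the function name `build_count_query` and the `BLOCK` template are hypothetical, and the class IRIs `dbo:MusicalArtist` and `dbo:Band` are assumed to be the intended DBpedia ontology types.

```python
from textwrap import indent

# One UNION branch per resource; doubled braces are literal braces for str.format.
BLOCK = """{{
    <{uri}> ?property ?value .
    <{uri}> rdf:type ?type .
    BIND(<{uri}> AS ?resource)
}}"""


def build_count_query(uris):
    """Return a property-count query with one UNION branch per resource URI."""
    branches = "\nUNION\n".join(BLOCK.format(uri=u) for u in uris)
    return (
        "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "SELECT ?property ?type (COUNT(?resource) AS ?count)\n"
        "WHERE {\n"
        # Assumption: these are the ontology classes the report refers to.
        "    VALUES ?type { dbo:MusicalArtist dbo:Band }\n"
        + indent(branches, "    ")
        + "\n}\nGROUP BY ?type ?property"
    )


query = build_count_query([
    "http://dbpedia.org/resource/The_Jam",
    "http://dbpedia.org/resource/Edgar_Froese",
])
print(query)
```

Very long URI lists would probably need to be split into batches, since SPARQL endpoints usually limit the query size.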

Enrich query

In the case of the LastFM dataset, the most common properties are related to music genre, record label, awards, associated artists/bands and more. The template query is:

PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX dbp:  <http://dbpedia.org/property/>
PREFIX dbr:  <http://dbpedia.org/resource/>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT
    ?abstract
    (GROUP_CONCAT(DISTINCT ?bandMember; SEPARATOR="::") AS ?bandMember)
    (GROUP_CONCAT(DISTINCT ?genre; SEPARATOR="::") AS ?genre)
    (GROUP_CONCAT(DISTINCT ?associatedMusicalArtist; SEPARATOR="::") AS ?associatedMusicalArtist)
    (GROUP_CONCAT(DISTINCT ?awards; SEPARATOR="::") AS ?awards)
    (GROUP_CONCAT(DISTINCT ?recordLabel; SEPARATOR="::") AS ?recordLabel)
    (GROUP_CONCAT(DISTINCT ?associatedBand; SEPARATOR="::") AS ?associatedBand)
    (GROUP_CONCAT(DISTINCT ?origin; SEPARATOR="::") AS ?origin)
WHERE {
    {
        OPTIONAL { <$URI>   dbo:genre                     ?genre                   }   .
        OPTIONAL { <$URI>   dbo:abstract                  ?abstract                }   .
        OPTIONAL { <$URI>   dbp:origin                    ?origin                  }   .
        OPTIONAL { <$URI>   dbo:recordLabel               ?recordLabel             }   .
        OPTIONAL { <$URI>   dbo:bandMember                ?bandMember              }   .
        OPTIONAL { <$URI>   dbo:associatedMusicalArtist   ?associatedMusicalArtist }   .
        OPTIONAL { <$URI>   dbo:associatedBand            ?associatedBand          }   .
        OPTIONAL { <$URI>   dbp:awards                    ?awards                  }   .

        FILTER(LANG(?abstract) = 'en')
    }
    UNION
    {
        <$URI> dbo:wikiPageRedirects ?uri .
        OPTIONAL { ?uri   dbo:genre                     ?genre                   }   .
        OPTIONAL { ?uri   dbo:abstract                  ?abstract                }   .
        OPTIONAL { ?uri   dbp:origin                    ?origin                  }   .
        OPTIONAL { ?uri   dbo:recordLabel               ?recordLabel             }   .
        OPTIONAL { ?uri   dbo:bandMember                ?bandMember              }   .
        OPTIONAL { ?uri   dbo:associatedMusicalArtist   ?associatedMusicalArtist }   .
        OPTIONAL { ?uri   dbo:associatedBand            ?associatedBand          }   .
        OPTIONAL { ?uri   dbp:awards                    ?awards                  }   .

        FILTER(LANG(?abstract) = 'en')
    }
} 

Following redirects is crucial: some resource URIs in the mapping point to DBpedia redirect pages, which typically carry no properties of their own. The second UNION branch follows dbo:wikiPageRedirects to the target resource, so the intended properties are still retrieved for resources that would otherwise come back empty.
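
To illustrate how a `$URI` placeholder can be filled for each item, here is a minimal sketch using Python's `string.Template`. It is an assumption that the template is filled this way; the shortened query keeps a single property for brevity, and the names `ENRICH_TEMPLATE`, `fill_template` and `split_grouped` are hypothetical.

```python
from string import Template

# Shortened stand-in for the enrich template above (one property only).
ENRICH_TEMPLATE = Template("""
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT (GROUP_CONCAT(DISTINCT ?genre; SEPARATOR="::") AS ?genres)
WHERE {
    { <$URI> dbo:genre ?genre . }
    UNION
    { <$URI> dbo:wikiPageRedirects ?uri . ?uri dbo:genre ?genre . }
}
""")


def fill_template(uri):
    """Replace every $URI placeholder with the resource URI."""
    return ENRICH_TEMPLATE.substitute(URI=uri)


def split_grouped(value, sep="::"):
    """Turn a GROUP_CONCAT result back into a list of values."""
    return [v for v in value.split(sep) if v] if value else []


query = fill_template("http://dbpedia.org/resource/The_Jam")
print(query)
print(split_grouped("http://dbpedia.org/resource/Punk_rock::http://dbpedia.org/resource/Mod_revival"))
```

The "::" separator is the one used by the GROUP_CONCAT calls in the template query, so the same value has to be used when splitting the results.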

The resulting enriched dataset has the following statistics:

  • number of entities with the property item_id: 11783 (100.00%)
  • number of entities with the property abstract: 11007 (93.41%)
  • number of entities with the property bandMember: 2444 (20.74%)
  • number of entities with the property genre: 8718 (73.99%)
  • number of entities with the property associatedMusicalArtist: 3919 (33.26%)
  • number of entities with the property awards: 146 (1.24%)
  • number of entities with the property recordLabel: 7238 (61.43%)
  • number of entities with the property associatedBand: 3919 (33.26%)
  • number of entities with the property origin: 7069 (59.99%)
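
Statistics of this kind can be computed by counting the non-empty values of each column in the enriched .csv. The sketch below shows one way to do it with the standard library only; the function name `property_coverage` is hypothetical and the rows are toy data, not the real dataset.

```python
import csv
import io


def property_coverage(rows, columns):
    """Return {column: (non-empty count, percentage of all rows)}."""
    total = len(rows)
    stats = {}
    for col in columns:
        filled = sum(1 for row in rows if (row.get(col) or "").strip())
        pct = 100.0 * filled / total if total else 0.0
        stats[col] = (filled, pct)
    return stats


# Toy stand-in for the enriched file; values are made up for illustration.
sample = io.StringIO(
    "item_id,abstract,genre\n"
    "1,Some text,Rock::Punk\n"
    "2,,Electronic\n"
    "3,Another text,\n"
    "4,,\n"
)
rows = list(csv.DictReader(sample))
stats = property_coverage(rows, ["item_id", "abstract", "genre"])
for name, (count, pct) in stats.items():
    print(f"number of entities with the property {name}: {count} ({pct:.2f}%)")
```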

Converting datasets to a heterogeneous network

The objective of the framework is to enable users to easily configure an entire experiment pipeline. For example, with the .yaml file below the user could load an enriched MovieLens dataset:

experiment:
  dataset: 
    name: ml-100k
    item: 
      path: datasets/ml-100k/processed/item.csv 
      extra_features: [movie_year, movie_title] 
    user: 
      path: datasets/ml-100k/processed/user.csv 
      extra_features: [gender, occupation] 
    ratings: 
      path: datasets/ml-100k/processed/rating.csv 
      timestamp: True
    enrich:
      map_path: datasets/ml-100k/processed/map.csv
      enrich_path: datasets/ml-100k/processed/enriched.csv
      remove_unmatched: True
      properties:
        - type: subject
          grouped: True
          sep: "::"
        - type: director
          grouped: True
          sep: "::"

Let's break down the main directives for the dataset:

  • item: specifies the item info to be added to the network. (mandatory)
    • path: filepath of the standardized item.csv. (mandatory)
    • extra_features: By default, the only column added is the item_id. With a list of column names, the user can specify additional features to be added as property nodes. (optional)
  • user: specifies the user info. (mandatory)
    • path: filepath of the standardized user.csv. (mandatory)
    • extra_features: By default, the only column added is the user_id. With a list of column names, the user can specify additional features to be added as property nodes. (optional)
  • ratings: specifies the ratings info. (mandatory)
    • path: filepath of the standardized ratings.csv. (mandatory)
    • timestamp: boolean that indicates if the column timestamp is present.
  • enrich: specifies the enriched info. (mandatory)
    • map_path: filepath of the standardized map.csv. (mandatory)
    • enrich_path: filepath of the standardized enriched.csv. (mandatory)
    • remove_unmatched: boolean to specify if nodes unmatched with DBpedia should be removed. (mandatory)
    • properties: list of properties to enrich the dataset (mandatory)
      • type: column name (type) of the property (mandatory)
      • grouped: boolean that indicates whether the property values were grouped and concatenated into a single string. Used when a resource has multiple values for the same property type. (mandatory)
      • sep: separator used to concatenate a list of property values. (optional)

The PyYAML library was used to deserialize the .yaml file and convert the configuration into a Python dictionary. With the configuration provided by the user, it's possible to load and convert the data into a NetworkX graph as specified.
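
Once the file is deserialized, the mandatory directives listed above can be checked before any data is loaded. The following is a minimal sketch of such a check, not the framework's actual code: `validate_dataset_config` and `MANDATORY` are hypothetical names, and the dictionary is written out by hand in the shape that `yaml.safe_load` would be expected to return for the `dataset` section.

```python
# Mandatory keys per section, taken from the directive list above.
MANDATORY = {
    "item": ["path"],
    "user": ["path"],
    "ratings": ["path"],
    "enrich": ["map_path", "enrich_path", "remove_unmatched", "properties"],
}


def validate_dataset_config(dataset):
    """Return a list of problems; an empty list means nothing is missing."""
    problems = []
    for section, keys in MANDATORY.items():
        if section not in dataset:
            problems.append(f"missing section: {section}")
            continue
        for key in keys:
            if key not in dataset[section]:
                problems.append(f"missing key: {section}.{key}")
    for i, prop in enumerate(dataset.get("enrich", {}).get("properties", [])):
        for key in ("type", "grouped"):
            if key not in prop:
                problems.append(f"missing key: enrich.properties[{i}].{key}")
    return problems


# Hand-written equivalent of the parsed `dataset` section (assumption).
dataset = {
    "name": "ml-100k",
    "item": {"path": "datasets/ml-100k/processed/item.csv"},
    "user": {"path": "datasets/ml-100k/processed/user.csv"},
    "ratings": {"path": "datasets/ml-100k/processed/rating.csv", "timestamp": True},
    "enrich": {
        "map_path": "datasets/ml-100k/processed/map.csv",
        "enrich_path": "datasets/ml-100k/processed/enriched.csv",
        "remove_unmatched": True,
        "properties": [{"type": "subject", "grouped": True, "sep": "::"}],
    },
}
print(validate_dataset_config(dataset))
```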

Converting to a network

All Recommender System datasets will be modeled as a heterogeneous network, with nodes of type UserNode, ItemNode, and PropertyNode. A rating from a user to an item will be represented as an edge between those nodes (the timestamp of the rating is a property of this edge). An edge between an item and a property will indicate that the item has that property. Finally, an edge between two different users will indicate a social link between them.

The class Graph is a wrapper on top of nx.Graph() that receives the dataset configuration and converts the data as specified.

The following image is a sample of the network specified above:

[Image: network example]
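
The modeling described in this section can be sketched directly with NetworkX. This is an illustration under stated assumptions rather than the Graph class itself: the function `build_graph`, the tuple-based node identifiers and the `type` node attribute are all hypothetical choices, and the input rows are toy data.

```python
import networkx as nx


def build_graph(ratings, item_props, social_links, sep="::"):
    """Build a heterogeneous graph with user, item and property nodes."""
    g = nx.Graph()
    # user -- item edges carry the rating and its timestamp
    for user, item, rating, timestamp in ratings:
        g.add_node(("user", user), type="UserNode")
        g.add_node(("item", item), type="ItemNode")
        g.add_edge(("user", user), ("item", item), rating=rating, timestamp=timestamp)
    # item -- property edges; grouped values are split on the separator
    for item, prop_type, grouped_value in item_props:
        g.add_node(("item", item), type="ItemNode")
        for value in grouped_value.split(sep):
            g.add_node((prop_type, value), type="PropertyNode")
            g.add_edge(("item", item), (prop_type, value))
    # user -- user edges represent social links
    for u, v in social_links:
        g.add_node(("user", u), type="UserNode")
        g.add_node(("user", v), type="UserNode")
        g.add_edge(("user", u), ("user", v))
    return g


# Toy data for illustration only.
g = build_graph(
    ratings=[(1, 10, 4.0, 874965758), (2, 10, 5.0, 875072484)],
    item_props=[(10, "director", "A::B")],
    social_links=[(1, 2)],
)
print(g.number_of_nodes(), g.number_of_edges())
print(g.edges[("user", 1), ("item", 10)])
```

Prefixing each node identifier with its kind keeps a user id and an item id with the same number from collapsing into a single node.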

Next Steps

Analyze the most used pre-processing, filtering, and splitting methods, then implement them.
