Scrape: retrieve video metadata
Scraping is the retrieval of video metadata from two web services: thetvdb.com and tmdb.com.
If you experience issues with the NOVA scraper, you can use the simple test program at https://github.com/nova-video-player/TestScraper to troubleshoot what is going wrong and propose enhancements.
In the code, ShowScraper2.java and MovieScraper2.java perform this task based on the file name, which is analysed after a pre-processing step that cleans it by rejecting known bad text patterns.
This cleaning process is performed in MovieDefaultMatcher.java.
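As a rough illustration of this kind of pre-processing, here is a minimal sketch of a file name cleaner. The class name `FilenameCleaner`, the method `clean`, and the list of rejected tokens are all hypothetical and chosen for the example; the patterns actually used by Nova live in MovieDefaultMatcher.java and are likely more extensive.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of file name cleaning; not the actual Nova implementation. */
class FilenameCleaner {
    // Illustrative "bad" tokens only (resolutions, sources, codecs); an assumption for this example.
    private static final Pattern NOISE = Pattern.compile(
            "\\b(1080p|720p|2160p|bluray|webrip|hdtv|x264|x265)\\b",
            Pattern.CASE_INSENSITIVE);
    // A trailing extension of 2 to 4 alphanumeric characters, e.g. ".mkv".
    private static final Pattern EXTENSION = Pattern.compile("\\.[A-Za-z0-9]{2,4}$");
    // Dots and underscores commonly used instead of spaces.
    private static final Pattern SEPARATORS = Pattern.compile("[._]+");
    private static final Pattern SPACES = Pattern.compile("\\s+");

    /** Strips the extension, separators and noise tokens, then normalises whitespace. */
    static String clean(String fileName) {
        String s = EXTENSION.matcher(fileName).replaceFirst("");
        s = SEPARATORS.matcher(s).replaceAll(" ");
        s = NOISE.matcher(s).replaceAll(" ");
        return SPACES.matcher(s).replaceAll(" ").trim();
    }
}
```

With these example patterns, `FilenameCleaner.clean("White.Collar.720p.HDTV.x264.mkv")` yields `White Collar`, which is the kind of string that would then be sent to the search service.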
Supported languages are registered in BaseScraper2.java.
For TV shows, Nova has switched to the thetvdb-java library to retrieve metadata from thetvdb.com.
Changes made to the thetvdb.com backend on 02/11/2019 caused many scrape issues; see https://forums.thetvdb.com/viewtopic.php?f=122&t=60239
The following changes have been made to the TV show scrape process to provide better results. The search results returned by thetvdb are split into three categories:
1. shows without a valid poster
2. shows with a valid poster and a numeric slug
3. shows with a valid poster and a non-numeric slug
The list of shows is then ordered with category 3 first, then 2, then 1. Category 3 is further sorted by increasing Levenshtein distance between the pre-processed video file name and the show titles returned by thetvdb, so that the closest match comes first.
This ordering helps a lot for shows like https://www.thetvdb.com/search?query=white%20collar
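The categorisation and ordering described above can be sketched as follows. This is not the code from ShowScraper2.java: the `ShowRanking` class, the `Result` type and its fields are hypothetical stand-ins for a thetvdb search result, and comparing titles case-insensitively is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of the result ordering; not the actual Nova implementation. */
class ShowRanking {

    /** Hypothetical stand-in for one thetvdb search result. */
    static final class Result {
        final String title;
        final String slug;
        final boolean hasPoster;

        Result(String title, String slug, boolean hasPoster) {
            this.title = title;
            this.slug = slug;
            this.hasPoster = hasPoster;
        }
    }

    /** Classic dynamic-programming Levenshtein distance, keeping only two rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Orders results: poster and non-numeric slug first (closest title first), then numeric slug, then no poster. */
    static List<Result> rank(String cleanedName, List<Result> results) {
        List<Result> noPoster = new ArrayList<>();     // category 1
        List<Result> numericSlug = new ArrayList<>();  // category 2
        List<Result> best = new ArrayList<>();         // category 3
        for (Result r : results) {
            if (!r.hasPoster) noPoster.add(r);
            else if (r.slug.matches("\\d+")) numericSlug.add(r);
            else best.add(r);
        }
        // Assumption: titles are compared case-insensitively.
        String q = cleanedName.toLowerCase(Locale.ROOT);
        best.sort(Comparator.comparingInt(
                (Result r) -> levenshtein(q, r.title.toLowerCase(Locale.ROOT))));
        List<Result> ordered = new ArrayList<>(best);
        ordered.addAll(numericSlug);
        ordered.addAll(noPoster);
        return ordered;
    }
}
```

Only category 3 is sorted by distance here, as in the description; categories 2 and 1 keep the order in which the search returned them.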
Note that the Levenshtein distance can cause problems with multilingual search. For instance, in French "White Collar" is returned as "FBI: Duo très spécial", which obviously has a large Levenshtein distance from the file name. This problem has recently been solved by computing the Levenshtein distance against both the local-language and the English titles, selecting the show ID with the minimum distance, and then rematching the show title in the local language. A preferable mitigation would be for the thetvdb backend itself to sort the results by popularity. Such a request has been made to thetvdb here https://gitlab.thetvdb.com/site/thetvdb_api/issues/75 and there https://forums.thetvdb.com/viewtopic.php?f=17&t=60976
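The bilingual matching described above could look roughly like the sketch below. The `BilingualMatch` class, the `Candidate` type and the `pick` method are hypothetical names for this example, and it assumes both the local-language and English titles are already available for each show ID; how Nova actually obtains and pairs them is not shown here.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of the bilingual Levenshtein selection; not the actual Nova implementation. */
class BilingualMatch {

    /** Hypothetical pairing of one show ID with its local-language and English titles. */
    static final class Candidate {
        final int showId;
        final String localTitle;
        final String englishTitle;

        Candidate(int showId, String localTitle, String englishTitle) {
            this.showId = showId;
            this.localTitle = localTitle;
            this.englishTitle = englishTitle;
        }
    }

    /** Classic dynamic-programming Levenshtein distance, keeping only two rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Returns the candidate whose local-language or English title is closest
     * to the cleaned file name, or null if the list is empty.
     */
    static Candidate pick(String cleanedName, List<Candidate> candidates) {
        // Assumption: titles are compared case-insensitively.
        String q = cleanedName.toLowerCase(Locale.ROOT);
        Candidate best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Candidate c : candidates) {
            int local = levenshtein(q, c.localTitle.toLowerCase(Locale.ROOT));
            int english = levenshtein(q, c.englishTitle.toLowerCase(Locale.ROOT));
            int d = Math.min(local, english);
            if (d < bestDistance) {
                bestDistance = d;
                best = c;
            }
        }
        return best;
    }
}
```

Once a candidate is picked by its show ID, its `localTitle` is what would be displayed, so a file named "White Collar" can still end up labelled with the French title.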