Data quality
Note: the text from the previous version of this page has been moved to the discussion page for now.
This page covers data quality and quality assurance (QA) processing and information for data on CKAN (usually as encapsulated by Resources).
Examples:
- Does the resource exist (i.e. the URL does not return a 404), and is the API up?
- Does it conform to a schema (if it has one)?
The CKAN QA extension can be used to provide basic QA information about each resource in a CKAN instance (see the Readme file in the repository for installation instructions). For each resource, the following data is currently calculated:
- openness score: A number between 0 and 5 (inclusive) based on Tim Berners-Lee's five-star scheme. For more information on this see http://lab.linkeddata.deri.ie/2010/star-scheme-by-example
- openness score reason: A string of text giving the reason that the particular star rating was chosen.
- failure count: The number of consecutive times that a given resource has received a score of 0.
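As a minimal sketch of how the failure count could be maintained, assuming the counter resets to 0 as soon as a resource receives a non-zero score (the word "consecutive" suggests this, but the reset rule is an assumption, as is the hypothetical function name):

```python
def next_failure_count(previous_count, openness_score):
    """Return the updated number of consecutive zero-score QA runs.

    Assumption: any non-zero score breaks the streak and resets the count.
    """
    if openness_score == 0:
        return previous_count + 1
    return 0
```

For example, a resource that has failed twice and fails again would move to a count of 3, while a single successful run would bring it back to 0.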
The current process for calculating an openness score rating is as follows:
- A HEAD request is sent to the resource URL.
  - If this fails for any reason, the resource is given an openness score of 0.
- Next, we try to determine the content type of the resource.
  - First, we try to guess the MIME type of the CKAN resource object from its file extension.
  - If this fails, we try to read the Content-Type header from the HEAD response.
  - If this also fails, we finally try to read the format field of the CKAN resource object.
  - If all three attempts fail, the resource is given a score of 0.
- If we have a content type, the openness score is calculated from the following table:
Openness Score | Content Type |
---|---|
1 | text/plain |
1 | text |
1 | txt |
2 | application/vnd.ms-excel |
2 | application/vnd.ms-excel.sheet.binary.macroenabled.12 |
2 | application/vnd.ms-excel.sheet.macroenabled.12 |
2 | application/vnd.openxmlformats-officedocument.spreadsheet.sheet |
2 | xls |
3 | text/csv |
3 | application/json |
3 | text/xml |
3 | csv |
3 | xml |
3 | json |
4 | application/rdf+xml |
4 | rdf |
- The openness score reason is currently one of the following messages:
Openness Score | Openness Score Reason |
---|---|
0 | unrecognised content type |
0 | not obtainable |
1 | obtainable via web page |
2 | machine readable format |
3 | open and standardized format |
4 | ontologically represented |
5 | fully Linked Open Data as appropriate |
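The scoring steps and the two tables above can be sketched in Python roughly as follows. This is an illustration, not the extension's actual code: the function names are hypothetical, the HEAD response is passed in as a plain header dictionary (or `None` if the request failed) so the sketch needs no network access, and the file extension is taken from the resource URL. Which of the two score-0 messages applies to which failure, and the choice to give a score of 0 to a content type missing from the table, are assumptions.

```python
import mimetypes

# Copied from the content type table above.
SCORE_BY_CONTENT_TYPE = {
    "text/plain": 1, "text": 1, "txt": 1,
    "application/vnd.ms-excel": 2,
    "application/vnd.ms-excel.sheet.binary.macroenabled.12": 2,
    "application/vnd.ms-excel.sheet.macroenabled.12": 2,
    "application/vnd.openxmlformats-officedocument.spreadsheet.sheet": 2,
    "xls": 2,
    "text/csv": 3, "application/json": 3, "text/xml": 3,
    "csv": 3, "xml": 3, "json": 3,
    "application/rdf+xml": 4, "rdf": 4,
}

# Copied from the openness score reason table above (non-zero scores).
REASON_BY_SCORE = {
    1: "obtainable via web page",
    2: "machine readable format",
    3: "open and standardized format",
    4: "ontologically represented",
    5: "fully Linked Open Data as appropriate",
}


def resolve_content_type(url, head_headers, resource_format):
    """Try the three sources in the documented order; return None if all fail."""
    # 1. Guess from the file extension (assumed here to come from the URL).
    guessed, _encoding = mimetypes.guess_type(url)
    if guessed:
        return guessed.lower()
    # 2. Fall back to the Content-Type header of the HEAD response.
    headers = {name.lower(): value for name, value in head_headers.items()}
    header_value = headers.get("content-type")
    if header_value:
        # Drop parameters such as "; charset=utf-8".
        return header_value.split(";")[0].strip().lower()
    # 3. Fall back to the resource's format field.
    if resource_format:
        return resource_format.strip().lower()
    return None


def openness_score(url, head_headers, resource_format):
    """Return (score, reason). head_headers is None if the HEAD request failed."""
    if head_headers is None:
        return 0, "not obtainable"  # assumed pairing of message and failure
    content_type = resolve_content_type(url, head_headers, resource_format)
    # Assumption: a missing or unlisted content type scores 0.
    score = SCORE_BY_CONTENT_TYPE.get(content_type, 0)
    if score == 0:
        return 0, "unrecognised content type"
    return score, REASON_BY_SCORE[score]
```

A call such as `openness_score("http://example.org/download", {"Content-Type": "text/csv"}, None)` would skip the extension guess (the URL has none) and score the resource from the header.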
QA information is currently updated in two ways:
- Manually from the command line (see the extension Readme for details). The command can therefore be scheduled as a cron job and run regularly.
- New in CKAN 1.5.1: QA information is calculated automatically (in the background) for individual CKAN resources each time a new one is added, and each time an existing resource URL is changed.
QA results are currently saved to CKAN's TaskStatus table, and not onto resource objects directly.
Three key/value pairs are currently saved for each resource:
- openness_score
- openness_score_reason
- openness_score_failure_count
Each pair is stored with the following fields:
- entity_id: the ID of the CKAN resource
- entity_type: resource
- task_type: qa
- last_updated: time at which the QA task finished
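To make the shape of the stored data concrete, here is a small sketch that builds the three records for one resource as plain dictionaries. The `key` and `value` field names and the string encoding of values are assumptions about the TaskStatus layout, and the function name is hypothetical; the remaining fields come from the list above.

```python
from datetime import datetime, timezone


def build_qa_task_status_records(resource_id, score, reason, failure_count,
                                 finished_at=None):
    """Build one record per QA key/value pair for a single resource."""
    finished_at = finished_at or datetime.now(timezone.utc)
    values = {
        "openness_score": score,
        "openness_score_reason": reason,
        "openness_score_failure_count": failure_count,
    }
    return [
        {
            "entity_id": resource_id,     # the ID of the CKAN resource
            "entity_type": "resource",
            "task_type": "qa",
            "key": key,                   # assumed field name
            "value": str(value),          # assumed to be stored as text
            "last_updated": finished_at.isoformat(),
        }
        for key, value in values.items()
    ]
```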
QA information can be viewed at '/qa' on any CKAN instance that has the QA extension installed.
QA results can be read from the Task Status table using CKAN API v3.
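A rough sketch of reading one QA value over HTTP, assuming the instance exposes the `task_status_show` action under `/api/3/action/` and that it accepts `entity_id`, `task_type` and `key` as query parameters. The helper names, the site URL and the shape of the JSON response (`result` containing a `value` field) are assumptions to be checked against the API documentation of the CKAN version in use.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def task_status_url(site_url, resource_id, key):
    """Build the API v3 URL for one QA key of one resource."""
    query = urlencode({"entity_id": resource_id, "task_type": "qa", "key": key})
    return f"{site_url.rstrip('/')}/api/3/action/task_status_show?{query}"


def fetch_qa_value(site_url, resource_id, key, timeout=10):
    """Fetch the stored value, e.g. key="openness_score" (performs a network call)."""
    with urlopen(task_status_url(site_url, resource_id, key), timeout=timeout) as response:
        payload = json.load(response)
    # Assumed response shape: {"success": true, "result": {"value": ...}}
    return payload["result"]["value"]
```

Keeping the URL construction separate from the network call makes it easy to check the request that will be sent before pointing it at a live instance.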