explore large multilingual documentary collections from their semantics

TBFY/search-API

Basic Overview

Explore collections of multilingual public procurement data through a RESTful API:

  • /documents : list of existing documents
  • /documents/{id} : details of a document
  • /documents/{id}/items : similar documents

Or search for documents similar to a given text:

  • /items : similar documents

Quick Start

  1. A Swagger-based API is available online at:
    http://tbfy.librairy.linkeddata.es/search-api
  2. Get the list of available documents, and filter by language or source, using /documents:
    http://tbfy.librairy.linkeddata.es/search-api/documents
  3. Get the content, and additional information, of a document through /documents/{id}:
    http://tbfy.librairy.linkeddata.es/search-api/documents/jrc32002D0996-en
  4. Obtain similar documents, regardless of language, through /documents/{id}/items:
    http://tbfy.librairy.linkeddata.es/search-api/documents/jrc32002D0996-en/items
  5. To obtain only documents in Spanish, just add lang=es to the query:
    http://tbfy.librairy.linkeddata.es/search-api/documents/jrc32002D0996-en/items?lang=es
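
The GET requests above can be sketched in Python using only the standard library. This is a minimal illustration, not part of the project: the helper names `documents_url` and `fetch` are hypothetical, and `lang` is the only query parameter confirmed by the examples above. Any other filter name you pass (for example one for the source) is an assumption to check against the Swagger page.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://tbfy.librairy.linkeddata.es/search-api"


def documents_url(doc_id=None, similar=False, **params):
    """Build the URL for /documents, /documents/{id} or /documents/{id}/items."""
    parts = [BASE_URL, "documents"]
    if doc_id is not None:
        parts.append(quote(doc_id, safe=""))
        if similar:
            parts.append("items")
    url = "/".join(parts)
    # Drop unset parameters so they are not sent as the literal string "None".
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{url}?{query}" if query else url


def fetch(url, timeout=30):
    """Perform the GET request and return the raw response body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example (requires network access):
# print(fetch(documents_url("jrc32002D0996-en", similar=True, lang="es")))
```

Keeping URL construction separate from the network call makes it easy to inspect the exact request before sending it.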

You can also search for documents similar to a free text. Send an HTTP POST request to /items with a JSON body like this:

{
  "size": 10,
  "source": "jrc",
  "text": "Council Directive 9343EEC on the hygiene of foodstuffs as regards the transport of bulk liquid oils and fats by seaText with EEA relevance."
}

To obtain only documents in Spanish, add "lang": "es" to the JSON:

{
  "size": 10,
  "source": "jrc",
  "text": "Council Directive 9343EEC on the hygiene of foodstuffs as regards the transport of bulk liquid oils and fats by seaText with EEA relevance.",
  "lang":"es"
}
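
A POST request carrying such a JSON body could be sketched in Python as follows. Treat it as an illustration under assumptions: the full URL `ITEMS_URL` is inferred by appending /items to the base address used in the Quick Start, the `Content-Type` header is a guess at what the service expects, and the helper names `build_items_request` and `search_similar` are hypothetical.

```python
import json
from urllib.request import Request, urlopen

# Assumption: the text search lives at /items under the same base address.
ITEMS_URL = "http://tbfy.librairy.linkeddata.es/search-api/items"


def build_items_request(text, size=10, source="jrc", lang=None):
    """Build (but do not send) the POST request for a free-text search."""
    payload = {"size": size, "source": source, "text": text}
    if lang is not None:
        # Restrict the results to one language, e.g. "es" for Spanish.
        payload["lang"] = lang
    return Request(
        ITEMS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search_similar(text, **options):
    """Send the request and return the raw response body as text."""
    with urlopen(build_items_request(text, **options), timeout=30) as response:
        return response.read().decode("utf-8")


# Example (requires network access):
# print(search_similar("hygiene of foodstuffs", lang="es"))
```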

Index Documents

  1. Download the latest data dump available at Zenodo:
    https://doi.org/10.5281/zenodo.3783736
  2. Unzip it, for example in /tmp. A folder is created per month.
  3. Download the indexing script. It is implemented in Python, but can easily be ported to other languages:
    http://tbfy.librairy.linkeddata.es/search-api/src/main/python/index-tenders.py
  4. Edit it to set the root directory where the documents are. For example /tmp:
    main('/tmp/20*')
    
    As the example shows, the path may contain * wildcards to restrict which directories are indexed.
  5. Run it! That's it.
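
To see what a path such as /tmp/20* would select, the following sketch expands a wildcard pattern with Python's standard `glob` module. It assumes the indexing script resolves its path in a similar way, which this README does not confirm. The folder names in `demo` are made up for illustration and do not reflect the real layout of the dump.

```python
import glob
import os
import tempfile


def select_directories(pattern):
    """Return the directories matching a wildcard pattern, sorted by name."""
    return sorted(p for p in glob.glob(pattern) if os.path.isdir(p))


def demo():
    """Create hypothetical monthly folders and select those starting with '20'."""
    with tempfile.TemporaryDirectory() as root:
        for name in ("2019-12", "2020-01", "2020-02", "notes"):
            os.makedirs(os.path.join(root, name))
        matched = select_directories(os.path.join(root, "20*"))
        return [os.path.basename(p) for p in matched]


print(demo())
```

A narrower pattern such as `2020*` would restrict the selection to the folders of a single year.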

More info here

Latest Stable Release

This tool is part of the librAIry ecosystem, and needs librAIry-API for deployment.

  • It can be started as a service via docker-compose.yml
  • Or through Maven dependencies:
    1. Add the JitPack repository to your build file:
        <repositories>
          <repository>
            <id>jitpack.io</id>
            <url>https://jitpack.io</url>
          </repository>
        </repositories>
    2. Add the dependency:
        <dependency>
          <groupId>com.github.TBFY</groupId>
          <artifactId>search-API</artifactId>
          <version>last-stable-release-version</version>
        </dependency>

Contributing

Please take a look at our contributing guidelines if you're interested in helping!