andreapoli/developers-italia-backend

 
 


Backend and crawler for the OSS catalog of Developers Italia


Overview: how the crawler works

The crawler finds and retrieves the publiccode.yml files from the organizations registered on GitHub/Bitbucket/GitLab that are listed in the whitelist. It then creates the YAML files used by the Jekyll build chain to generate the static pages of developers.italia.it.

Dependencies and other related software

These are the dependencies and some useful tools used in conjunction with the crawler.

  • Elasticsearch 6.8.7 for storing the data. Elasticsearch should be running and ready to accept connections before the crawler starts

  • Kibana 6.8.7 for internal data visualization (optional)

  • Prometheus for collecting metrics (optional, currently supported but not used in production)

Tools

This is the list of tools used in the repository:

Setup and deployment processes

The crawler can either run directly on the target machine or be deployed as a Docker container, possibly using an orchestrator such as Kubernetes.

Up to now, the crawler and its dependencies have run as Docker containers on a virtual machine. Elasticsearch and Kibana have been deployed using a fork of the main project, called Search Guard. This setup is still deployed in production, and this readme refers to it as the "legacy deployment process".

To make the legacy installation more scalable and reliable, the code has recently been refactored. This readme refers to the resulting approach as the "new deployment process". It uses the official versions of Elasticsearch and Kibana and deploys the Docker containers on top of Kubernetes using Helm charts. While the crawler has its own Helm chart, Elasticsearch and Kibana are deployed using their official Helm charts. The new deployment process uses a docker-compose.yml file only to bring up a local development and test environment.

This section first describes how to build and run the crawler directly on a target machine. The procedure is the same one automated in the Dockerfile. The legacy and new Docker deployment procedures are described afterwards.

Manually configure and build the crawler

  • cd crawler

  • Fill in the domains.yml file with configuration values (e.g. host basic auth tokens)

  • Rename the config.toml.example file to config.toml and fill in the variables

NOTE: The application also supports environment variables as an alternative to the config.toml file. Remember that environment variables take priority over the values in the configuration file.
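The precedence rule in the note can be illustrated with a minimal shell sketch. It is not the crawler's actual configuration code: `resolve_setting` and `EXAMPLE_SETTING` are hypothetical names used only for illustration, and the sketch assumes that an empty environment variable counts as unset.

```shell
# Hypothetical helper (not part of the crawler): prints the value of the
# environment variable named by $1 if it is set and non-empty, otherwise
# the fallback value $2 (standing in for the value read from config.toml).
resolve_setting() {
  name="$1"
  file_value="$2"
  # Indirect lookup of the variable whose name is stored in $name.
  eval "env_value=\${$name-}"
  if [ -n "$env_value" ]; then
    printf '%s\n' "$env_value"
  else
    printf '%s\n' "$file_value"
  fi
}

# The environment wins whenever the variable is set.
EXAMPLE_SETTING=from-env
resolve_setting EXAMPLE_SETTING from-file
```

With `EXAMPLE_SETTING` unset, the same call would fall back to the value taken from the file.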

  • Build the crawler binary: make

  • Start the crawler: bin/crawler crawl whitelist/*.yml

  • Configure the crontab as desired
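For the crontab step, here is a minimal sketch of what an entry could look like. The schedule, the /opt/crawler path, the crawl.log file name and the `crawler_cron_entry` helper are all assumptions made for illustration; only the `bin/crawler crawl whitelist/*.yml` command comes from the steps above.

```shell
# Hypothetical helper: prints a crontab line that runs the crawler
# on the given schedule from the given directory.
crawler_cron_entry() {
  schedule="$1"      # five-field cron schedule, e.g. '0 2 * * *'
  crawler_dir="$2"   # assumed location of the crawler directory
  printf '%s cd %s && bin/crawler crawl whitelist/*.yml >> crawl.log 2>&1\n' \
    "$schedule" "$crawler_dir"
}

# Example: a nightly run at 02:00 (the schedule is an arbitrary choice).
crawler_cron_entry '0 2 * * *' /opt/crawler
```

The printed line can be pasted into the editor opened by `crontab -e`.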

Run the crawler

  • bin/crawler updateipa downloads IPA data and writes them into Elasticsearch

  • bin/crawler download-whitelist downloads organizations and repositories from the onboarding portal repository and saves them to a whitelist file

Docker: the legacy deployment process

This section describes how to set up and deploy the crawler following the legacy deployment process.

  • Rename .env-search-guard.example to .env and adapt its variables as needed

  • Rename elasticsearch-searchguard/config/searchguard/sg_internal_users.yml.example to elasticsearch-searchguard/config/searchguard/sg_internal_users.yml and insert the correct passwords. Hashed passwords can be generated with:

    docker exec -t -i developers-italia-backend_elasticsearch elasticsearch-searchguard/plugins/search-guard-6/tools/hash.sh -p <password>
  • Insert the Kibana password in kibana-searchguard/config/kibana.yml

  • Configure the Nginx proxy for the Elasticsearch host with the following directives:

    limit_req_zone $binary_remote_addr zone=elasticsearch_limit:10m rate=10r/s;
    
    server {
        ...
        location / {
            limit_req zone=elasticsearch_limit burst=20 nodelay;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
            proxy_pass http://localhost:9200;
            proxy_ssl_session_reuse off;
            proxy_cache_bypass $http_upgrade;
            proxy_redirect off;
        }
    }
    
  • To start Elasticsearch, you might need to run sysctl -w vm.max_map_count=262144 and make the setting permanent in /etc/sysctl.conf, as documented here

  • Start Docker: make up
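The vm.max_map_count step above asks for the setting to be made permanent in /etc/sysctl.conf. One possible way to do that without adding duplicate lines is sketched below. `persist_sysctl` is a hypothetical helper, not something shipped with the repository, and its optional third argument exists only so that the function can be tried on a scratch file first.

```shell
# Hypothetical helper: appends "key=value" to a sysctl configuration file
# unless the file already contains an entry for that key.
persist_sysctl() {
  key="$1"
  value="$2"
  file="${3:-/etc/sysctl.conf}"
  # The key is used as a regular expression, so dots match any character;
  # that is close enough for this sketch.
  if grep -q "^${key}[[:space:]]*=" "$file" 2>/dev/null; then
    return 0   # an entry already exists, leave it untouched
  fi
  printf '%s=%s\n' "$key" "$value" >> "$file"
}

# Usage as root on the Docker host (commented out on purpose):
#   sysctl -w vm.max_map_count=262144          # apply immediately
#   persist_sysctl vm.max_map_count 262144     # keep it across reboots
```

Note that an existing entry with a different value is left as-is, so it would have to be edited by hand.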

Docker: the new deployment process

The repository has a Dockerfile, which is also used to build the production image, and a docker-compose.yml file to facilitate local deployment.

The containers declared in the docker-compose.yml file rely on environment variables that should be declared in a .env file. The .env.example file contains example values. Before proceeding with the build, copy .env.example to .env and modify the environment variables as needed.
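The copy step can be sketched as a small guard that never overwrites an existing .env. `init_env_file` is a hypothetical helper name and the optional directory argument is an assumption added for illustration. The plain equivalent is `cp .env.example .env`.

```shell
# Hypothetical helper: creates .env from .env.example in the given
# directory (default: current directory) unless .env already exists.
init_env_file() {
  dir="${1:-.}"
  if [ -e "$dir/.env" ]; then
    echo "$dir/.env already exists, leaving it untouched" >&2
    return 0
  fi
  cp "$dir/.env.example" "$dir/.env"
}

# Usage from the repository root (commented out on purpose):
#   init_env_file
#   "${EDITOR:-vi}" .env    # then adapt the variables as needed
```

Skipping the copy when .env is already present avoids silently discarding local values when the step is repeated.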

To build the crawler container, download its dependencies and start them all, run:

docker-compose up [-d] [--build]

where:

  • -d runs the containers in the background

  • --build forces the container images to be rebuilt

To destroy the containers, use:

docker-compose down

Xpack

By default, the system (specifically Elasticsearch) doesn't make use of xpack, and therefore of passwords and certificates. To achieve this, the Elasticsearch container mounts this configuration file. This makes things work out of the box, but it is not appropriate for production environments.

An alternative configuration file that enables xpack is available here. In order to use it, you should

At this point you can bring up the environment with docker-compose.

Troubleshooting Q/A

  • From the Docker logs, it seems that the Elasticsearch container needs more virtual memory, and it is now stuck at "Stalling for Elasticsearch..."

    Increase the container's virtual memory: https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html#docker-cli-run-prod-mode

  • When trying to build the crawler image with make build, a fatal memory error occurs: "fatal error: out of memory"

    You probably need to increase the container memory: docker-machine stop && VBoxManage modifyvm default --cpus 2 && VBoxManage modifyvm default --memory 2048 && docker-machine start

See also

Authors

Developers Italia is a project by AgID and the Italian Digital Team, which developed the crawler and maintains this repository.
