# Compranet - Data science against corruption in Mexico
## About:
Procurement biddings are one of the biggest potential opportunities for government agents to divert funds intended for public projects. They constitute the clearest pathway to transfer resources between the public and the private sector, and are subject to a significant degree of simulation, driven by key agents involved in the process. This implies substantial losses in social welfare: government agencies pay for overpriced or low-quality products, and more valuable projects do not receive enough funding.
Although information on all procurement contracts is public by law, the data available through the platform does not fully permit interoperability with other public sources of information, which is essential to analyse complex phenomena such as corruption. Moreover, one of the most important tools for this task would be to understand, at least to a certain extent, the social networks that reduce the costs of collusion.
There have been very few and isolated efforts to reconstruct these networks in the Mexican context. Namely, [garrido2017] is an interesting result that exploits self-constructed data of educational relationships between government officials. However, no significant effort has been made in terms of reproducibility.
This project has three consecutive objectives: first, to identify and combine the most important sources of information regarding public procurements and transparency; then, to construct a historical work relations graph for government officials; lastly, to implement anomaly detection models in order to assess risk and filter out cases.
## Installation
To run the pipeline, clone this repository on your master node and run:
```
make init # Install requirements
make create # Create networks, volumes and build images
make setup # Install the compranet module
make run # Run the pipeline
```
### Dependencies
* git
* Python 3.5.2
* luigi
* PostGIS 2.1.4
* psql (PostgreSQL RDS)
* Headless chrome Docker - https://hub.docker.com/r/rsanchezavalos/python-headless-chromedriver/
* ...and more (see `requirements.txt`)
## Data Pipeline
After you create the environment, set up the pipeline_tasks in luigi.cfg.
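As a rough illustration of what reading such a setting could look like, the sketch below parses an INI-style configuration with Python's standard `configparser`. The `[pipeline_tasks]` section layout, the `ingest` and `etl` keys, and the `read_pipeline_tasks` helper are assumptions made for this example; the actual contents of this repository's `luigi.cfg` may differ.

```python
import configparser

# Hypothetical configuration text: the section and key names are assumptions,
# not the documented layout of this repository's luigi.cfg.
SAMPLE_CFG = """
[pipeline_tasks]
ingest = LocalIngest, LocalToS3, UpdateOutput, UpdateDB
etl = MergeDBs, CleanDB, SetNeo4J
"""


def read_pipeline_tasks(cfg_text):
    """Return a dict mapping each stage name to its list of task names."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return {
        stage: [name.strip() for name in names.split(",")]
        for stage, names in parser["pipeline_tasks"].items()
    }


if __name__ == "__main__":
    for stage, tasks in read_pipeline_tasks(SAMPLE_CFG).items():
        print(stage, tasks)
```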
The general process of the pipeline is:
* **Ingest:**
    * LocalIngest: Ingest data from multiple sources
    * LocalToS3: Upload to S3 and save historical copies by date
    * UpdateOutput: Preprocess to Output
    * UpdateDB: Update Postgres tables and create indexes (see commons/pg_raw_schemas)
* **ETL:**
    * MergeDBs: ETL processes to clean
    * CleanDB: SQL processes that create the clean tables
    * SetNeo4J: Download clean tables and upload to neo4j
    * CentralityClassifier: Get features and responses
* **Model:**
    * MissingClassifier: Create Missing Index
    * CentralityClassifier: Run centrality measures
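The stage list above can be read as a dependency chain in which running a late task first triggers everything upstream of it. The following is a minimal plain-Python sketch of that idea, not the repository's luigi code. The task names come from the list, but the `UPSTREAM` map (including where `MissingClassifier` attaches) and the `run_order` helper are assumptions inferred from the order in which the tasks are listed.

```python
# Hypothetical dependency map: each task lists the tasks assumed to run
# before it. The ordering is inferred from the list order, not from the code.
UPSTREAM = {
    "LocalIngest": [],
    "LocalToS3": ["LocalIngest"],
    "UpdateOutput": ["LocalToS3"],
    "UpdateDB": ["UpdateOutput"],
    "MergeDBs": ["UpdateDB"],
    "CleanDB": ["MergeDBs"],
    "SetNeo4J": ["CleanDB"],
    "MissingClassifier": ["CleanDB"],
    "CentralityClassifier": ["SetNeo4J"],
}


def run_order(target, upstream=UPSTREAM):
    """Return the tasks needed to build `target`, upstream tasks first."""
    order = []
    seen = set()

    def visit(task):
        # Skip tasks that were already scheduled through another path.
        if task in seen:
            return
        seen.add(task)
        for dependency in upstream[task]:
            visit(dependency)
        order.append(task)

    visit(target)
    return order


if __name__ == "__main__":
    print(run_order("CentralityClassifier"))
```

Requesting only the final task is enough to schedule the whole chain, which mirrors how a workflow manager such as luigi resolves the requirements of a task before running it.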
### Contributors
| [![taguerram][ph-thalia]][gh-thalia] | [![rsanchezavalos][ph-rsanchez]][gh-rsanchez] | [![monzalo14][ph-monica]][gh-monica] |
| :--: | :--: | :--: |
| [taguerram][gh-thalia] | [rsanchezavalos][gh-rsanchez] | [monzalo14][gh-monica] |
[ph-thalia]: https://avatars0.githubusercontent.com/u/20998351?v=3&s=460
[gh-thalia]: https://github.com/taguerram
[ph-monica]: https://avatars0.githubusercontent.com/u/16139907?v=3&s=460
[gh-monica]: https://github.com/monzalo14
[ph-rsanchez]: https://avatars.githubusercontent.com/u/10931011?v=3&s=460
[gh-rsanchez]: https://github.com/rsanchezavalos