Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding READme file to Github-Automated-Analysis #1647

Closed
wants to merge 1 commit into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
62 changes: 62 additions & 0 deletions Data Analysis/Github-Automated-Analysis/READme.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,62 @@
# **GitHub-Automated-Analysis**

## **GOAL**

The objective of this project is to perform automated analysis on GitHub repositories, leveraging data-driven insights to assess repository metrics, developer activity, and project trends. Using a dataset of GitHub repository statistics and user engagement data, the project generates automated reports to facilitate decision-making and forecast repository performance.

## **WHAT HAVE I DONE**

1. **Data Collection**
- Extracted data from GitHub API on repository statistics.
- Gathered information on stars, forks, issues, pull requests, and commit history.

2. **Data Cleaning and Preprocessing**
- Managed missing values and standardized data formats.
- Converted timestamps to datetime objects for analysis.

3. **Exploratory Data Analysis (EDA)**
- Analyzed trends in stars, forks, and commits over time.
- Visualized contributor activity patterns and issue response rates.
- Examined correlation between repository metrics.

4. **Feature Engineering**
- Created new features from repository data (e.g., activity score, engagement rate).
- Applied one-hot encoding for categorical data.

5. **Data Splitting**
- Split the data into training and testing sets for model evaluation.

6. **Modeling and Forecasting**
- **Linear Regression**: Accuracy - 85.5%
- **Decision Tree**: Accuracy - 94.7%
- **Random Forest**: Accuracy - 98.9%
- **XGBoost**: Accuracy - 99.2%
- Fine-tuned models to improve predictive performance.

7. **Model Persistence**
- Saved trained models (Random Forest and XGBoost) for future predictions.
- Loaded models to generate automated repository reports.

## **MODELS USED**

- **Linear Regression**: Modeled relationships between repository metrics and overall performance trends.
- **Random Forest**: Employed ensemble learning for improved prediction accuracy using decision trees.
- **Decision Tree**: Created a model based on key features for rapid predictions.
- **XGBoost**: Utilized gradient boosting for optimal forecasting results.
- **GridSearchCV**: Optimized hyperparameters to enhance model accuracy.

## **LIBRARIES NEEDED**

- `numpy`
- `pandas`
- `matplotlib`
- `seaborn`
- `scikit-learn`
- `datetime`
- `xgboost`
- `requests` (for GitHub API)
- `pickle`

## **CONCLUSION**

The project highlights an automated approach to analyzing GitHub repositories, assessing metrics that reflect project engagement and growth potential. Using machine learning models, it achieves high accuracy in forecasting repository trends, with the XGBoost model performing the best at 99.2% accuracy. The results emphasize the benefits of automated insights for project management and community engagement.