Healthcare: Stroke prediction

Installation

Clone the repository:

git clone https://github.com/stta9/healthcare-stroke-prediction.git
cd healthcare-stroke-prediction

Install the required packages:

pip install numpy pandas matplotlib seaborn scikit-learn imbalanced-learn

Dataset

This project uses the Healthcare Stroke Dataset, included in the repository as healthcare-dataset-stroke-data.csv.xls. It contains patient health features (for example gender and bmi) and the target variable stroke, which indicates whether the patient had a stroke.

Data Preprocessing

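Load the dataset. The file is read as CSV despite its .xls suffix; adjust the path to wherever the file lives on your machine:

import pandas as pd
df = pd.read_csv("/content/healthcare-dataset-stroke-data.csv.xls")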

Remove unnecessary columns and duplicates:

df.drop(["id"], axis=1, inplace=True)
df.drop_duplicates(inplace=True)
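The order of the two steps above matters: while the id column is present, every row is unique, so duplicates only become detectable after it is dropped. A minimal sketch on a made-up frame (the toy values are assumptions, not rows from the dataset):

```python
import pandas as pd

# Made-up rows for illustration: the first two differ only in `id`.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "age": [50, 50, 61],
    "stroke": [0, 0, 1],
})

# With `id` present, no row is an exact duplicate of another.
rows_with_id = len(toy.drop_duplicates())

# Dropping `id` first exposes the repeated (age, stroke) pair.
deduped = toy.drop(columns=["id"]).drop_duplicates()

print(rows_with_id, len(deduped))  # 3 2
```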

Fill missing bmi values with the column mean:

df['bmi'] = df['bmi'].fillna(df['bmi'].mean())
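To see what mean imputation does, here is a small sketch on a made-up column (the values are assumptions for illustration only). pandas skips NaN when computing the mean, so the fill value comes from the observed entries alone:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame({"bmi": [20.0, np.nan, 30.0, np.nan]})

# Series.mean() ignores NaN by default, so the fill value is (20 + 30) / 2.
bmi_mean = toy["bmi"].mean()
toy["bmi"] = toy["bmi"].fillna(bmi_mean)

print(toy["bmi"].tolist())  # [20.0, 25.0, 30.0, 25.0]
```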

Remove the rows whose gender is "Other":

df.drop(df[df['gender'] == "Other"].index, axis=0, inplace=True)

Encode categorical variables:

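from sklearn.preprocessing import LabelEncoder

# `categorical` must be a list of the categorical column names; it is not defined in this README.
label_encoder = LabelEncoder()
for col in categorical:
    df[col] = label_encoder.fit_transform(df[col])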
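The list categorical is used above but never defined in this README. One possible way to build it, shown here as a sketch and not necessarily what the notebook does, is to treat every non-numeric column as categorical. The toy frame, its values, and the encoders dictionary are assumptions for illustration. The sketch also keeps one encoder per column, because a single reused LabelEncoder only remembers the mapping of the last column it was fitted on:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame; the rows are made up for illustration.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "smoking_status": ["smokes", "never smoked", "smokes"],
    "age": [61.0, 45.0, 79.0],
})

# Assumption: every non-numeric column is categorical.
categorical = toy.select_dtypes(exclude="number").columns.tolist()

# Hypothetical helper dict: one fitted encoder per column, so codes can be mapped back.
encoders = {}
for col in categorical:
    encoder = LabelEncoder()
    toy[col] = encoder.fit_transform(toy[col])
    encoders[col] = encoder

# LabelEncoder assigns codes in sorted class order: Female -> 0, Male -> 1.
print(toy["gender"].tolist())            # [1, 0, 0]
print(list(encoders["gender"].classes_))  # ['Female', 'Male']
```

Note that a dtype-based rule misses categorical columns that are already stored as 0/1 integers; those would have to be listed by hand.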

Exploratory Data Analysis

Inspect column types and non-null counts:

df.info()

Visualize numerical features:

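import matplotlib.pyplot as plt
import seaborn as sns

# `numerical` must be a list of the numerical column names; it is not defined in this README.
plt.figure(figsize=(15, 5))
for i, column_name in enumerate(numerical):
    plt.subplot(1, len(numerical), i + 1)
    sns.boxplot(x=df[column_name])
plt.show()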
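By default, a box plot draws as individual outlier points any values that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same rule can be applied numerically, which is handy when the plots are not at hand. A minimal sketch on made-up values (the series name and numbers are assumptions):

```python
import pandas as pd

# Made-up values for illustration; 250.0 is deliberately extreme.
glucose = pd.Series([80.0, 90.0, 100.0, 110.0, 250.0])

# Quartiles and interquartile range.
q1, q3 = glucose.quantile(0.25), glucose.quantile(0.75)
iqr = q3 - q1

# Default whisker limits: 1.5 * IQR beyond each quartile.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = glucose[(glucose < lower) | (glucose > upper)]

print(lower, upper)       # 60.0 140.0
print(outliers.tolist())  # [250.0]
```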

Visualize categorical features:

# One count plot per categorical column plus the target; the 4x2 grid holds up to 8 plots.
figure, axis = plt.subplots(4, 2, figsize=(10, 15))
plt.subplots_adjust(hspace=0.5, wspace=0.3)
for i, column_name in enumerate(categorical + ['stroke']):
    row, col = divmod(i, 2)
    ax = sns.countplot(ax=axis[row, col], x=df[column_name], hue=df[column_name])
    # Rotate and shrink the existing tick labels in place instead of re-setting them.
    ax.tick_params(axis='x', labelrotation=30, labelsize=8)
plt.show()

Modeling

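Split the data into training and testing sets, then standardize the features:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop(columns=['stroke'])
y = df['stroke']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
# Fit the scaler on the training split only so test-set statistics do not leak into training.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)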
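The split above is purely random. When the positive class is rare, an optional variation is to pass stratify to train_test_split so that both splits keep the same class ratio. This is a sketch of that option on made-up labels, not what the README's code does; the array names and the 10% positive rate are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 10 of them positive.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 10 + [0] * 90)

# stratify=y_toy preserves the 10% positive rate in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=y_toy
)

print(int(y_tr.sum()), int(y_te.sum()))  # 8 2
```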

Train and evaluate models:

Decision Tree Classifier:

from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

tree_model = DecisionTreeClassifier(criterion='entropy')
tree_model.fit(X_train, y_train)
y_pred = tree_model.predict(X_test)
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred))

Logistic Regression:

from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
y_pred = lr_model.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred))

Random Forest Classifier:

from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=150, criterion='entropy', random_state=123)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred))

Conclusion

Judged by accuracy on the held-out 20% test set, Logistic Regression performed best of the three models tested.
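Accuracy alone can be misleading when one class is rare, which is plausibly the case here since the installation step pulls in imbalanced-learn. The sketch below uses made-up labels (the 5% positive rate is an assumption, not a figure from the dataset) to show a predictor that never predicts a stroke scoring high accuracy while finding none of the positive cases. Checking recall and the confusion matrix alongside accuracy guards against this:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels: 95 negatives and 5 positives.
y_true = np.array([0] * 95 + [1] * 5)

# A trivial "model" that always predicts the majority class.
y_pred_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred_majority)
rec = recall_score(y_true, y_pred_majority)
cm = confusion_matrix(y_true, y_pred_majority)

print(acc)  # 0.95
print(rec)  # 0.0
print(cm.tolist())  # [[95, 0], [5, 0]]
```

The same three calls can be applied to y_test and each model's y_pred from the Modeling section.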
