- Clone the repository:
git clone https://github.com/stta9/healthcare-stroke-prediction.git
cd healthcare-stroke-prediction
- Install the required packages:
pip install numpy
pip install pandas
pip install matplotlib
pip install seaborn
pip install scikit-learn
pip install imbalanced-learn
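To confirm the installation worked, one option is to check that the interpreter can locate each package. This is a minimal sketch: the helper name `check_packages` is hypothetical, and note that scikit-learn and imbalanced-learn are imported as `sklearn` and `imblearn` rather than under their pip names.

```python
from importlib.util import find_spec

# Import names differ from pip names for two packages:
# scikit-learn -> sklearn, imbalanced-learn -> imblearn.
REQUIRED = ["numpy", "pandas", "matplotlib", "seaborn", "sklearn", "imblearn"]


def check_packages(names):
    """Return a dict mapping each import name to True if it can be found."""
    return {name: find_spec(name) is not None for name in names}


status = check_packages(REQUIRED)
for name, ok in status.items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```

Any package reported as `MISSING` can be installed again with the matching `pip install` command above.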
This project uses the Healthcare Stroke Dataset, which contains patient health metrics and a target variable, stroke, indicating whether the patient had a stroke.
Load the dataset (adjust the path to wherever the file is stored):
import pandas as pd
df = pd.read_csv("/content/healthcare-dataset-stroke-data.csv.xls")
Remove unnecessary columns and duplicates:
df.drop(["id"], axis=1, inplace=True)
df.drop_duplicates(inplace=True)
Handle missing values:
df['bmi'] = df['bmi'].fillna(df['bmi'].mean())
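The effect of mean imputation is easy to verify on a small example. The sketch below uses a made-up four-row frame in place of the real `bmi` column. It counts the gaps, computes the mean (which skips `NaN` by default), and fills the gaps with it.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real `bmi` column (values are made up).
toy = pd.DataFrame({"bmi": [20.0, np.nan, 30.0, 40.0]})

n_missing = toy["bmi"].isna().sum()  # count gaps before filling
bmi_mean = toy["bmi"].mean()         # mean() skips NaN by default
toy["bmi"] = toy["bmi"].fillna(bmi_mean)

print(n_missing, bmi_mean)
print(toy["bmi"].tolist())
```

Running the same `isna().sum()` check on the real frame before and after the fill is a cheap way to confirm that no missing values remain.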
Remove irrelevant entries:
df.drop(df[df['gender'] == "Other"].index, axis=0, inplace=True)
Encode categorical variables:
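from sklearn.preprocessing import LabelEncoder
# `categorical` is the list of categorical column names (its definition is not shown here)
label_encoder = LabelEncoder()
for col in categorical:
    df[col] = label_encoder.fit_transform(df[col])
df.info()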
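Reusing a single `LabelEncoder` across columns keeps only the last column's mapping. If the integer codes need to be translated back to labels later, one option is to keep a fitted encoder per column. The sketch below is an illustration on a made-up frame; `ever_married` is assumed here as a second categorical column, and the `encoders` dictionary is a hypothetical helper, not part of the original code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame; the rows are made up for illustration.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "ever_married": ["Yes", "No", "Yes"],
})

encoders = {}  # one fitted encoder per column, so labels can be recovered
for col in ["gender", "ever_married"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ is sorted, and a class's position is its integer code.
mapping = {col: list(enc.classes_) for col, enc in encoders.items()}
print(mapping)
print(encoders["gender"].inverse_transform([1, 0]))
```

Because `classes_` is sorted alphabetically, the codes carry no meaningful order, which matters for models that treat them as numeric magnitudes.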
Visualize numerical features:
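import matplotlib.pyplot as plt
import seaborn as sns
# `numerical` is the list of numerical column names (its definition is not shown here)
plt.figure(figsize=(15, 5))
for i in range(len(numerical)):
    plt.subplot(1, len(numerical), i + 1)  # one panel per feature, side by side
    sns.boxplot(x=df[numerical[i]])
plt.show()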
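By default, a box plot draws points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles as individual outliers. The same rule can be applied numerically to count them. The sketch below is an illustration under that default; the function name `iqr_outlier_mask` and the sample values are made up.

```python
import numpy as np


def iqr_outlier_mask(values, k=1.5):
    """Boolean mask of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


sample = [70, 80, 85, 90, 95, 100, 105, 250]  # made-up glucose-like values
mask = iqr_outlier_mask(sample)
print(np.asarray(sample)[mask])
```

Here the quartiles are 83.75 and 101.25, so the fences sit at 57.5 and 127.5 and only the value 250 falls outside them.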
Visualize categorical features:
figure, axis = plt.subplots(4, 2, figsize=(10, 15))
plt.subplots_adjust(hspace=0.5, wspace=0.3)
for i, column_name in enumerate(categorical + ['stroke']):
    # Place each plot in a 4x2 grid, filling rows left to right
    row = i // 2
    col = i % 2
    barplot = sns.countplot(ax=axis[row, col], x=df[column_name], hue=df[column_name])
    # Rotate and shrink the tick labels so long category names stay readable
    barplot.tick_params(axis='x', labelrotation=30, labelsize=8)
plt.show()
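The counts that a count plot displays can also be read off numerically with `value_counts`, which is handy for checking how balanced the target is. The sketch below uses made-up labels rather than the real `stroke` column.

```python
import pandas as pd

# Made-up labels: 19 patients without a stroke, 1 with.
toy = pd.DataFrame({"stroke": [0] * 19 + [1]})

counts = toy["stroke"].value_counts()                # absolute counts per class
shares = toy["stroke"].value_counts(normalize=True)  # fraction per class
print(counts.to_dict())
print(shares.to_dict())
```

On the real data, `df['stroke'].value_counts(normalize=True)` gives the class shares directly.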
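Split the data into training and testing sets, then standardize the features:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = df.drop(columns=['stroke'])
y = df['stroke']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
# Fit the scaler on the training split only, so test-set statistics do not leak into training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)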
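Two details of this step are worth checking on a small example: fitting the scaler on the training split only keeps test-set statistics out of preprocessing, and passing `stratify=y` keeps the class ratio the same in both splits. The sketch below uses synthetic data and is an illustration, not the project's pipeline; `stratify=y` in particular is an addition that the original call does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic features
y = np.array([0] * 180 + [1] * 20)                   # synthetic 10% positives

# stratify=y keeps the class ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics from train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics

print(X_train_scaled.mean(axis=0).round(6))
print(y_train.mean(), y_test.mean())
```

The scaled training features have zero mean and unit variance by construction, while the test features are only approximately standardized, which is the expected behavior.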
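Train three classifiers and report the test accuracy of each:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Decision tree using entropy (information gain) as the split criterion
tree_model = DecisionTreeClassifier(criterion='entropy')
tree_model.fit(X_train, y_train)
y_pred = tree_model.predict(X_test)
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred))
# Logistic regression with default settings
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
y_pred = lr_model.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred))
# Random forest of 150 entropy-based trees, seeded for reproducibility
rf_model = RandomForestClassifier(n_estimators=150, criterion='entropy', random_state=123)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred))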
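The three fit, predict, and score steps follow the same pattern, so they can be written as one loop over a dictionary of models. The sketch below runs on synthetic data from `make_classification` as a stand-in for the prepared stroke features. The `random_state=0` on the decision tree and `max_iter=1000` on the logistic regression are assumptions added for reproducibility and convergence, not settings from the original code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared stroke features (not the real data).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

models = {
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(
        n_estimators=150, criterion="entropy", random_state=123
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)  # train on the training split
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name} Accuracy: {scores[name]:.3f}")
```

Collecting the results in `scores` makes it straightforward to pick the best model with `max(scores, key=scores.get)`.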
Based on accuracy on the held-out test set, Logistic Regression performed best of the three models tested.
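Accuracy alone can hide poor performance when the positive class is rare. The sketch below uses made-up labels, not the project's results, to show how a predictor that never predicts a stroke still scores high accuracy while detecting no cases. Recall and the confusion matrix make that visible.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels: 95 patients without a stroke, 5 with.
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)     # share of true stroke cases detected
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted

print(acc, rec)
print(cm)
```

If stroke cases are a small minority in the data, reporting recall or a confusion matrix next to accuracy gives a fuller picture of each model.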