stta9/healthcare-stroke-prediction

Healthcare: Stroke prediction

Installation

  1. Clone the repository:
git clone https://github.com/stta9/healthcare-stroke-prediction.git
cd healthcare-stroke-prediction
  2. Install the required packages:
pip install numpy
pip install pandas
pip install matplotlib
pip install seaborn
pip install scikit-learn
pip install imbalanced-learn

Dataset

This project uses the Healthcare Stroke Dataset. It contains health-related features for each patient (for example gender and bmi) and the target variable stroke, which indicates whether the patient had a stroke.

Data Preprocessing

Load the dataset:

import pandas as pd
df = pd.read_csv("/content/healthcare-dataset-stroke-data.csv.xls")

Remove unnecessary columns and duplicates:

df.drop(["id"], axis=1, inplace=True)
df.drop_duplicates(inplace=True)

Handle missing values:

df['bmi'] = df['bmi'].fillna(df['bmi'].mean())
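To make the imputation step concrete, here is a minimal sketch on a toy frame. Only the column name bmi comes from this README; the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame({"bmi": [22.0, np.nan, 30.0, 26.0, np.nan]})

# Count missing values before touching them.
missing_before = int(toy["bmi"].isna().sum())

# Mean of the observed values (NaN entries are skipped by default).
bmi_mean = toy["bmi"].mean()

# Replace each missing entry with that mean.
toy["bmi"] = toy["bmi"].fillna(bmi_mean)
missing_after = int(toy["bmi"].isna().sum())

print(missing_before, missing_after, bmi_mean)
```

The mean is pulled toward extreme values, so the median (toy["bmi"].median()) may be a reasonable alternative if bmi turns out to be skewed.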

Remove rows where gender is "Other":

df.drop(df[df['gender'] == "Other"].index, axis=0, inplace=True)
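The encoding and plotting snippets in this README refer to two lists, categorical and numerical, that are never defined here. One possible way to derive them from column dtypes is sketched below. This is an assumption, not necessarily how the original notebook built them, and the columns smoker and age are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame; `smoker` and `age` are hypothetical columns used for illustration.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "smoker": ["yes", "no", "no"],
    "age": [67.0, 45.0, 31.0],
    "bmi": [36.6, 28.1, 22.4],
    "stroke": [1, 0, 0],
})

# Leave the target out of both lists.
features = toy.drop(columns=["stroke"])

# Numeric dtypes go to `numerical`, everything else to `categorical`.
numerical = [c for c in features.columns if is_numeric_dtype(features[c])]
categorical = [c for c in features.columns if not is_numeric_dtype(features[c])]

print(categorical, numerical)
```

Flag columns stored as 0/1 integers would land in numerical under this rule and may need to be moved to categorical by hand.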

Encode categorical variables:

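from sklearn.preprocessing import LabelEncoder

# `categorical` is the list of categorical column names (not defined in this README)
label_encoder = LabelEncoder()
for col in categorical:
    # fit_transform refits the encoder, so each column gets its own integer mapping
    df[col] = label_encoder.fit_transform(df[col])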
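Reusing a single encoder means only the last column's mapping survives the loop. If the original labels are needed later, for instance to label plots, one option is to keep one fitted encoder per column. The sketch below uses a toy frame with a hypothetical smoker column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame; `smoker` is a hypothetical column.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "smoker": ["yes", "no", "yes"],
})
categorical = ["gender", "smoker"]

# Keep one fitted encoder per column so every mapping can be inverted later.
encoders = {}
for col in categorical:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# Classes are sorted, so "Female" maps to 0 and "Male" to 1.
decoded_gender = encoders["gender"].inverse_transform(toy["gender"]).tolist()
print(toy["gender"].tolist(), decoded_gender)
```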

Exploratory Data Analysis

Inspect column types and non-null counts:

df.info()

Visualize numerical features:

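import matplotlib.pyplot as plt
import seaborn as sns

# `numerical` is the list of numerical column names (not defined in this README)
plt.figure(figsize=(15, 5))
for i, column_name in enumerate(numerical):
    plt.subplot(1, len(numerical), i + 1)  # one panel per numerical feature
    sns.boxplot(x=df[column_name])
plt.show()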
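With default settings, a box plot draws points beyond 1.5 times the interquartile range from the box as outliers. The same rule can be applied numerically to count them. The helper iqr_outlier_mask below is a hypothetical name, and the values are invented.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, whisker: float = 1.5) -> pd.Series:
    """Return True for entries outside the box-plot whisker range."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whisker * iqr
    upper = q3 + whisker * iqr
    return (values < lower) | (values > upper)


# Toy values; only the column name `bmi` comes from the README.
bmi = pd.Series([20.0, 22.0, 24.0, 26.0, 28.0, 60.0], name="bmi")
mask = iqr_outlier_mask(bmi)
outliers = bmi[mask].tolist()
print(outliers)
```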

Visualize categorical features:

figure, axis = plt.subplots(4, 2, figsize=(10, 15))
plt.subplots_adjust(hspace=0.5, wspace=0.3)
for i, column_name in enumerate(categorical + ['stroke']):
    row = i // 2
    col = i % 2
    barplot = sns.countplot(ax=axis[row, col], x=df[column_name], hue=df[column_name])
    barplot.set_xticklabels(barplot.get_xticklabels(), rotation=30, size=8)
plt.show()

Modeling

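Split the data into training and testing sets, then standardize the features:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop(columns=['stroke'])
y = df['stroke']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
# Fit the scaler on the training split only, so test-set statistics do not leak into training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)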
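A pipeline is one way to guarantee that scaling is learned from the training split only. Passing stratify=y keeps the class ratio similar in both splits. The sketch below uses synthetic data from make_classification as a stand-in for the real features. The 90/10 class weights are an assumption, not a property taken from the dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real features and labels.
X, y = make_classification(
    n_samples=500, n_features=6, weights=[0.9, 0.1], random_state=42
)

# Stratify so both splits keep roughly the same share of positives.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# The scaler is fitted inside the pipeline, on the training data only.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(X_train.shape, X_test.shape)
```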

Train and evaluate models:

Decision Tree Classifier:

from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

tree_model = DecisionTreeClassifier(criterion='entropy')
tree_model.fit(X_train, y_train)
y_pred = tree_model.predict(X_test)
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred))

Logistic Regression:

from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
y_pred = lr_model.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred))

Random Forest Classifier:

from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=150, criterion='entropy', random_state=123)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred))
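Accuracy can look strong when one class is rare. This README does not report class counts, but the imbalanced-learn dependency suggests that may be the case here. The sketch below uses invented labels (95 negatives, 5 positives) to show how a baseline that always predicts the majority class scores on accuracy, recall and balanced accuracy.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Toy labels: 95 negatives and 5 positives (invented proportions).
y_true = np.array([0] * 95 + [1] * 5)
X_dummy = np.zeros((100, 1))  # features are ignored by this baseline

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_dummy, y_true)
y_pred = baseline.predict(X_dummy)

acc = accuracy_score(y_true, y_pred)           # share of correct predictions
rec = recall_score(y_true, y_pred)             # share of positives that were found
bal = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
print(acc, rec, bal)
```

The baseline reaches 0.95 accuracy while finding none of the positive cases, so comparing each model against it on recall or balanced accuracy gives a fuller picture than accuracy alone.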

Conclusion

Measured by accuracy on the 20% test split, Logistic Regression performed best of the three models tested.
