-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathindex.Rmd
301 lines (196 loc) · 13 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
---
title: "Data Science: Applied Text Mining"
output:
flexdashboard::flex_dashboard:
orientation: columns
vertical_layout: scroll
css: style.css
---
# Intro {.sidebar}
This dashboard covers the course materials for the course [S42: Data Science: Applied Text Mining](https://www.utrechtsummerschool.nl/courses/social-sciences/data-science-applied-text-mining).
---
<!-- <center> -->
<!-- ![](logo.jpg){width=100%} -->
<!-- </center> -->
<!-- --- -->
<!-- ADD COURSE INFO -->
<!-- Instructor: FILL -->
<!-- Study load: FILL -->
<!-- Assessment: FILL -->
<!-- --- -->
\
Course director: [Ayoub Bagheri](https://ayoubbagheri.nl/)
Instructors:
+ [Dong Nguyen](https://dongnguyen.nl/)\
- [Arjan van Dalfsen](https://www.uu.nl/staff/jadalfsen)\
- [Pablo Mosteiro](https://www.uu.nl/staff/PJMosteiroRomero)\
- [Jelte van Boheemen](https://www.uu.nl/medewerkers/JvanBoheemen)\
* [Ayoub Bagheri](https://ayoubbagheri.nl/)
Study load: 1.5 ECTS
Location: [Koningsberger Building, lecture hall Atlas](https://www.uu.nl/en/victor-j-koningsberger-building)
---
# Quick Overview
## Column 1
### Outline
Given the rapid rate at which text data are being digitally gathered in many domains of science, there is growing need for automated tools that can analyse, classify, and interpret this kind of data. Text mining techniques can be applied to create a structured representation of text, making its content more accessible for researchers. Applications of text mining are everywhere: social media, web search, advertising, emails, customer service, healthcare, marketing, etc. This course offers an extensive exploration into text mining with Python. The course has a strongly practical hands-on focus, and students will gain experience in using text mining on real data from for example social sciences and healthcare and interpreting the results. Through lectures and practicals, the students will learn the necessary skills to design, implement, and understand their own text mining pipeline. The topics in this course include preprocessing text, text classification, topic modeling, word embedding, deep learning models, and responsible text mining.
The course deals with the following topics:
* Review the fundamental approaches to text mining;
* Understand and apply current methods for analysing texts;
* Define a text mining pipeline given a practical data science problem;
* Implement all steps in a text mining pipeline: feature extraction, feature selection, model learning, model evaluation;
* Understand and apply state-of-the-art methods in text mining;
* Implement word embedding and advanced deep learning techniques.
The course starts with reviewing basic concepts of text mining and implementing advanced concepts in natural language processing. At the end of the week, participants will master advanced skills of text mining with Python.
### Requirements
Participants should have a basic knowledge of data science and programming and a motivation of scripting and programming in Python.
### Prerequisites
Participants are requested to bring their own laptop for the lab meetings.
### Certificate
Participants will receive a certificate at the end of the course.
### Additional references
1- Jurafsky, D., Martin, J.H. (2024). Speech and language processing, third edition. Find online chapters [here](https://web.stanford.edu/~jurafsky/slp3/)
2- Eisenstein, J. (2018). Natural Language Processing. Find online chapters [here](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
## Column 2
### Daily schedule
| Start time | End time | Type |
|:-----------|:---------|:----------|
| 09:00 | 10:30 | Lecture |
| |**Break** | |
| 10:50 | 12:00 | Practical |
| 12:00 | 12:30 | Discussion|
| |**[Lunch at Vening Meinesz building (A)](https://www.uu.nl/en/vening-meineszgebouw-a)** | |
| 14:00 | 15:20 | Lecture |
| |**Break** | |
| 15:30 | 16:30 | Practical |
| 16:30 | 17:00 | Discussion|
# How to prepare
## Column 1
### Preparing your machine for the course
Dear all,
This summer you will participate in the S42: Data Science: Applied Text Mining course in Utrecht, the Netherlands. To realize a steeper learning curve, we will use some functionality from the Python programming language using Google Colab. The below steps guide you through how to use Python and work on the practicals in this course.
If you follow this course online please have a look at [this instructional page on MS Teams.](https://www.uu.nl/en/education/quality-and-innovation/remote-teaching/microsoft-teams)
We look forward to see you all in Utrecht and online.
_The Applied Text Mining team_
### System requirements
Bring a laptop computer to the course and make sure that you have an Internet connection to be able to use Python in Google Colab. If you are using PyCharm or Jupyte Notebook, also check that you have full write access and administrator rights to the machine. We will explore programming and compiling in this course. Some corporate laptops come with limited access for their users, we therefore advise you to bring a personal laptop computer, if you have one.
### Python in Google Colab
Python is a general-purpose interpreted, interactive, object-oriented, and high-level programming language. It is a powerful environment for scientific computing.
We expect that many of you will have some experience with Python; for the rest of you, this section will serve as a quick crash course both on the Python programming language and on the use of Python in Google Colab:
Follow the [tutorial on Python in Google Colab for the Applied Text Mining course from here.](Python_in_Google_Colab_Applied_Text_Mining.html)
This tutorial is mainly from the [CS231n Python Tutorial With Google Colab](https://colab.research.google.com/github/cs231n/cs231n.github.io/blob/master/python-colab.ipynb#scrollTo=dzNng6vCL9eP).
## Column 2
### Alternative approach
I prefer to use Jupyter Notebook or PyCharm when I am not using Google Colab.
- You can find an extensive tutorial by JetBrains on how to install and use PyCharm from [here.](https://www.jetbrains.com/help/pycharm/quick-start-guide.html)
- A beginner’s tutorial on how to use Jupyter Notebook: [link](https://www.dataquest.io/blog/jupyter-notebook-tutorial/)
# Monday
## Column 1
### Materials
We adapt the course as we go. To ensure that you work with the latest iteration of the course materials, we advise all course participants to access the materials online. Lectures are provided in html and pdf formats. Practical files contain the exercises, in two versions, with and without solutions.
Here you will find the materials for Monday:
- Part 1: Introduction
- [Lecture 1](monday/Lectures/lecture 1/Lecture_1.html)
- [Lecture handout 1](monday/Lectures/lecture 1/Lecture handout_1.pdf)
- [Practical 1](monday/Practicals/Practical 1/Impractical 1.html)
- [Answers practical 1](monday/Practicals/Practical 1/Practical 1.html)
- [Data for practical 1](monday/Practicals/Practical 1/data.zip)
- Part 2: Text classification
- [Lecture 2](monday/Lectures/lecture 2/Lecture_2.pdf)
- [Lecture handout 2](monday/Lectures/lecture 2/Lecture handout_2.pdf)
- [Practical 2](monday/Practicals/Practical 2/Impractical 2.html)
- [Answers practical 2](monday/Practicals/Practical 2/Practical 2.html)
- [Data for practical 2](monday/Practicals/Practical 2/data.zip)
## Column 2
### Additional references
- Ref 1: Chapters 2, 3, 4, 5, 6
- Ref 2: Chapters 1, 2, 3, 4
# Tuesday
## Column 1
### Materials
We adapt the course as we go. To ensure that you work with the latest iteration of the course materials, we advise all course participants to access the materials online. Lectures are provided in html and pdf formats. Practical files contain the exercises, in two versions, with and without solutions.
Here you will find the materials for Tuesday:
- Part 3: Feature selection
- [Lecture 3](tuesday/Lectures/lecture 3/Lecture_3.html)
- [Lecture handout 3](tuesday/Lectures/lecture 3/Lecture handout_3.pdf)
- [Practical 3](tuesday/Practicals/Practical 3/Impractical 3.html)
- [Answers practical 3](tuesday/Practicals/Practical 3/Practical 3.html)
- [Data for practical 3](tuesday/Practicals/Practical 3/data.zip)
- Part 4: Text clustering & topic modeling
- [Lecture 4](tuesday/Lectures/lecture 4/Lecture_4.html)
- [Lecture handout 4](tuesday/Lectures/lecture 4/Lecture handout_4.pdf)
- [Practical 4](tuesday/Practicals/Practical 4/Impractical 4.html)
- [Answers practical 4](tuesday/Practicals/Practical 4/Practical 4.html)
- [Data for practical 4](tuesday/Practicals/Practical 4/bbcsport-fulltext.zip)
## Column 2
### Additional references
- Ref 2: Chapter 5
# Wednesday
## Column 1
### Materials
We adapt the course as we go. To ensure that you work with the latest iteration of the course materials, we advise all course participants to access the materials online. Lectures are provided in html and pdf formats. Practical files contain the exercises, in two versions, with and without solutions.
Here you will find the materials for Wednesday:
- Part 5: Word embedding
- [Lecture 5](wednesday/Lectures/lecture 5/Lecture_5.pdf)
- [Lecture handout 5](wednesday/Lectures/lecture 5/Lecture handout_5.pdf)
- [Practical 5](wednesday/Practicals/Practical 5/Impractical 5.html)
- [Answers practical 5](wednesday/Practicals/Practical 5/Practical 5.html)
- Part 6: Deep learning for text 1
- [Lecture 6](wednesday/Lectures/lecture 6/Lecture_6.html)
- [Lecture handout 6](wednesday/Lectures/lecture 6/Lecture handout_6.pdf)
- [Practical 6](wednesday/Practicals/Practical 6/Impractical 6.html)
- [Answers practical 6](wednesday/Practicals/Practical 6/Practical 6.html)
## Column 2
### Additional references
- Ref 1: Chapters 6, 7, 9
- Ref 2: Chapters 6, 14
# Thursday
## Column 1
### Materials
We adapt the course as we go. To ensure that you work with the latest iteration of the course materials, we advise all course participants to access the materials online. Lectures are provided in html and pdf formats. Practical files contain the exercises, in two versions, with and without solutions.
Here you will find the materials for Thursday:
- Part 7: Deep learning for text 2
- [Lecture 7](thursday/Lectures/lecture 7/Lecture_7.html)
- [Lecture handout 7](thursday/Lectures/lecture 7/Lecture handout_7.pdf)
- [Practical 7](thursday/Practicals/Practical 7/Impractical 7.html)
- [Answers practical 7](thursday/Practicals/Practical 7/Practical 7.html)
- Part 8: Sentiment analysis
- [Lecture 8](thursday/Lectures/lecture 8/Lecture_8.html)
- [Lecture handout 8](thursday/Lectures/lecture 8/Lecture handout_8.pdf)
- [Practical 8](thursday/Practicals/Practical 8/Impractical 8.html)
- [Answers practical 8](thursday/Practicals/Practical 8/Practical 8.html)
## Column 2
### Additional references
- Ref 1: Chapters 6, 7, 9
- Ref 2: Chapters 6, 14
# Friday
## Column 1
### Materials
We adapt the course as we go. To ensure that you work with the latest iteration of the course materials, we advise all course participants to access the materials online. Lectures are provided in html and pdf formats. Practical files contain the exercises, in two versions, with and without solutions.
Here you will find the materials for Friday:
- Part 9: Applications of text mining & NLP
- [Lecture 9](friday/Lectures/lecture 9/Lecture_9.pdf)
- [Lecture handout 9](friday/Lectures/lecture 9/Lecture handout_9.pdf)
- [Practical 9 (with an example answer)](friday/Practicals/Practical 9/Practical 9.html)
- [Data for practical 9](friday/Practicals/Practical 9/data.zip)
- Part 10: Responsible text mining
- [Lecture 10](friday/Lectures/lecture 10/Lecture_10.pdf)
- [Lecture handout 10](friday/Lectures/lecture 10/Lecture handout_10.pdf)
- [Practical 10: classification with BERT](friday/Practicals/Practical 10/Practical 10.html)
<!-- - [Answers practical 10](friday/Practicals/Practical 10/Impractical 10.html) -->
- [Additional practical: machine translation](friday/Practicals/Practical 10/Bonus/Bonus_Practical 10.html)
- [Data for additional practical: machine translation](friday/Practicals/Practical 10/Bonus/nld-eng.zip)
## Column 2
### Additional references
- Ref 1: Chapters 6, 7, 9
- Ref 2: Chapters 6, 14
# Archive
### Interesting links
- [Financial times: Generative AI exists because of the transformer](https://ig.ft.com/generative-ai/)
- [Talk: Intro to Large Language Models](https://youtu.be/zjkBMFhNj_g?si=cgExVbVuUV6HvqSx)
- [Talk: LSTM is dead. Long Live Transformers!](https://youtu.be/S27pHKBEp30?si=LtW0WydG_Ep0lXCC)
- [What Is ChatGPT Doing … and Why Does It Work?](https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/)
### All materials in a ZIP file
On the **last day of the course**, all the materials will be available in a compact file for download:
[Download the Materials](materials/Applied_TM_all_materials.zip)
We wish all the participants success with their Text Mining projects!
*The Applied Text Mining team*