generated from Pakillo/quarto-course-website-template
Commit: latent text analysis

Showing 5 changed files with 193 additions and 0 deletions.

_freeze/category/textanalysis/1-latent-semantic-analysis/execute-results/html.json (15 additions, 0 deletions)

Large diff not rendered by default.

_freeze/category/textanalysis/1-text-analysis/execute-results/html.json (11 additions, 0 deletions)

@@ -0,0 +1,11 @@
{
  "hash": "d41cd0b55e16578aa27784e4e5b0135f",
  "result": {
    "markdown": "---\ntitle: \"1-text-analysis\"\nauthor: math4mad\ncode-fold: true\n---\n\n::: {.cell execution_count=1}\n``` {.julia .cell-code}\nusing MLJ\nimport TextAnalysis\nTfidfTransformer = @load TfidfTransformer pkg=MLJText\ndocs=[\"Romeo and Juliet\",\n \"Juliet: O happy dagger!\",\n \"Romeo died by dagger\",\n \"“Live free or die”, that’s the New-Hampshire’s motto.\",\n \"Did you know, New-Hampshire is in New-England\"]\ntfidf_transformer = TfidfTransformer()\ntokenized_docs = TextAnalysis.tokenize.(docs)\nmach = machine(tfidf_transformer, tokenized_docs)\nfit!(mach)\n\nfitted_params(mach)\n\ntfidf_mat = transform(mach, tokenized_docs)|>Matrix\nvcat(tokenized_docs...)|>Set\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nimport MLJText ✔\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\n[ Info: For silent loading, specify `verbosity=0`. \n[ Info: Training machine(TfidfTransformer(max_doc_freq = 1.0, …), …).\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=22}\n```\nSet{String} with 30 elements:\n \"dagger\"\n \"!\"\n \"is\"\n \"Juliet\"\n \"and\"\n \"O\"\n \"happy\"\n \"by\"\n \"Live\"\n \"free\"\n \",\"\n \"or\"\n \"that\"\n \"Romeo\"\n \"’\"\n \"motto\"\n \"New-England\"\n \"in\"\n \"s\"\n \".\"\n \"died\"\n \":\"\n \"you\"\n \"the\"\n \"Did\"\n ⋮ \n```\n:::\n:::\n\n\n",
    "supporting": [
      "1-text-analysis_files"
    ],
    "filters": [],
    "includes": {}
  }
}

_freeze/category/textanalysis/2-latent-semantic-analysis/execute-results/html.json (15 additions, 0 deletions)

Large diff not rendered by default.

category/textanalysis/1-text-analysis.qmd (25 additions, 0 deletions)

@@ -0,0 +1,25 @@
---
title: "1-text-analysis"
author: math4mad
code-fold: true
---

```{julia}
using MLJ
import TextAnalysis

TfidfTransformer = @load TfidfTransformer pkg=MLJText

docs = ["Romeo and Juliet",
    "Juliet: O happy dagger!",
    "Romeo died by dagger",
    "“Live free or die”, that’s the New-Hampshire’s motto.",
    "Did you know, New-Hampshire is in New-England"]

tfidf_transformer = TfidfTransformer()
tokenized_docs = TextAnalysis.tokenize.(docs)

# fit the TF-IDF transformer on the tokenized documents
mach = machine(tfidf_transformer, tokenized_docs)
fit!(mach)
fitted_params(mach)

tfidf_mat = transform(mach, tokenized_docs) |> Matrix   # TF-IDF matrix, one row per document
vcat(tokenized_docs...) |> Set                          # vocabulary: the set of unique tokens
```
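
As a follow-up, the documents can be compared pairwise. A minimal sketch, assuming `tfidf_mat` above holds one row per document; `cosine` is a helper introduced here:

```{julia}
using LinearAlgebra

# cosine similarity between two TF-IDF vectors
cosine(a, b) = dot(a, b) / (norm(a) * norm(b))

n = size(tfidf_mat, 1)
sim = [cosine(tfidf_mat[i, :], tfidf_mat[j, :]) for i in 1:n, j in 1:n]
```
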
category/textanalysis/2-latent-semantic-analysis.qmd (127 additions, 0 deletions)

@@ -0,0 +1,127 @@
---
title: "2-Latent Semantic Analysis"
author: math4mad
code-fold: true
---

:::{.callout-note title="Introduction"}
Reference: [Latent Semantic Analysis](https://www.engr.uvic.ca/~seng474/svd.pdf)

This tutorial works with several short documents that in fact come from only two sources: Shakespeare's Romeo and Juliet, and a description of geographic places. The question is whether a mathematical method can uncover these two groups, and whether a query document can then be assigned to the right group by a distance measure.
:::

## 1. Eigenvalues and Eigenvectors

For a diagonal matrix, the eigenvalues are simply the diagonal entries and the eigenvectors are the standard basis vectors. For example,

$$\begin{bmatrix}
4 & 0 & 0 \\
0 & 3 & 0 \\
0 & 0 & 2
\end{bmatrix}$$

has eigenvalues 4, 3, and 2. The SVD used below generalizes this kind of decomposition to arbitrary rectangular matrices.

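A quick numerical check (a sketch added for illustration; `diagm` and `eigen` are from the standard library):

```{julia}
using LinearAlgebra

A = diagm([4, 3, 2])
F = eigen(A)
F.values    # [2.0, 3.0, 4.0]: the diagonal entries, sorted ascending
F.vectors   # standard basis vectors, up to ordering
```
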
## 2. Workflow

### 2.1 Load packages

```{julia}
import Plots: text
using DataFrames
using LinearAlgebra
using Plots
using PrettyTables
```

### 2.2 Documents

```{julia}
documents = (:d1 => "Romeo and Juliet",
    :d2 => "Juliet: O happy dagger!",
    :d3 => "Romeo died by dagger",
    :d4 => "“Live free or die”, that’s the New-Hampshire’s motto.",
    :d5 => "Did you know, New-Hampshire is in New-England")
querystring = ["dies", "dagger"]
```

### 2.3 Tokenize

Tokenize the documents to obtain the term-document matrix (see the reference linked above): each row is a term, each column a document, and entry (i, j) is 1 when term i occurs in document j.

```{julia}
terms = marker = ["romeo", "juliet", "happy", "dagger", "live", "die", "free", "new-hampshire"]
Mat = [1 0 1 0 0;   # romeo
       1 1 0 0 0;   # juliet
       0 1 0 0 0;   # happy
       0 1 1 0 0;   # dagger
       0 0 0 1 0;   # live
       0 0 1 1 0;   # die
       0 0 0 1 0;   # free
       0 0 0 1 1]   # new-hampshire
df = DataFrame(terms=terms, d1=Mat[:,1], d2=Mat[:,2], d3=Mat[:,3], d4=Mat[:,4], d5=Mat[:,5])
@pt df
```

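As an optional cross-check (an illustrative sketch; it assumes plain lowercase substring matching is a good enough stand-in for the tokenizer), the same matrix can be derived from `documents` directly:

```{julia}
docs_text = lowercase.(last.(collect(documents)))

# 1 if the term occurs as a substring of the document, else 0
Mat2 = [occursin(t, d) ? 1 : 0 for t in terms, d in docs_text]
Mat2 == Mat   # should reproduce the hand-built matrix
```
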
### 2.4 SVD

Run `svd(Mat)`, then keep the first 2 components:

```{julia}
U, Σ, V = svd(Mat)
show(:Σ => Σ)
k = 2
U₂ = U[:, 1:k]
Σ₂ = diagm(Σ[1:k])
tV₂ = V[:, 1:k]'
terms = U₂ * Σ₂    # each row is the 2-D vector of a term
display(terms)
doc = Σ₂ * tV₂     # each column is the 2-D vector of a document
```

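How much of `Mat` do two components retain? A small sketch reusing the factors above (`Mat_k` is a name introduced here):

```{julia}
Mat_k = U₂ * Σ₂ * tV₂           # rank-2 reconstruction of Mat
norm(Mat - Mat_k) / norm(Mat)   # relative Frobenius error
```
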
### 2.5 Define an annotation helper

```{julia}
offset = 0.2   # offset for annotation text

"""
    anno(str; x, y, xoffset=0, yoffset=0.1)

Annotate a point with text.

Parameters:
- `str`: the annotation text
- `x`, `y`: coordinates of the annotated point
- `xoffset`, `yoffset`: offsets so the label does not overlap the data point
"""
function anno(str; x, y, xoffset=0, yoffset=0.1)
    return (x + xoffset, y + yoffset,
        text(str, pointsize=6, color=:blue, halign=:center, valign=:center, rotation=0))
end
```

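A quick usage check with made-up coordinates:

```{julia}
anno("example"; x=0.5, y=0.25)   # an (x, y, text) tuple ready to pass to `ann=`
```
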
### 2.6 Plot the results

```{julia}
text_arr = []
doc_arr = []
for i in 1:8
    txt = anno(marker[i]; x=terms[i, 1], y=terms[i, 2])
    push!(text_arr, txt)
end
scatter(terms[:, 1], terms[:, 2], ann=text_arr, label="terms", frame=:box, size=(600, 400))
for i in 1:5
    t = doc[:, i]
    txt = anno("d-$i"; x=t[1], y=t[2])
    push!(doc_arr, txt)
end
scatter!(doc[1, :], doc[2, :], ann=doc_arr, label="docs")
scatter!([0], [0], label="origin")
```

### 2.7 Query results

The query words correspond to rows 4 (`dagger`) and 6 (`die`) of `terms`; the query document is placed at the mean of those two term vectors:

```{julia}
query_coord = (terms[4, :] + terms[6, :]) / 2
query_ann = anno("$querystring"; x=query_coord[1], y=query_coord[2])
scatter!([query_coord[1]], [query_coord[2]], ann=query_ann, label="query doc")
```
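
To make the grouping explicit, the query can be compared with each document by cosine similarity in the 2-D latent space (a sketch; `cossim` is a helper introduced here):

```{julia}
cossim(a, b) = dot(a, b) / (norm(a) * norm(b))

# similarity of the query to each document vector (the columns of `doc`)
[(Symbol("d$i"), cossim(query_coord, doc[:, i])) for i in 1:5]
```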