add note
latent  text analysis
math4mad committed Nov 1, 2023
1 parent 9aa9d52 commit aad4ca2
Showing 5 changed files with 193 additions and 0 deletions.


@@ -0,0 +1,11 @@
{
"hash": "d41cd0b55e16578aa27784e4e5b0135f",
"result": {
"markdown": "---\ntitle: \"1-text-analysis\"\nauthor: math4mad\ncode-fold: true\n---\n\n::: {.cell execution_count=1}\n``` {.julia .cell-code}\nusing MLJ\nimport TextAnalysis\nTfidfTransformer = @load TfidfTransformer pkg=MLJText\ndocs=[\"Romeo and Juliet\",\n \"Juliet: O happy dagger!\",\n \"Romeo died by dagger\",\n \"“Live free or die”, that’s the New-Hampshire’s motto.\",\n \"Did you know, New-Hampshire is in New-England\"]\ntfidf_transformer = TfidfTransformer()\ntokenized_docs = TextAnalysis.tokenize.(docs)\nmach = machine(tfidf_transformer, tokenized_docs)\nfit!(mach)\n\nfitted_params(mach)\n\ntfidf_mat = transform(mach, tokenized_docs)|>Matrix\nvcat(tokenized_docs...)|>Set\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nimport MLJText ✔\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\n[ Info: For silent loading, specify `verbosity=0`. \n[ Info: Training machine(TfidfTransformer(max_doc_freq = 1.0, …), …).\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=22}\n```\nSet{String} with 30 elements:\n \"dagger\"\n \"!\"\n \"is\"\n \"Juliet\"\n \"and\"\n \"O\"\n \"happy\"\n \"by\"\n \"Live\"\n \"free\"\n \",\"\n \"or\"\n \"that\"\n \"Romeo\"\n \"’\"\n \"motto\"\n \"New-England\"\n \"in\"\n \"s\"\n \".\"\n \"died\"\n \":\"\n \"you\"\n \"the\"\n \"Did\"\n ⋮ \n```\n:::\n:::\n\n\n",
"supporting": [
"1-text-analysis_files"
],
"filters": [],
"includes": {}
}
}


25 changes: 25 additions & 0 deletions category/textanalysis/1-text-analysis.qmd
@@ -0,0 +1,25 @@
---
title: "1-text-analysis"
author: math4mad
code-fold: true
---

```{julia}
using MLJ
import TextAnalysis

# Load the TF-IDF transformer model type from MLJText.
TfidfTransformer = @load TfidfTransformer pkg=MLJText

docs = ["Romeo and Juliet",
        "Juliet: O happy dagger!",
        "Romeo died by dagger",
        "“Live free or die”, that’s the New-Hampshire’s motto.",
        "Did you know, New-Hampshire is in New-England"]

tfidf_transformer = TfidfTransformer()
# Split each document into tokens.
tokenized_docs = TextAnalysis.tokenize.(docs)
mach = machine(tfidf_transformer, tokenized_docs)
fit!(mach)
fitted_params(mach)
# Dense TF-IDF matrix for the tokenized documents.
tfidf_mat = transform(mach, tokenized_docs) |> Matrix
# Set of distinct tokens across all documents.
vcat(tokenized_docs...) |> Set
```
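
For intuition about what a TF-IDF transform computes, here is a minimal hand-rolled sketch. It assumes the textbook weighting `tf * log(N / df)`, which is not necessarily the exact formula `MLJText` applies, and the toy corpus and the names `toy_docs`, `vocab`, `df`, `idf`, `tf` and `tfidf` are all hypothetical.

```julia
# Hypothetical toy corpus, already tokenized and lowercased.
toy_docs = [["romeo", "and", "juliet"],
            ["juliet", "o", "happy", "dagger"],
            ["romeo", "died", "by", "dagger"]]

vocab = sort(unique(vcat(toy_docs...)))
n_docs = length(toy_docs)

# Document frequency: number of documents that contain each term.
df = [count(d -> t in d, toy_docs) for t in vocab]

# Inverse document frequency (assumed textbook form, no smoothing).
idf = log.(n_docs ./ df)

# Term frequency: relative count of each term within each document.
tf = [count(==(t), d) / length(d) for d in toy_docs, t in vocab]

# Rows are documents, columns are vocabulary terms.
tfidf = tf .* idf'
```

Terms that occur in fewer documents get a larger `idf`, so they weigh more in the rows of `tfidf` where they appear.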
127 changes: 127 additions & 0 deletions category/textanalysis/2-latent-semantic-analysis.qmd
@@ -0,0 +1,127 @@
---
title: "2-Latent Semantic Analysis"
author: math4mad
code-fold: true
---


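:::{.callout-note title="Introduction"}
Reference: [Latent Semantic Analysis](https://www.engr.uvic.ca/~seng474/svd.pdf)
This tutorial works with several texts that come from only two sources: Shakespeare's Romeo and Juliet, and an introductory document about geographic locations. The problem is whether mathematical methods can recover these two distinct groups.
A query document is then assigned to the matching group using a distance measure.
:::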

## 1. Eigenvalues and Eigenvectors


$$\begin{bmatrix}
4&0 &0 \\
0&3 &0 \\
0&0 &2
\end{bmatrix}$$
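
As a small illustration, the diagonal matrix above has its eigenvalues on the diagonal (4, 3 and 2), with the standard basis vectors as eigenvectors. The sketch below checks the defining relation $Av = \lambda v$ numerically; the names `A`, `E` and `residuals` are chosen here for illustration only.

```julia
using LinearAlgebra

A = [4 0 0;
     0 3 0;
     0 0 2]

# E.values holds the eigenvalues, E.vectors the eigenvectors as columns.
E = eigen(A)

# Residual of A*v = λ*v for every eigenpair; each should be close to zero.
residuals = [norm(A * E.vectors[:, i] - E.values[i] * E.vectors[:, i])
             for i in eachindex(E.values)]
```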

## 2. Workflow

### 2.1 Load packages
```{julia}
import Plots: text
using DataFrames
using LinearAlgebra
using Plots
using PrettyTables
```

### 2.2 Documents
```{julia}
documents=(:d1=>"Romeo and Juliet",
           :d2=>"Juliet: O happy dagger!",
           :d3=>"Romeo died by dagger",
           :d4=>"“Live free or die”, that’s the New-Hampshire’s motto.",
           :d5=>"Did you know, New-Hampshire is in New-England")
querystring=["dies", "dagger"]
```
### 2.3 Tokenize
Tokenize the texts to obtain the term-document matrix (rows are terms, columns are documents d1 to d5); see the reference linked above.
```{julia}
terms=marker=["romeo", "juliet", "happy", "dagger", "live", "die", "free", "new-hampshire"]
Mat=[1 0 1 0 0;
     1 1 0 0 0;
     0 1 0 0 0;
     0 1 1 0 0;
     0 0 0 1 0;
     0 0 1 1 0;
     0 0 0 1 0;
     0 0 0 1 1]
df=DataFrame(terms=terms,d1=Mat[:,1],d2=Mat[:,2],d3=Mat[:,3],d4=Mat[:,4],d5=Mat[:,5])
@pt df
```

### 2.4 SVD
Compute `svd(Mat)` and keep only the first 2 components.
```{julia}
U,Σ,V=svd(Mat)
show(:Σ=>Σ)
k=2
U₂=U[:,1:k]
Σ₂=diagm(Σ[1:k])
tV₂=V[:,1:k]'
terms= U₂*Σ₂ # each row is the vector of one term
display(terms)
doc=Σ₂*tV₂ # each column is the vector of one document
```
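
Keeping the first `k` components amounts to a rank-`k` approximation of the term-document matrix. The sketch below is self-contained and repeats the matrix from section 2.3. It checks the standard property that the Frobenius error of the truncation equals the root of the sum of the squared discarded singular values. The names `A`, `F`, `A_k`, `err` and `expected` are illustrative.

```julia
using LinearAlgebra

# Term-document matrix from section 2.3 (rows: terms, columns: d1..d5).
A = [1 0 1 0 0;
     1 1 0 0 0;
     0 1 0 0 0;
     0 1 1 0 0;
     0 0 0 1 0;
     0 0 1 1 0;
     0 0 0 1 0;
     0 0 0 1 1]

F = svd(A)
k = 2

# Rank-k reconstruction from the leading k singular triplets.
A_k = F.U[:, 1:k] * Diagonal(F.S[1:k]) * F.Vt[1:k, :]

# Frobenius norm of the approximation error.
err = norm(A - A_k)

# Root of the sum of the squared discarded singular values.
expected = sqrt(sum(abs2, F.S[k+1:end]))
```

Comparing `err` with `norm(A)` gives a rough sense of how much of the matrix the two retained components capture.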

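### 2.5 Define the annotation helper
```{julia}
offset=0.2 # offset for the added text
"""
    anno(str; x, y, xoffset=0, yoffset=0.1)

Build a text annotation for a plot.

Parameters:
- `str`: annotation text
- `x`, `y`: coordinates of the annotated point
- `xoffset`, `yoffset`: offsets that keep the label from overlapping the data point; by default the label is shifted along the y axis
"""
function anno(str;x,y,xoffset=0,yoffset=0.1)
    return (x+xoffset,y+yoffset,
            text(str, pointsize=6, color=:blue, halign=:center, valign=:center, rotation=0))
end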
```


### 2.6 Plot the result
```{julia}
text_arr=[]
doc_arr=[]
# Label every term point with its term string.
for i in 1:8
    txt=anno(marker[i];x=terms[i,1],y=terms[i,2])
    push!(text_arr,txt)
end
scatter(terms[:,1],terms[:,2],ann=text_arr,label="terms",frame=:box,size=(600,400))
# Label every document point as d-1 ... d-5.
for i in 1:5
    t=doc[:,i]
    txt=anno("d-$i";x=t[1],y=t[2])
    push!(doc_arr,txt)
end
scatter!(doc[1,:],doc[2,:],ann=doc_arr,label="docs")
scatter!([0],[0],label="origin")
```

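### 2.7 Query result
The query words `dies` and `dagger` map to the terms `die` and `dagger`, which are rows 6 and 4 of `terms`. The query document is placed at the mean of those two coordinates.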
```{julia}
query_coord=(terms[4,:]+terms[6,:])/2
query_ann=anno("$querystring";x=query_coord[1],y=query_coord[2])
scatter!([query_coord[1]],[query_coord[2]], ann=query_ann,label="query doc")
```
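
The introduction mentions assigning the query to a group with a distance measure. One common choice is cosine similarity between the query vector and each document vector in the reduced space. The sketch below assumes that measure, since the notebook itself only plots the query point. It is self-contained, and the names `term_vecs`, `doc_vecs`, `cosine`, `sims` and `ranking` are hypothetical.

```julia
using LinearAlgebra

# Term-document matrix from section 2.3 (rows: terms, columns: d1..d5).
A = [1 0 1 0 0;
     1 1 0 0 0;
     0 1 0 0 0;
     0 1 1 0 0;
     0 0 0 1 0;
     0 0 1 1 0;
     0 0 0 1 0;
     0 0 0 1 1]

F = svd(A)
k = 2
term_vecs = F.U[:, 1:k] * Diagonal(F.S[1:k])    # one row per term
doc_vecs  = Diagonal(F.S[1:k]) * F.Vt[1:k, :]   # one column per document

# Query = mean of the "dagger" (row 4) and "die" (row 6) term vectors.
q = (term_vecs[4, :] + term_vecs[6, :]) / 2

# Hypothetical helper: cosine similarity between two vectors.
cosine(a, b) = dot(a, b) / (norm(a) * norm(b))

# Similarity of the query to every document, then indices sorted best-first.
sims = [cosine(q, doc_vecs[:, j]) for j in 1:size(doc_vecs, 2)]
ranking = sortperm(sims, rev=true)
```

Cosine similarity depends only on the angle between vectors, so the sign that `svd` happens to choose for each component does not change the ranking.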








