The jac_misc
package contains a collection of miscellaneous models that can be used to perform various tasks such as translation, pdf extraction, personalized head etc. following is a list of all the models available in the jac_misc
package.
Module cluster
implemented for clustering text document into similar clusters. This is a example program to cluster documents with jaseci cluster
module. We will use input as list of raw text documents and will produce cluster labels for each text documents.
-
get_umap
: Reducing the dimension of data while preserving the relationship between data points to identify clusters easier.-
Input
text_embeddings
(list): list of text embeddings should pass here.n_neighbors
(int): number of neighbors to consider.min_dist
(float): minimum distance between clusters.n_components
(int): the dimensionality of the reduced data.random_state
(int): pre-producability of the algorithm.
-
Returns: multidimensional array of reduced features.
-
-
get_cluster_labels
: To get list of possible cluster labels-
Input
embeddings
(list): This accept list of embedded text features.algorithm
(String): Algorithm for clustering.min_samples
(int): The minimum number of data points in a cluster is represented here. Increasing this will reduces number of clustersmin_cluster_size
(int): This represents how conservative you want your clustering should be. Larger values more data points will be considered as noise
-
Returns: list of labels.
-
Input data file text_cluster.json
[
"still waiting card",
"countries supporting",
"card still arrived weeks",
"countries accounts support",
"provide support countries",
"waiting week card still coming",
"track card process delivery",
"countries getting support",
"know get card lost",
"send new card",
"still received new card",
"info card delivery",
"new card still come",
"way track delivery card",
"countries currently support"]
walker text_cluster_example{
can file.load_json;
can use.encode;
can cluster.get_umap;
can cluster.get_cluster_labels;
has text, encode, final_features, labels;
text = file.load_json("text_cluster.json");
encode = use.encode(text);
final_features = cluster.get_umap(encode,2);
labels = cluster.get_cluster_labels(final_features,"hbdscan",2,2);
std.out(labels);
}
For a complete example visit here
Module translator
uses the mbart-large-50-many-to-many
to perform multilingual translation. It can translate from 50 languages to 50 languages.
translator
: Moduletranslator
uses thembart-large-50-many-to-many
to perform multilingual translation. It can translate from 50 languages to 50 languages.- Alternate name:
- Input:
text
- Text to be translated. Type:Union[List[str], str]
src_lang
- Source language of the text. Type:str
tgt_lang
- Target language of the text. Type:str
- Return: Return type of the action is
List[str]
.
Example JAC Code to translate text from Hindi to English:
walker test_translate_hindi_eng {
can translator.translate;
report translator.translate("नमस्ते, आप कैसे हैं?", "hi_IN", "en_XX"); # Returns ["Hello, how are you?"]
}
Example JAC Code to translate text from English to German:
walker test_translate_eng_german {
can translator.translate;
report translator.translate("Hello, how are you?", "en_XX", "de_DE"); # Returns ["Hallo, wie geht es dir?"]
}
For a complete example visit here
pdf_ext
module implemented for the Topical Change Detection in Documents via Embeddings of Long Sequences.
extract_pdf
: gets different topics in the context provided, given a threshold- Input
url
(String): gets the pdf from URLpath
(Float): gets the pdf from local pathmetadata
(Bool): to display available metadata of PDF
- Returns: a json with number of pages the pdf had and content
- Input
walker pdf_ext_example {
has url = "http://www.africau.edu/images/default/sample.pdf";
has metadata = true;
can pdf_ext.extract_pdf;
# Getting the dat from PDF
resp_data = pdf_ext.extract_pdf(url=url,
metadata=metadata);
std.out(resp_data);
}
For a complete example visit here