mrikitoku/glintlda

 
 

Glint LDA

Scalable Distributed LDA implementation for Spark & Glint

This implementation is based on LightLDA.

Usage

Make sure Glint is running. For a simple localhost test with two servers, you can start Glint locally as follows:

sbt "run master"
sbt "run server"
sbt "run server"

Next, load a dataset into Spark as an RDD:

// Preprocessing of data ...
// End result should be an RDD of breeze sparse vectors that represent bag-of-words term frequency vectors
val rdd = sc.textFile(...).map(x => SparseVector[Int](...))
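The preprocessing step is left open above. As a minimal sketch of one way to produce bag-of-words counts, the helper below turns a tokenized document into sorted term indices and their frequencies. `BagOfWords`, `termFrequencies`, and the `vocabulary` map are hypothetical names for illustration and are not part of Glint LDA. It uses only the Scala standard library; the two arrays it returns are in the index/data layout that Breeze's `SparseVector[Int]` constructor is assumed to accept, together with the vocabulary size as the vector length.

```scala
// Hypothetical helper, not part of Glint LDA: converts one tokenized document
// into parallel arrays of term indices and term counts.
object BagOfWords {
  def termFrequencies(tokens: Seq[String],
                      vocabulary: Map[String, Int]): (Array[Int], Array[Int]) = {
    val counts = tokens
      .flatMap(token => vocabulary.get(token)) // drop out-of-vocabulary tokens
      .groupBy(identity)                       // termId -> all occurrences
      .map { case (termId, hits) => (termId, hits.size) }
      .toArray
      .sortBy(_._1)                            // sparse indices in increasing order
    (counts.map(_._1), counts.map(_._2))
  }
}
```

For example, with the vocabulary `Map("topic" -> 0, "model" -> 1, "spark" -> 2)`, the tokens `Seq("spark", "topic", "spark", "unknown")` give indices `Array(0, 2)` and counts `Array(1, 2)`; the unknown token is dropped.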

Construct the Glint client, which acts as an interface to the running parameter servers:

import com.typesafe.config.ConfigFactory
import glint.Client

// Open a Glint client using the configuration file at the path held in configFile
val gc = Client(ConfigFactory.parseFile(new java.io.File(configFile)))

Set the LDA parameters and call the fitMetropolisHastings function to run the LDA algorithm:

// LDA topic model with 100,000 terms and 100 topics
val ldaConfig = new LDAConfig()
ldaConfig.setα(0.5)
ldaConfig.setβ(0.01)
ldaConfig.setTopics(100)
ldaConfig.setVocabularyTerms(100000)
val model = Solver.fitMetropolisHastings(sc, gc, rdd, ldaConfig, 100)
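For intuition about what α, β, the topic count, and the vocabulary size control: in standard collapsed LDA samplers, the unnormalized probability of assigning a token of word w in document d to topic k is (n_dk + α)(n_kw + β) / (n_k + Vβ), where n_dk, n_kw, and n_k are the current assignment counts and V is the vocabulary size. LightLDA-style samplers use Metropolis-Hastings proposals to draw from this distribution efficiently. The sketch below only evaluates that weight. `LdaWeights` and `topicWeight` are hypothetical names for illustration and are not part of the Glint LDA API.

```scala
// Hypothetical illustration, not Glint LDA code: evaluates the standard
// collapsed-LDA weight for assigning one token to a given topic.
object LdaWeights {
  def topicWeight(docTopicCount: Int,   // n_dk: tokens in this document assigned to topic k
                  topicWordCount: Int,  // n_kw: occurrences of this word assigned to topic k
                  topicTotal: Int,      // n_k: all tokens assigned to topic k
                  alpha: Double,        // document-topic smoothing (α)
                  beta: Double,         // topic-word smoothing (β)
                  vocabularySize: Int): Double =
    (docTopicCount + alpha) * (topicWordCount + beta) /
      (topicTotal + vocabularySize * beta)
}
```

With the settings used above (α = 0.5, β = 0.01, 100,000 vocabulary terms), a topic with no assignments yet still gets a small nonzero weight, which is what lets the sampler move tokens into topics that are currently empty.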
