Add Machine Learning #3

Open
lefnire opened this issue Mar 26, 2016 · 0 comments


lefnire commented Mar 26, 2016

The current scoring algorithm is basic addition. It takes all tags associated with a job (location, company, commitment, skills, source) and adds 1 to the score for each. Your matches are just order by aggregate_score desc. This works fine, but I'm thinking content-based filtering would perform better. Which similarity model should we use? Jaccard, if we store jobs as sparse has-tag/no-tag vectors and users as likes-tag/dislikes-tag vectors. Maybe cosine or Pearson, if we store the strength of each tag's representation. Alternatively, explore linear regression with online stochastic gradient descent.
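To make the Jaccard vs. cosine choice concrete, here's a minimal sketch in plain Python. It is illustrative only: the function names, the tag strings, and the dict-of-weights representation are assumptions, not anything in the codebase, and a real version would run against Postgres rather than in memory.

```python
import math


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two tag sets: |A & B| / |A | B|."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity of two sparse tag-weight vectors (tag -> strength)."""
    dot = sum(weight * v.get(tag, 0.0) for tag, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector has no direction to compare
    return dot / (norm_u * norm_v)


# Hypothetical data: a job as has-tag/no-tag, a user as liked tags.
job_tags = {"python", "postgres", "remote"}
user_likes = {"python", "remote", "javascript"}
print(jaccard(job_tags, user_likes))

# Hypothetical data: the same idea with per-tag strengths.
job_weights = {"python": 1.0, "postgres": 0.5}
user_weights = {"python": 0.8, "javascript": 0.6}
print(cosine(job_weights, user_weights))
```

Jaccard only needs set membership, so it fits the has-tag/no-tag storage; cosine needs a weight per tag, so it only pays off if we actually store tag strength.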

Sample case where it could perform better: an employer's candidate matches get topped by spammers ("I know everything under the sun - I just want a job") rather than by a candidate whose skills overlap the specific ones the job asks for. A length-normalized measure (e.g. cosine similarity) should penalize the spammer's padded profile.
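A toy illustration of the spammer case, assuming binary tag sets. The tag names, the two candidates, and both scoring functions are made up for the example; `additive_score` is only a stand-in for the current +1-per-tag scoring, not the actual implementation.

```python
import math


def additive_score(job: set[str], candidate: set[str]) -> int:
    """Stand-in for the current scoring: +1 per tag shared with the job."""
    return len(job & candidate)


def binary_cosine(job: set[str], candidate: set[str]) -> float:
    """Cosine similarity for has-tag/no-tag vectors: |A & B| / sqrt(|A| * |B|)."""
    if not job or not candidate:
        return 0.0
    return len(job & candidate) / math.sqrt(len(job) * len(candidate))


job = {"python", "postgres", "remote"}

# Hypothetical candidates: one focused profile, one that claims everything.
specialist = {"python", "postgres"}
spammer = job | {f"skill_{i}" for i in range(17)}  # 20 tags in total

for name, candidate in [("specialist", specialist), ("spammer", spammer)]:
    print(name, additive_score(job, candidate), round(binary_cosine(job, candidate), 3))
```

Under addition the spammer wins because every extra tag can only help. Under cosine the profile's size sits in the denominator, so the focused profile ranks higher in this toy case.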

Sample case where it might perform worse: a user is interested in various locations. Might the location tags confuse the similarity algorithm? It might stay OK if we keep locations as simple string tags rather than location-aware tags #1

We don't want collaborative filtering, because that assumes latent features to be learned for the jobs/users from similar users' preferences (right?). We actually have the job features, scraped from the content or provided during custom job creation. In other words, this should act more like Pandora than Netflix.

I've looked into a couple of technologies for this. Since we're using Postgres, and have more jobs/features than fit into server memory, we don't want in-memory, file-oriented libraries like scikit-learn. Instead we want something SQL-compatible and scalable. I'm especially looking at MADlib and Spark SQL + MLlib. The latter combo is more popular than the former; however, for a task as simple as linear regression, and since we're already using Postgres (just CREATE EXTENSION ....), it seems MADlib would be the simpler solution. Thoughts?

Link dump (will parse later): 1 2 3 4 5 6
