The current scoring algorithm is basic addition: it takes all tags associated with a job (location, company, commitment, skills, source) and adds `+1` for each. Your matches are just `order by aggregate_score desc`. This works fine, but I'm thinking content-based filtering would perform better. Which similarity model should we use? Jaccard, if we store jobs as sparse has-tag/no-tag vectors and users as likes-tag/dislikes-tag vectors. Maybe cosine or Pearson, if we store the strength of tag representation. Alternatively, explore linear regression with online stochastic gradient descent.
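To make the comparison concrete, here is a minimal sketch of the two simplest candidates. The tag names and weights are made up for illustration, and treating a user as a plain set of liked tags (ignoring dislikes) is an assumption, not how the app stores things today.

```python
import math

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B| for has-tag/no-tag sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cosine(u: dict, v: dict) -> float:
    """Cosine similarity for sparse tag -> weight vectors."""
    dot = sum(w * v[tag] for tag, w in u.items() if tag in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical data: binary tags for Jaccard.
job_tags = {"python", "postgres", "remote"}
user_likes = {"python", "postgres", "nyc"}
jaccard_score = jaccard(job_tags, user_likes)  # 2 shared / 4 total

# Hypothetical data: weighted tags for cosine (weight = strength of tag).
job_weights = {"python": 1.0, "postgres": 1.0}
user_weights = {"python": 3.0, "postgres": 4.0}
cosine_score = cosine(job_weights, user_weights)
```

Jaccard only needs set membership, so it fits the has-tag/no-tag storage; cosine needs the per-tag weights, which is the extra thing we'd have to start storing.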
Sample case where it could perform better: an employer's candidate matches get topped by spammers ("I know everything under the sun - I just want a job") rather than by candidates whose skills overlap the specific ones the employer wants. A length-normalized measure (e.g. cosine similarity) should penalize the spammers.
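A rough sketch of that spammer case, using made-up tags. `additive_score` is my reading of the current `+1` scheme (count of overlapping tags), and `cosine_binary` is cosine similarity specialized to has-tag/no-tag sets; both names are hypothetical.

```python
import math

def additive_score(job: set, candidate: set) -> int:
    """Assumed model of the current scheme: +1 per overlapping tag."""
    return len(job & candidate)

def cosine_binary(job: set, candidate: set) -> float:
    """Cosine similarity for binary tag sets: overlap / sqrt(|A| * |B|)."""
    if not job or not candidate:
        return 0.0
    return len(job & candidate) / math.sqrt(len(job) * len(candidate))

# Hypothetical job and candidates.
job = {"python", "postgres", "spark"}
specialist = {"python", "postgres", "sql"}            # 3 tags, 2 match
spammer = job | {f"skill{i}" for i in range(47)}      # 50 tags, 3 match

additive = {
    "specialist": additive_score(job, specialist),
    "spammer": additive_score(job, spammer),
}
normalized = {
    "specialist": cosine_binary(job, specialist),
    "spammer": cosine_binary(job, spammer),
}
```

With addition the spammer wins on raw overlap; with cosine the 50-tag profile is divided down by its own length and the specialist ranks first.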
Sample case where it might perform worse: a user is interested in several locations. Might the location tags confuse the similarity algorithm? It might stay OK if we keep locations as simple string tags rather than location-aware tags (#1).
We don't want collaborative filtering, because that assumes latent features to be learned for the jobs/users from similar users' preferences (right?). We actually have the job features, scraped from the content or provided during custom job creation. In other words, this should act more like Pandora than Netflix.
I've looked into a couple of technologies for this. Since we're using Postgres, and have more jobs/features than fit into server memory, we don't want file-and-memory-oriented libraries like scikit-learn. Instead we want something SQL-compatible and scalable. I'm especially looking at MADlib and Spark SQL + MLlib. The latter combo is more popular than the former; however, for a task as simple as linear regression, and since we're already using Postgres (just `CREATE EXTENSION ....`), MADlib seems like the simpler solution. Thoughts?
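For a sense of how small the linear regression option is, here is a plain-Python sketch of online SGD with squared-error loss over sparse tag features. This is an in-memory illustration only, not MADlib's or MLlib's API; the function names, the learning rate, and the 1.0 = liked / 0.0 = disliked targets are all assumptions.

```python
def predict(weights: dict, features: dict) -> float:
    """Linear score: sum of weight * value over the tags present."""
    return sum(weights.get(tag, 0.0) * x for tag, x in features.items())

def sgd_update(weights: dict, features: dict, target: float, lr: float = 0.1) -> None:
    """One online step on a single (job, feedback) event, squared-error loss."""
    error = predict(weights, features) - target
    for tag, x in features.items():
        weights[tag] = weights.get(tag, 0.0) - lr * error * x

# Hypothetical feedback stream for one user: liked a Python job, disliked a PHP job.
liked_job = {"python": 1.0, "remote": 1.0}
disliked_job = {"php": 1.0, "remote": 1.0}
events = [(liked_job, 1.0), (disliked_job, 0.0)] * 200

user_weights = {}
for features, target in events:
    sgd_update(user_weights, features, target)

liked_pred = predict(user_weights, liked_job)
disliked_pred = predict(user_weights, disliked_job)
```

Each event touches only the tags on that job, so the update cost doesn't depend on the total number of tags; whether that's worth running in-database versus in-app is the real question.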
Link dump (will parse later): 1 2 3 4 5 6