Statement sampling algorithm #25

Open
Tracked by #18
markwhiting opened this issue Apr 21, 2023 · 2 comments
@markwhiting
Member

markwhiting commented Apr 21, 2023

We want to build an algorithm that can quickly choose the next statement for someone to rate. There are a few things this algorithm could prioritize, so we should consider which ones we care about most. There are also a few different levels of algorithm, which imply dramatically different computational loads, so that should also be a consideration.

Possible optimization considerations:

  1. maximize knowledge about each respondent
  2. maximize knowledge about each statement
  3. maximize knowledge about types of respondent
  4. maximize knowledge about types of statement
  5. maximize coverage over all statements
  6. maximize predictive accuracy on new ratings (i.e., collect data about the statements we believe we have the least predictive accuracy on)
  7. maximize predictive accuracy over statements worth predicting (i.e., statements that are gibberish should be algorithmically avoided).

And some of the levels of algorithm might be:

  1. random or block random
  2. optimizing coverage over the statements (i.e., statements are randomly chosen with a weighting that is inverse to the number of times they have already been rated)
  3. optimizing variance within or across people or statements (this can be extended by blocking on statement or person data and other things of course)
  4. training a model and using it or its behavior to inform task selection
  5. using something like a Bayesian optimization approach to choose the next statement (this is probably the most expensive)
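
To make level 3 concrete, here is a minimal sketch of one possible reading of it: pick the statement whose ratings so far disagree the most. The function name `pick_highest_variance`, the input shape (statement id mapped to a list of numeric ratings), and the choice to treat statements with fewer than two ratings as maximally uncertain are all assumptions for illustration, not something decided in this thread.

```python
import math
from statistics import pvariance
from typing import Dict, List


def pick_highest_variance(ratings: Dict[int, List[float]]) -> int:
    """Return the id of the statement whose ratings vary the most.

    Assumption: a statement with fewer than two ratings has unknown
    variance, so it is scored as infinitely uncertain and served first.
    Ties are broken in favor of the lowest statement id.
    """
    def score(statement_id: int) -> float:
        values = ratings[statement_id]
        if len(values) < 2:
            return math.inf
        return pvariance(values)

    # max() keeps the first maximal element, so sorting the ids first
    # makes the tie-break deterministic.
    return max(sorted(ratings), key=score)
```

Blocking on respondent or statement attributes could be layered on top by calling this on a filtered subset of `ratings`.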
@markwhiting
Member Author

I think we should have a simple way to switch which approach we use, and we should probably implement at least a purely random, a weighted random, and a simple model-based version.
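
As a rough sketch of what such a switch could look like, the snippet below keeps each approach behind the same function signature and selects one by name. Everything here is hypothetical: the names `SAMPLERS` and `next_statement`, the strategy keys, and the input shape (statement id mapped to how many times it has been rated) are assumptions, and a model-based version is not implemented, only noted as something that could be registered the same way.

```python
import random
from typing import Callable, Dict, Optional

# Hypothetical input shape: statement id -> number of ratings so far.
Counts = Dict[int, int]
Sampler = Callable[[Counts, random.Random], int]


def pure_random(counts: Counts, rng: random.Random) -> int:
    """Every statement is equally likely."""
    return rng.choice(sorted(counts))


def weighted_random(counts: Counts, rng: random.Random) -> int:
    """Weight each statement by 1 / (times rated + 1)."""
    ids = sorted(counts)
    weights = [1.0 / (counts[i] + 1) for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]


# A model-based sampler could be added here under a third key, as long
# as it accepts the same (counts, rng) arguments.
SAMPLERS: Dict[str, Sampler] = {
    "random": pure_random,
    "weighted": weighted_random,
}


def next_statement(
    counts: Counts,
    strategy: str = "weighted",
    rng: Optional[random.Random] = None,
) -> int:
    """Choose the next statement id using the named strategy."""
    if strategy not in SAMPLERS:
        raise ValueError(f"unknown strategy: {strategy!r}")
    if not counts:
        raise ValueError("no statements to sample from")
    if rng is None:
        rng = random.Random()
    return SAMPLERS[strategy](counts, rng)
```

With this shape, switching approach is a one-word change at the call site, for example `next_statement(counts, "random")`, and the strategy name could come from configuration.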

@amirrr let me know if you have thoughts on any of this.

Also @JamesPHoughton, please chime in if you have thoughts that you think would help us, or any resources you would recommend we consider.

@amirrr
Collaborator

amirrr commented May 17, 2023

For now, reverse-weighted reservoir sampling in MySQL seems like a quick way to get a statement. Each statement is weighted by the inverse of the number of ratings it already has, plus one:

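-- Weight each statement by 1 / (number of ratings + 1), so that rarely
-- rated statements are more likely to be drawn.
WITH weighted_questions AS (
    SELECT
        statements.id,
        statements.`statement`,
        1.0 / (COUNT(answers.statementId) + 1) AS weight
    FROM
        statements
    LEFT JOIN
        answers ON statements.id = answers.statementId
    GROUP BY
        statements.id
)
-- Draw an exponential key per row; the smallest key wins with probability
-- proportional to weight. 1 - RAND() lies in (0, 1], which avoids LOG(0) = NULL.
SELECT
    id,
    `statement`,
    -LOG(1 - RAND()) / weight AS priority
FROM
    weighted_questions
ORDER BY priority ASC
LIMIT 1;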

Reference: Randomly selecting rows based on weights
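
For anyone who wants to check the sampling logic outside the database, here is a small Python sketch of the same idea as the query above: give every item a key of `-log(1 - u) / weight` with `u` uniform on [0, 1), then keep the smallest key. The function name `pick_by_exponential_key` and the dictionary input are assumptions made for this illustration.

```python
import math
import random
from typing import Dict, Hashable, Optional


def pick_by_exponential_key(
    weights: Dict[Hashable, float],
    rng: random.Random,
) -> Optional[Hashable]:
    """Return one item, chosen with probability proportional to its weight.

    Each item gets an exponentially distributed key with rate equal to
    its weight; the item with the smallest key is selected. Items with
    non-positive weight are skipped. Returns None if nothing is eligible.
    """
    best_item, best_key = None, math.inf
    for item, weight in weights.items():
        if weight <= 0:
            continue
        u = rng.random()  # uniform on [0, 1), so 1 - u is never 0
        key = -math.log(1.0 - u) / weight
        if key < best_key:
            best_item, best_key = item, key
    return best_item
```

Because only the running minimum is kept, this makes a single pass over the items, which is the same property that lets the SQL version use `ORDER BY priority ASC LIMIT 1`.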
