Wrote a convenience function for getting variable names from formula #171

compumetrika · 2021-05-15T14:56:00Z

I am using patsy as a key dependency in a stats project, and I found myself needing to identify which variables are categorical after constructing a dataframe using patsy formulas.

After an attempt using regexps ("...now you have two problems..."), I read "Model specification for experts and computers" a few times, and spent a lot of time poking around in X.design_info (where y, X = dmatrices(formula, data, return_type='dataframe')). Thankfully I ended up with something much shorter and more robust than my regexp attempt.
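For readers who have not looked inside X.design_info, here is a minimal sketch of the attributes involved. The small dataset and the formula are made up for illustration; the attributes used (column_names, term_names, factor_infos) are the ones patsy attaches to the dataframe returned by dmatrices.

```python
import numpy as np
import pandas as pd
from patsy import dmatrices

# Illustrative data only: one categorical column, two numerical ones.
df = pd.DataFrame({
    "a": ["a1", "a2", "a3", "a1", "a1"],
    "x": [1, 2, 3, 4, 5],
    "z": [0.5, 0.7, 0.1, 0.9, 0.3],
    "y": [5.3, 5.2, 5.25, 5.7, 5.9],
})

y, X = dmatrices("y ~ x*z + C(a) + np.sqrt(x)", data=df, return_type="dataframe")
info = X.design_info

# One name per column of the design matrix (the same names as list(X)).
print(info.column_names)

# One name per formula term; the interaction x:z shows up here.
print(info.term_names)

# One entry per factor, with its type. Interactions are terms built from
# several factors, so they have no entry of their own.
factor_types = {factor.name(): fi.type for factor, fi in info.factor_infos.items()}
print(factor_types)
```

Terms and factors are different levels of the same description, which is one reason several attributes appear to carry overlapping information.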

I have two questions:

1. I'm still not sure if I've used the internal details of X.design_info correctly: it does what I want, but there are places where multiple things provide the same info. I'd love to have someone "in the know" look at the function and tell me if I should make a different choice. Is there a way to do this? (Counting comments, the function is ~60 lines; not counting comments, it is about 30 lines.)
2. Is there any interest in having something like this contributed back to the project? I've commented and unit tested the function already, and I'm happy to make sure the final comments/tests conform to your norms & standards. I skimmed the issues before posting, and for example it appears that issue #155, "patsy equivalent of R's all.vars", might benefit from my function (not exactly the same but perhaps close enough).


compumetrika · 2021-05-17T04:43:24Z

Here's the function, in case anyone finds it useful:

import warnings

import numpy as np
import pandas as pd
from patsy import dmatrices

# Find all numerical and categorical variables:
def get_variable_types_names(X):
    # Given a RHS dataset X produced by dmatrices:
    #    y, X = dmatrices(formula, data, return_type='dataframe')
    # extract the "formula names" for all variables,
    # e.g. "C(variable)",
    # and for each formula name, extract whether
    # it is numerical or categorical. If categorical,
    # extract all actual variable names and save them.
    #
    # Return a dict that has the structure:
    #
    # variable_types_names = {'numerical': [<list of numerical variable names>],
    #                         'categorical': {<dict of categorical names, where
    #                                         the keys are the `prime name` of
    #                                         each category, and the vals are
    #                                         the actual corresponding variable
    #                                         names from list(X)>}}

    remaining_X_names = list(X)  # eventually check which of these are not selected
    if 'Intercept' in remaining_X_names:
        remaining_X_names.remove('Intercept')
    variable_types_names = {}

    # First get all formula names:
    all_formula_evals = list(X.design_info.factor_infos)
    all_formula_names = [key.name() for key in all_formula_evals]  # human-readable strings

    # Note that this excludes the interaction variables, but not function-ified
    # variables: factor_infos is keyed by factor, and an interaction such as x:z
    # is a term built from several factors rather than a factor itself.

    # Now loop over all formula_names and extract:
    #     - type (numerical, categorical, or something else)
    #     - if categorical, the sub-categories need to be extracted as a list
    #     - otherwise if numerical, the sub-category is the same as the name
    for var_eval_key, var_name in zip(all_formula_evals, all_formula_names):

        # Extract the type; if haven't set this type up in
        # variable_types_names yet, set it up:
        vartype = X.design_info.factor_infos[var_eval_key].type
        if vartype not in variable_types_names.keys():
            if vartype == 'categorical':
                variable_types_names[vartype] = {}  # For each key need to save sub-types
            else:
                variable_types_names[vartype] = []

        # If type is 'categorical', grab all sub-types:
        if vartype == 'categorical':
            slicer = X.design_info.slice(var_name)
            temp_varnames = X.design_info.column_names[slicer]
            variable_types_names[vartype][var_name] = temp_varnames
        else:
            variable_types_names[vartype].append(var_name)
            temp_varnames = [var_name]

        # Finally, remove the variables that have been saved
        for var in temp_varnames:
            remaining_X_names.remove(var)

    # At the end of the loop, assign all remaining to ...numerical, if exists:
    for var in remaining_X_names:
        slicer = X.design_info.slice(var)
        temp_varnames = X.design_info.column_names[slicer]
        if len(temp_varnames) > 1:
            warnings.warn("There is more than 1 sub-category for variable "+str(var)+": "+
                          str(temp_varnames))
        # now add to numerical, I suppose...
        # TODO: figure out better check and etc...
        variable_types_names.setdefault('numerical', []).extend(temp_varnames)

    return variable_types_names




def test_get_variable_types_names():
    '''
    Build a test dataset and confirm that get_variable_types_names
    does what we want.
    '''

    # Set up some data to have categorical, numerical, and interaction
    data = {'a':['a1','a2','a3','a1','a1'],
            'x':[1,2,3,4,5],
            'z':[0.5,0.5,0.5,0.5,0.5],
            'y':[5.3,5.2,5.25,5.7,5.9]}
    df = pd.DataFrame(data)
    
    # Construct the y, X values:
    formula = 'y ~ x*z + C(a) + np.sqrt(x)'
    y, X = dmatrices(formula, data=df, return_type='dataframe')

    # Get the variable names:
    variable_types_names = get_variable_types_names(X)
    
    # Compare to what we *should* have:
    expected_var_types_names = {'numerical':['x','z','x:z', 'np.sqrt(x)'],
                                'categorical':{'C(a)':['C(a)[T.a2]',
                                                       'C(a)[T.a3]']}} 
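    # Now test that they are the same. Order is ignored for the numerical
    # names; the categorical dict is compared by both keys and values:
    all_equal = set(expected_var_types_names.keys()) == set(variable_types_names.keys())
    all_equal = all_equal and (set(expected_var_types_names['numerical']) ==
                               set(variable_types_names.get('numerical', [])))
    all_equal = all_equal and (expected_var_types_names['categorical'] ==
                               variable_types_names.get('categorical', {}))

    return all_equal, variable_types_names, expected_var_types_names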
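
On the question of which part of design_info to rely on, one possible alternative is to walk the terms instead of the factors, so that interactions are handled in the same loop rather than as leftovers. The following is only a sketch, not the function from this issue and not part of patsy: classify_terms is a hypothetical name, and the rule that a term counts as categorical when any of its factors is categorical is an assumption made for illustration.

```python
import numpy as np
import pandas as pd
from patsy import dmatrices


def classify_terms(design_info):
    """Group design-matrix column names by term type (hypothetical helper).

    Assumption of this sketch: a term is 'categorical' if any of its
    factors is categorical, otherwise it is 'numerical'.
    """
    result = {"numerical": [], "categorical": {}}
    for term, term_slice in design_info.term_slices.items():
        if not term.factors:  # the intercept term has no factors
            continue
        columns = design_info.column_names[term_slice]
        types = {design_info.factor_infos[f].type for f in term.factors}
        if "categorical" in types:
            result["categorical"][term.name()] = columns
        else:
            result["numerical"].extend(columns)
    return result


# Same kind of data and formula as in the test above.
df = pd.DataFrame({
    "a": ["a1", "a2", "a3", "a1", "a1"],
    "x": [1, 2, 3, 4, 5],
    "z": [0.5, 0.5, 0.5, 0.5, 0.5],
    "y": [5.3, 5.2, 5.25, 5.7, 5.9],
})
y, X = dmatrices("y ~ x*z + C(a) + np.sqrt(x)", data=df, return_type="dataframe")
names = classify_terms(X.design_info)
print(names)
```

Because term_slices already maps each term to its columns, this version needs no bookkeeping of remaining column names. Whether a mixed interaction such as C(a):x should count as categorical is a design choice that depends on the downstream use.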

compumetrika mentioned this issue May 17, 2021

patsy equivalent of R's all.vars #155

Open

RoelVerbelen mentioned this issue Feb 24, 2024

get_variables_names() in class ModelStatsmodels does not return all variables which causes errors vincentarelbundock/pymarginaleffects#91

Closed
