Feature Selection on Dirac

Device: Dirac-1

Introduction

In machine learning problems, we often have to start with a large number of features. We need a feature selection technique that can discover a relatively small subset of the most relevant features. In what follows, we present a tutorial on using QCi's technology to select a set of features by minimizing their inter-correlation. This is important because highly correlated features will, by definition, contain a large amount of redundant information. This approach can be used with any unsupervised machine learning approach, such as anomaly detection and clustering algorithms. Unlike QBoost this is a pre-processing method for other tools, rather than an entire algorithm in itself. The problem in QBoost is to build the best classifier without constraint on the number of weak classifiers. The problem here is rather "which $k$ $k$ features give the most independent information about these data?".

Importance

Within machine learning, a well known issue is the "curse of dimensionality", a space of data points with coordinates defined by different features. The dimension of this space grows with the number of features. For a human it is difficult to visualize beyond two or three dimensions, and although computers can do better by understanding high dimensional models, added dimensions still makes the problems more difficult to approach numerically. One particular problem which occurs especially when the number of dimensions exceeds the number of samples is that it is very easy to overfit. In other words, it can learn features that are actually just a statistical accident for the samples used, rather than a real underlying structure. Not every dimension is going to be equally useful in understanding the data, and some may contain effectively the same information. In a case where two features contain essentially the same information, then it only makes sense to keep one, even if the information is very important. When more than two are present, finding which to keep can become a complex optimization problem, of exactly the type our Dirac-1 device has been built to solve.

Applications

Feature selection has a number of applications in most areas where machine learning can be applied. We choose a medicare prescription dataset for the example, but many other applications could be studied as well. One specific example is image classification, since there are nearly limitless types of data that can be extracted from an image and deciding which features can be useful is a highly important task. For example, in this work, it was found that only a small fraction of statistical features of an aerial image were relevant for determining if it contained a house, making feature selection very important for successful identification. A very different application is disease prediction based on genetic markers. In this case, feature selection is extremely important because genetic information contains a huge number of features, but it is difficult to obtain a large sample size for these kinds of studies. Naively using many more features than the number of samples is likely to make learning highly prone to overfitting. Another very different application is industrial fault diagnosis. As with the other cases, data from various sensors in an industrial facility (a chemical plant for example) is plentiful. Selecting the correct features in these data to focus analysis on is therefore crucial, particularly when trying to distinguish between faults which are similar, or highly correlated.

Methodology

Let us have a dataset with $N$ $N$ samples and $d$ $d$ features, represented by a $N \times d$ $N \times d$ matrix $F$ $F$ . Moreover, let us each column of $F$ $F$ by $\bf{f}_{j}$ $f_{j}$ where $j \in {1,...,d}$ $j \in 1, ..., d$ , representing all samples for feature $j$ $j$ . We can choose a subset of features of size $d^\prime$ $d^{'}$ ( $d^\prime < d$ $d^{'} < d$ ) such that the inter-correlation of the subset is minimal. We have

\min_{\bf{x}} \sum_{i=1}^{d} \sum_{j=1}^{d} C_{ij} x_{i} x_{j}

min_{x} \sum_{i = 1}^{d} \sum_{j = 1}^{d} C_{ij} x_{i} x_{j}

where

C_{ij} = |corr(\bf{f_{i}}, \bf{f_{j}})|

C_{ij} = ∣ corr (f_{i}, f_{j}) ∣

where $corr$ $corr$ denotes a correlation function such as the Pearson correlation, and $x_{i} \in \{0, 1\}$ $x_{i} \in {0, 1}$ is a binary variable indicating inclusion or exclusion of feature $i$ $i$ . Obviously, $C_{ii} = 1$ $C_{ii} = 1$ . The above minimization problem is subject to a constraint,

\sum_{i=1}^{d} x_{i} = d^\prime

\sum_{i = 1}^{d} x_{i} = d^{'}

We can exclude the diagonal elements of $C$ $C$ as they always add up to $d^\prime$ $d^{'}$ . In the matrix form we have,

\min_{\bf{x}} \bf{{x}^{T}} (C - I) \bf{{x}}

min_{x} x^{T} (C - I) x

subject to the above constraint. Note that $I$ $I$ is a $d \times d$ $d \times d$ identity matrix. Note too that we have assumed that the reduced dimension $d^\prime$ $d^{'}$ is assumed to be given in the above approach.

Medicare Prescription Data

We implemented this approach using a publically available set of data on prescription of opioids in the United States. The dataset can be found at https://www.cms.gov/data-research/statistics-trends-and-reports/medicare-provider-utilization-payment-data/part-d-prescriber

Clean data

We start by cleaning the dataset.

In [1]:

import pandas as pd
# Input                                                                                       
INP_FILE = "Medicare_Provider_Utilization_and_Payment_Data__Part_D_Prescriber_Summary_Table_CY2014__50001-NNN__ANON.csv"
OUT_FILE  = "cleaned_medicare_data.csv"
CON_VARS = [
    "total_claim_count",
    "total_30_day_fill_count",
    "total_drug_cost",
    "total_day_supply",
    "bene_count",
    "total_claim_count_ge65",
    "total_30_day_fill_count_ge65",
    "total_drug_cost_ge65",
    "total_day_supply_ge65",
    "bene_count_ge65",
    "brand_claim_count",
    "brand_drug_cost",
    "generic_claim_count",
    "generic_drug_cost",
    "other_claim_count",
    "other_drug_cost",
    "mapd_claim_count",
    "mapd_drug_cost",
    "pdp_claim_count",
    "pdp_drug_cost",
    "lis_claim_count",
    "lis_drug_cost",
    "nonlis_claim_count",
    "nonlis_drug_cost",
    "opioid_claim_count",
    "opioid_drug_cost",
    "opioid_day_supply",
    "opioid_bene_count",
    "opioid_prescriber_rate",
    "antibiotic_claim_count",
    "antibiotic_drug_cost",
    "antibiotic_bene_count",
    "hrm_claim_count_ge65",
    "hrm_drug_cost_ge65",
    "hrm_bene_count_ge65",
    "antipsych_claim_count_ge65",
    "antipsych_drug_cost_ge65",
    "antipsych_bene_count_ge65",
    "average_age_of_beneficiaries",
    "beneficiary_age_less_65_count",
    "beneficiary_age_65_74_count",
    "beneficiary_age_75_84_count",
    "beneficiary_age_greater_84_count",
    "beneficiary_female_count",
    "beneficiary_male_count",
    "beneficiary_race_white_count",
    "beneficiary_race_black_count",
    "beneficiary_race_asian_pi_count",
    "beneficiary_race_hispanic_count",
    "beneficiary_race_nat_ind_count",
    "beneficiary_race_other_count",
    "beneficiary_nondual_count",
    "beneficiary_dual_count",
    "beneficiary_average_risk_score",
]
VALID_PROVIDER_MI = [
    "A",
    "M",
    "J",
    "L",
    "R",
    "S",
    "E",
    "D",
    "C",
    "B",
    "K",
    "P",
    "W",
    "H",
    "T",
    "G",
    "F",
    "N",
    "V",
    "I",
    "O",
    "Y",
    "Z",
    "U", 
    "Q",
    "X",
]
VALID_GEN = ["F", "M", "Other", "Unknown"]
VALID_ENTITIES = ["I", "O"]
VALID_DESC_FLAGS = ["S", "T"]
VALID_ENROLLS = ["E", "N", "O"]
# Some utilities                                                                              
def convert_to_float(x):
    try:
        return float(x)
    except:
        return None
    
def convert_to_int(x):
    try:
        return int(float(x))
    except:
        return None
# Read data                                                                                   
df = pd.read_csv(INP_FILE, on_bad_lines = "skip", low_memory=False)
    
# Clean categorical variables                                                                 
df["nppes_provider_mi"] = df["nppes_provider_mi"].fillna("Unknown")
df["nppes_provider_mi"] = df["nppes_provider_mi"].apply(
    lambda x: x if x in VALID_PROVIDER_MI else "Unknown"
)
df["nppes_credentials"] = df["nppes_credentials"].fillna("Unknown")
df["nppes_credentials"] = df["nppes_credentials"].apply(
    lambda x: str(x).replace(".", "")
)
cred_hash = {
    "MEDICAL DOCTOR": "MD",
    "NURSE PRACTITIONER": "NP",
}
df["nppes_credentials"] = df["nppes_credentials"].apply(
    lambda x: cred_hash[x] if x in cred_hash else x,
)
df["nppes_provider_gender"] = df["nppes_provider_gender"].fillna("Unknown")
df["nppes_provider_gender"] = df["nppes_provider_gender"].apply(
    lambda x: x if x in VALID_GEN else "Other",
)
df["nppes_entity_code"] = df["nppes_entity_code"].apply(
    lambda x: x if x in VALID_ENTITIES else "Unknown",
)
df["nppes_provider_zip5"] = df["nppes_provider_zip5"].fillna("Unknown")
df["nppes_provider_country"] = df["nppes_provider_country"].apply(
    lambda x: "US" if x == "US" else "Other",
)
df["description_flag"] = df["description_flag"].apply(
    lambda x: x if x in VALID_DESC_FLAGS else "Unknown",
)
df["medicare_prvdr_enroll_status"] = df["medicare_prvdr_enroll_status"].apply(
    lambda x: x if x in VALID_ENROLLS else "Unknown",
)
# Treat missing beneficiary count as it cannot be zero                                        
df["bene_count"] = df["bene_count"].apply(
    convert_to_int
).fillna(-1)
tmp_df = df.groupby(
    "specialty_description", as_index=False,
)["bene_count"].mean()
bene_count_hash = dict(
    zip(
        tmp_df["specialty_description"],
        tmp_df["bene_count"],
    )
)
df["bene_count"] = df.apply(
    lambda x: x["bene_count"] if x[
        "bene_count"
    ] > 0 else bene_count_hash[
        x["specialty_description"]
    ],
    axis=1,
)
# Treat continuous variables                                                                  
for item in CON_VARS:
    df[item] = df[item].apply(
        convert_to_float
    ).fillna(0.0)
# Filter out invalid states                                                                   
df  = df[
    ~df["nppes_provider_state"].isin(
        ["XX", "E", "N", "S"]
    )
]
# Output                                                                                      
df.to_csv(OUT_FILE, index=False)

Generate features

We then generate features. The categorical features are encoded using the average value of a few important variables in each category.

In [2]:

import pandas as pd
# Input                                                                                       
INP_FILE = "cleaned_medicare_data.csv"
OUT_FILE  = "medicare_features.csv"
# Set some parameters                                                                         
CAT_VARS = [
    "nppes_provider_mi",
    #"nppes_credentials", # This is rather messy, so ignoring it.                             
    "nppes_provider_gender",
    "nppes_entity_code",
    "nppes_provider_city",
    "nppes_provider_zip5",
    #"nppes_provider_country", # Almost all cases are US                                      
    "specialty_description",
    "medicare_prvdr_enroll_status",
    "nppes_provider_state",
]
CON_VARS = [
    "total_claim_count",
    "total_30_day_fill_count",
    "total_drug_cost",
    "total_day_supply",
    "bene_count",
    "total_claim_count_ge65",
    "total_30_day_fill_count_ge65",
    "total_drug_cost_ge65",
    "total_day_supply_ge65",
    "bene_count_ge65",
    "brand_claim_count",
    "brand_drug_cost",
    "generic_claim_count",
    "generic_drug_cost",
    "other_claim_count",
    "other_drug_cost",
    "mapd_claim_count",
    "mapd_drug_cost",
    "pdp_claim_count",
    "pdp_drug_cost",
    "lis_claim_count",
    "lis_drug_cost",
    "nonlis_claim_count",
    "nonlis_drug_cost",
    "opioid_claim_count",
    "opioid_drug_cost",
    "opioid_day_supply",
    "opioid_bene_count",
    "antibiotic_claim_count",
    "antibiotic_drug_cost",
    "antibiotic_bene_count",
    "hrm_claim_count_ge65",
    "hrm_drug_cost_ge65",
    "hrm_bene_count_ge65",
    "antipsych_claim_count_ge65",
    "antipsych_drug_cost_ge65",
    "antipsych_bene_count_ge65",
    "average_age_of_beneficiaries",
    "beneficiary_age_less_65_count",
    "beneficiary_age_65_74_count",
    "beneficiary_age_75_84_count",
    "beneficiary_age_greater_84_count",
    "beneficiary_female_count",
    "beneficiary_male_count",
    "beneficiary_race_white_count",
    "beneficiary_race_black_count",
    "beneficiary_race_asian_pi_count",
    "beneficiary_race_hispanic_count",
    "beneficiary_race_nat_ind_count",
    "beneficiary_race_other_count",
    "beneficiary_nondual_count",
    "beneficiary_dual_count",
    "beneficiary_average_risk_score",
]
# Read and clean data                                                                         
df = pd.read_csv(INP_FILE, low_memory=False)
# Embed categorical features                                                                  
embedded_cat_features = []
for item in CAT_VARS:
    tmp_df = df.groupby(item, as_index=False).agg(
        {
            "opioid_claim_count": "mean",
            "opioid_drug_cost": "mean",
            "opioid_day_supply": "mean",
            "opioid_bene_count": "mean",
            "opioid_prescriber_rate": "mean",
        }
    ).rename(
        columns={
            "opioid_claim_count": "%s_opioid_claim_count" % item,
            "opioid_drug_cost": "%s_opioid_drug_cost" % item,
            "opioid_day_supply": "%s_opioid_day_supply" % item,
            "opioid_bene_count": "%s_opioid_bene_count" % item,
            "opioid_prescriber_rate": "%s_opioid_prescriber_rate" % item,
        }
    )
    df = df.merge(tmp_df, how="left", on=item)
    embedded_cat_features += [
        "%s_opioid_claim_count" % item,
        "%s_opioid_drug_cost" % item,
        "%s_opioid_day_supply" % item,
        "%s_opioid_bene_count" % item,
        "%s_opioid_prescriber_rate" % item,
    ]
# Drop unembedded categorical variables and some others                                       
df = df[["npi"] + CON_VARS + embedded_cat_features]
# Write features file                                                                         
df.to_csv(OUT_FILE, index=False)

Feature Selection

Once the features are generated, we can implement the above-mentioned feature selection algorithm. We start by importing some libraries, setting some parameters, and loading the features into a Pandas dataframe.

In [3]:

# Import libs
import sys
import os
import time
import numpy as np
import pandas as pd
from qci_client import QciClient
# Define some parameters
FEATURES_FILE = "medicare_features.csv"
REDUCED_DIM = 10
# Read features
df = pd.read_csv(FEATURES_FILE, low_memory=False)

We now print the feature names and get the total count of features in the dataset,

In [4]:

feature_names = list(set(df.columns) - {"npi"})
orig_dim = len(feature_names)
print(
    "Original dimension is %d; reduced dimension will be %d" % (
        orig_dim,
        REDUCED_DIM,
    )
)

Out [ ]:

Original dimension is 93; reduced dimension will be 10

We should now create the objective matrix $C$ $C$ ,

In [5]:

# Generate the objective matrix
X = np.array(df[feature_names])
C = abs(np.corrcoef(X, rowvar=False))
# Make correlation symmetric to machine precision
C = 0.5 * (C + C.transpose())
objective = C - np.eye(orig_dim)
objective = np.array(objective, dtype=np.float32)

And create the constraint matrix,

In [6]:

# Generate the constraint
cons_lhs = np.ones(shape=(orig_dim), dtype=np.float32)
cons_lhs = cons_lhs
cons_rhs = np.array([-REDUCED_DIM])
constraints = np.hstack([cons_lhs, cons_rhs])
print(constraints.shape)

Out [ ]:

(94,)

We now solve the above quadratic binary problem using QCi's Dirac-1,

In [7]:

token = "your_token"
api_url = "https://api.qci-prod.com"
qci = QciClient(api_token=token, url=api_url)

In [8]:

# Create json objects
objective_json = {
    "file_name": "objective_tutorial.json",
    "file_config": {
        "objective": {"data": objective, "num_variables": orig_dim},
    }  
}
    
constraint_json = {
    "file_name": "constraints_tutorial.json",
    "file_config": {
        "constraints": {
            "data": constraints, 
             #"num_variables": orig_dim,
             #"num_constraints": 1,
        }
    }
}
# Solve the optimizzation problem
#qci = QciClient()
objective_file_id = qci.upload_file(file=objective_json)["file_id"]
constraint_file_id = qci.upload_file(file=constraint_json)["file_id"]
# Setup job json
job_params = {
    "device_type": "dirac-1", 
    "alpha": 5.0, 
    "num_samples": 20,
}
body = qci.build_job_body(
    job_type="sample-constraint", 
    job_params=job_params,
    constraints_file_id=constraint_file_id, 
    objective_file_id=objective_file_id,
    job_name=f"tutorial_eqc1",
    job_tags=["tutorial_eqc1"],
)
# Run the job
job_response_json = qci.process_job(job_body=body)
print(job_response_json)
results = job_response_json["results"]
energies = results["energies"]
samples = results["solutions"]
is_feasibles = results["feasibilities"]
if True:
    print("Energies:", energies)        
# Pick a feasible solution with lowest energy                                             
# The sample solutions are sorted by energy                                               
sol = None
for i, item in enumerate(samples):
    sol = item
    is_feasible = is_feasibles[i]
    if is_feasible:
        break
if not is_feasible:
    print("Solution is not feasible!")
assert sol is not None, "No feasible solution found!"
assert len(sol) == orig_dim, "Inconsistent solution size!"
assert sum(sol) == REDUCED_DIM, "Solution is not feasible!"

Out [ ]:

2024-05-08 10:28:12 - Dirac allocation balance = 0 s (unmetered)
2024-05-08 10:28:12 - Job submitted: job_id='663bb62cd448b017e54f94bd'
2024-05-08 10:28:12 - QUEUED
2024-05-08 10:28:15 - RUNNING
2024-05-08 10:48:23 - COMPLETED
2024-05-08 10:48:26 - Dirac allocation balance = 0 s (unmetered)
{'job_info': {'job_id': '663bb62cd448b017e54f94bd', 'job_submission': {'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1'], 'problem_config': {'quadratic_linearly_constrained_binary_optimization': {'constraints_file_id': '663bb62c98263204a3657526', 'objective_file_id': '663bb62b98263204a3657524', 'alpha': 5, 'atol': 1e-10}}, 'device_config': {'dirac-1': {'num_samples': 20}}}, 'job_status': {'submitted_at_rfc3339nano': '2024-05-08T17:28:12.687Z', 'queued_at_rfc3339nano': '2024-05-08T17:28:12.689Z', 'running_at_rfc3339nano': '2024-05-08T17:28:13.046Z', 'completed_at_rfc3339nano': '2024-05-08T17:48:21.532Z'}, 'job_result': {'file_id': '663bbae598263204a3657528', 'device_usage_s': 1128}}, 'status': 'COMPLETED', 'results': {'counts': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'energies': [-497.8916331699111, -497.8896800449111, -497.8896800449111, -497.8887034824111, -497.8847972324111, -497.8847972324111, -497.8847972324111, -497.8838206699111, -497.8789378574111, -497.8623362949111, -497.8584300449111, -497.8515941074111, -497.8291331699111, -497.8125316074111, -497.7988597324111, -497.7949534824111, -497.7773753574111, -497.7773753574111, -497.7724925449111, -497.7724925449111], 'feasibilities': [True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True], 'objective_values': [2.108758625341579, 2.109875851077959, 2.109875851077959, 2.1111518519464876, 2.1151239282917222, 2.1151239282917227, 2.115212611388415, 2.1162998459767546, 2.121476054191589, 2.13726658251835, 2.1418038606643672, 2.148616509046405, 2.1707787389168516, 2.1876401392510156, 2.2009708418045193, 2.2050841469317675, 2.2229102615965526, 2.2229102615965526, 2.227802827401319, 2.227861546212807], 'solutions': [[0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0], [1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0], [1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]]}}
Energies: [-497.8916331699111, -497.8896800449111, -497.8896800449111, -497.8887034824111, -497.8847972324111, -497.8847972324111, -497.8847972324111, -497.8838206699111, -497.8789378574111, -497.8623362949111, -497.8584300449111, -497.8515941074111, -497.8291331699111, -497.8125316074111, -497.7988597324111, -497.7949534824111, -497.7773753574111, -497.7773753574111, -497.7724925449111, -497.7724925449111]

Finally, we can print the list of selected variables,

In [9]:

selected_vars = []
for i in range(orig_dim):
    if sol[i] > 0:
        selected_vars.append(feature_names[i])
print(selected_vars)

Out [ ]:

['nppes_provider_city_opioid_drug_cost', 'beneficiary_race_black_count', 'nppes_provider_gender_opioid_drug_cost', 'specialty_description_opioid_drug_cost', 'nppes_provider_mi_opioid_drug_cost', 'nppes_entity_code_opioid_prescriber_rate', 'beneficiary_race_nat_ind_count', 'antipsych_drug_cost_ge65', 'beneficiary_average_risk_score', 'beneficiary_race_asian_pi_count']

Conclusion

This tutorial is an example of an application of quadratic linearly-constrained binary optimization. A tutorial of which is available here and can give a number of other applications of this formalism. Another direction is to examine what can be done with the unconstrained version. Of course, you could also try applying Dirac to one of your own problems.

Like portfolio optimization, QBoost, and dimensionality reduction, this tutorial is a variation on the same theme of taking advantage of the correlation structure of an underlying data set. Such problems keep arising both because they have important applications, and because they are naturally expressed as QUBOs, given the importance of two-body correlations. Trying one of these tutorials is a natural next step.