Feature Selection on Dirac

Device: Dirac-1

Introduction

In machine learning problems, we often have to start with a large number of features. We therefore need a feature selection technique that can discover a relatively small subset of the most relevant ones. In what follows, we present a tutorial on using QCi's technology to select a set of features by minimizing their inter-correlation; this is important because highly correlated features will, by definition, contain a large amount of redundant information. The approach can be used with any unsupervised machine learning method, such as anomaly detection or clustering. Unlike QBoost, this is a pre-processing method for other tools rather than an entire algorithm in itself: the problem in QBoost is to build the best classifier without a constraint on the number of weak classifiers, whereas the problem here is rather "which $k$ features give the most independent information about these data?"

Importance

Within machine learning, a well-known issue is the "curse of dimensionality". Data points live in a space whose coordinates are defined by the different features, so the dimension of this space grows with the number of features. Humans find it difficult to visualize beyond two or three dimensions, and although computers cope far better with high-dimensional models, added dimensions still make problems harder to approach numerically. One particular problem, which occurs especially when the number of dimensions exceeds the number of samples, is that it becomes very easy to overfit: the model can learn patterns that are merely statistical accidents of the samples used, rather than real underlying structure. Furthermore, not every dimension is equally useful for understanding the data, and some may contain effectively the same information. When two features carry essentially the same information, it only makes sense to keep one of them, even if the information is very important. When more than two are involved, finding which to keep becomes a complex optimization problem, of exactly the type our Dirac-1 device has been built to solve.
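As a quick illustration of how easily this happens, the following sketch (our own addition, using synthetic data and only numpy) fits an ordinary least-squares model to pure noise with more features than samples; the fit on the training data is essentially perfect even though there is no structure to learn:

import numpy as np

rng = np.random.default_rng(42)

# More features (50) than samples (10)
n_samples, n_features = 10, 50
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)  # targets are pure noise

# With d > N, least squares has enough freedom to interpolate the noise
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
rmse = np.sqrt(np.mean((y - X @ coef) ** 2))

print("Training RMSE: %.2e" % rmse)  # near zero: a statistical accident, not structure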

Applications

Feature selection has applications in most areas where machine learning is applied. We chose a Medicare prescription dataset for the example below, but many other applications could be studied as well. One specific example is image classification, since there are nearly limitless features that can be extracted from an image, and deciding which are useful is a highly important task; in one study of aerial imagery, for instance, only a small fraction of an image's statistical features turned out to be relevant for determining whether it contained a house, making feature selection essential for successful identification. A very different application is disease prediction based on genetic markers. In this case, feature selection is extremely important because genetic data contain a huge number of features, while it is difficult to obtain a large sample size for such studies; naively using far more features than samples makes learning highly prone to overfitting. Yet another application is industrial fault diagnosis. Here, data from the various sensors in an industrial facility (a chemical plant, for example) are plentiful, so selecting the right features to focus the analysis on is crucial, particularly when trying to distinguish between faults that are similar or highly correlated.

Methodology

Let us have a dataset with $N$ samples and $d$ features, represented by an $N \times d$ matrix $F$. Moreover, let us denote each column of $F$ by $\mathbf{f}_{j}$, where $j \in \{1,...,d\}$, representing all samples for feature $j$. We want to choose a subset of features of size $d^\prime$ ($d^\prime < d$) such that the inter-correlation of the subset is minimal. We have

$$\min_{\mathbf{x}} \sum_{i=1}^{d} \sum_{j=1}^{d} C_{ij} x_{i} x_{j}$$

where

$$C_{ij} = \left| \mathrm{corr}(\mathbf{f}_{i}, \mathbf{f}_{j}) \right|$$

where $\mathrm{corr}$ denotes a correlation function, such as the Pearson correlation, and $x_{i} \in \{0, 1\}$ is a binary variable indicating inclusion or exclusion of feature $i$. Obviously, $C_{ii} = 1$. The above minimization problem is subject to a constraint,

$$\sum_{i=1}^{d} x_{i} = d^\prime$$

We can exclude the diagonal elements of $C$, as under this constraint they always add up to $d^\prime$. In matrix form we have,

$$\min_{\mathbf{x}} \mathbf{x}^{T} (C - I) \mathbf{x}$$

subject to the above constraint, where $I$ is the $d \times d$ identity matrix. Note that the reduced dimension $d^\prime$ is assumed to be given in this approach.
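To make the formulation concrete, here is a minimal sketch (our own addition, with synthetic data; brute-force enumeration is only feasible for small $d$) that builds $C$ and finds the size-$d^\prime$ subset minimizing $\mathbf{x}^{T} (C - I) \mathbf{x}$:

import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N = 200 samples, d = 6 features; feature 1 nearly duplicates feature 0
N, d, d_prime = 200, 6, 3
F = rng.normal(size=(N, d))
F[:, 1] = F[:, 0] + 0.01 * rng.normal(size=N)

C = np.abs(np.corrcoef(F, rowvar=False))  # C_ij = |corr(f_i, f_j)|
Q = C - np.eye(d)                         # drop the constant diagonal

# Enumerate all subsets of size d_prime and keep the least correlated one
best_subset, best_val = None, np.inf
for subset in itertools.combinations(range(d), d_prime):
    x = np.zeros(d)
    x[list(subset)] = 1.0
    val = x @ Q @ x
    if val < best_val:
        best_subset, best_val = subset, val

print("Best subset:", best_subset)
# The near-duplicate pair (0, 1) should never appear together in the optimum.

Dirac-1 solves exactly this minimization, but at scales where enumeration is hopeless.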

Medicare Prescription Data

We implemented this approach using a publicly available dataset on opioid prescriptions in the United States. The dataset can be found at https://www.cms.gov/data-research/statistics-trends-and-reports/medicare-provider-utilization-payment-data/part-d-prescriber

Clean data

We start by cleaning the dataset.

In [1]:

import pandas as pd

# Input
INP_FILE = "Medicare_Provider_Utilization_and_Payment_Data__Part_D_Prescriber_Summary_Table_CY2014__50001-NNN__ANON.csv"
OUT_FILE = "cleaned_medicare_data.csv"

CON_VARS = [
    "total_claim_count",
    "total_30_day_fill_count",
    "total_drug_cost",
    "total_day_supply",
    "bene_count",
    "total_claim_count_ge65",
    "total_30_day_fill_count_ge65",
    "total_drug_cost_ge65",
    "total_day_supply_ge65",
    "bene_count_ge65",
    "brand_claim_count",
    "brand_drug_cost",
    "generic_claim_count",
    "generic_drug_cost",
    "other_claim_count",
    "other_drug_cost",
    "mapd_claim_count",
    "mapd_drug_cost",
    "pdp_claim_count",
    "pdp_drug_cost",
    "lis_claim_count",
    "lis_drug_cost",
    "nonlis_claim_count",
    "nonlis_drug_cost",
    "opioid_claim_count",
    "opioid_drug_cost",
    "opioid_day_supply",
    "opioid_bene_count",
    "opioid_prescriber_rate",
    "antibiotic_claim_count",
    "antibiotic_drug_cost",
    "antibiotic_bene_count",
    "hrm_claim_count_ge65",
    "hrm_drug_cost_ge65",
    "hrm_bene_count_ge65",
    "antipsych_claim_count_ge65",
    "antipsych_drug_cost_ge65",
    "antipsych_bene_count_ge65",
    "average_age_of_beneficiaries",
    "beneficiary_age_less_65_count",
    "beneficiary_age_65_74_count",
    "beneficiary_age_75_84_count",
    "beneficiary_age_greater_84_count",
    "beneficiary_female_count",
    "beneficiary_male_count",
    "beneficiary_race_white_count",
    "beneficiary_race_black_count",
    "beneficiary_race_asian_pi_count",
    "beneficiary_race_hispanic_count",
    "beneficiary_race_nat_ind_count",
    "beneficiary_race_other_count",
    "beneficiary_nondual_count",
    "beneficiary_dual_count",
    "beneficiary_average_risk_score",
]
VALID_PROVIDER_MI = [
    "A", "M", "J", "L", "R", "S", "E", "D", "C", "B", "K", "P", "W",
    "H", "T", "G", "F", "N", "V", "I", "O", "Y", "Z", "U", "Q", "X",
]
VALID_GEN = ["F", "M", "Other", "Unknown"]
VALID_ENTITIES = ["I", "O"]
VALID_DESC_FLAGS = ["S", "T"]
VALID_ENROLLS = ["E", "N", "O"]

# Some utilities
def convert_to_float(x):
    try:
        return float(x)
    except (TypeError, ValueError):
        return None

def convert_to_int(x):
    try:
        return int(float(x))
    except (TypeError, ValueError):
        return None

# Read data
df = pd.read_csv(INP_FILE, on_bad_lines="skip", low_memory=False)

# Clean categorical variables
df["nppes_provider_mi"] = df["nppes_provider_mi"].fillna("Unknown")
df["nppes_provider_mi"] = df["nppes_provider_mi"].apply(
    lambda x: x if x in VALID_PROVIDER_MI else "Unknown"
)
df["nppes_credentials"] = df["nppes_credentials"].fillna("Unknown")
df["nppes_credentials"] = df["nppes_credentials"].apply(
    lambda x: str(x).replace(".", "")
)
cred_hash = {
    "MEDICAL DOCTOR": "MD",
    "NURSE PRACTITIONER": "NP",
}
df["nppes_credentials"] = df["nppes_credentials"].apply(
    lambda x: cred_hash[x] if x in cred_hash else x,
)
df["nppes_provider_gender"] = df["nppes_provider_gender"].fillna("Unknown")
df["nppes_provider_gender"] = df["nppes_provider_gender"].apply(
    lambda x: x if x in VALID_GEN else "Other",
)
df["nppes_entity_code"] = df["nppes_entity_code"].apply(
    lambda x: x if x in VALID_ENTITIES else "Unknown",
)
df["nppes_provider_zip5"] = df["nppes_provider_zip5"].fillna("Unknown")
df["nppes_provider_country"] = df["nppes_provider_country"].apply(
    lambda x: "US" if x == "US" else "Other",
)
df["description_flag"] = df["description_flag"].apply(
    lambda x: x if x in VALID_DESC_FLAGS else "Unknown",
)
df["medicare_prvdr_enroll_status"] = df["medicare_prvdr_enroll_status"].apply(
    lambda x: x if x in VALID_ENROLLS else "Unknown",
)

# Impute missing beneficiary counts (a true count cannot be zero)
# with the mean count of the provider's specialty
df["bene_count"] = df["bene_count"].apply(convert_to_int).fillna(-1)
tmp_df = df.groupby(
    "specialty_description", as_index=False,
)["bene_count"].mean()
bene_count_hash = dict(
    zip(
        tmp_df["specialty_description"],
        tmp_df["bene_count"],
    )
)
df["bene_count"] = df.apply(
    lambda x: x["bene_count"] if x["bene_count"] > 0
    else bene_count_hash[x["specialty_description"]],
    axis=1,
)

# Treat continuous variables
for item in CON_VARS:
    df[item] = df[item].apply(convert_to_float).fillna(0.0)

# Filter out invalid states
df = df[~df["nppes_provider_state"].isin(["XX", "E", "N", "S"])]

# Output
df.to_csv(OUT_FILE, index=False)

Generate features

We then generate the features. Each categorical feature is encoded using the average value of a few important opioid-related variables within each category, as illustrated below.
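Before the full code, here is a minimal sketch of the idea (often called mean or target encoding) on a toy dataframe with hypothetical values:

import pandas as pd

toy = pd.DataFrame({
    "specialty": ["A", "A", "B", "B", "B"],
    "opioid_claim_count": [10, 14, 100, 90, 110],
})

# Replace each category with the mean of a relevant variable within it
toy["specialty_opioid_claim_count"] = toy.groupby(
    "specialty"
)["opioid_claim_count"].transform("mean")

print(toy)  # category "A" maps to 12.0, category "B" to 100.0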

In [2]:

import pandas as pd

# Input
INP_FILE = "cleaned_medicare_data.csv"
OUT_FILE = "medicare_features.csv"

# Set some parameters
CAT_VARS = [
    "nppes_provider_mi",
    # "nppes_credentials",  # This is rather messy, so ignoring it.
    "nppes_provider_gender",
    "nppes_entity_code",
    "nppes_provider_city",
    "nppes_provider_zip5",
    # "nppes_provider_country",  # Almost all cases are US
    "specialty_description",
    "medicare_prvdr_enroll_status",
    "nppes_provider_state",
]
CON_VARS = [
    "total_claim_count",
    "total_30_day_fill_count",
    "total_drug_cost",
    "total_day_supply",
    "bene_count",
    "total_claim_count_ge65",
    "total_30_day_fill_count_ge65",
    "total_drug_cost_ge65",
    "total_day_supply_ge65",
    "bene_count_ge65",
    "brand_claim_count",
    "brand_drug_cost",
    "generic_claim_count",
    "generic_drug_cost",
    "other_claim_count",
    "other_drug_cost",
    "mapd_claim_count",
    "mapd_drug_cost",
    "pdp_claim_count",
    "pdp_drug_cost",
    "lis_claim_count",
    "lis_drug_cost",
    "nonlis_claim_count",
    "nonlis_drug_cost",
    "opioid_claim_count",
    "opioid_drug_cost",
    "opioid_day_supply",
    "opioid_bene_count",
    "antibiotic_claim_count",
    "antibiotic_drug_cost",
    "antibiotic_bene_count",
    "hrm_claim_count_ge65",
    "hrm_drug_cost_ge65",
    "hrm_bene_count_ge65",
    "antipsych_claim_count_ge65",
    "antipsych_drug_cost_ge65",
    "antipsych_bene_count_ge65",
    "average_age_of_beneficiaries",
    "beneficiary_age_less_65_count",
    "beneficiary_age_65_74_count",
    "beneficiary_age_75_84_count",
    "beneficiary_age_greater_84_count",
    "beneficiary_female_count",
    "beneficiary_male_count",
    "beneficiary_race_white_count",
    "beneficiary_race_black_count",
    "beneficiary_race_asian_pi_count",
    "beneficiary_race_hispanic_count",
    "beneficiary_race_nat_ind_count",
    "beneficiary_race_other_count",
    "beneficiary_nondual_count",
    "beneficiary_dual_count",
    "beneficiary_average_risk_score",
]

# Read cleaned data
df = pd.read_csv(INP_FILE, low_memory=False)

# Embed categorical features
embedded_cat_features = []
for item in CAT_VARS:
    tmp_df = df.groupby(item, as_index=False).agg(
        {
            "opioid_claim_count": "mean",
            "opioid_drug_cost": "mean",
            "opioid_day_supply": "mean",
            "opioid_bene_count": "mean",
            "opioid_prescriber_rate": "mean",
        }
    ).rename(
        columns={
            "opioid_claim_count": "%s_opioid_claim_count" % item,
            "opioid_drug_cost": "%s_opioid_drug_cost" % item,
            "opioid_day_supply": "%s_opioid_day_supply" % item,
            "opioid_bene_count": "%s_opioid_bene_count" % item,
            "opioid_prescriber_rate": "%s_opioid_prescriber_rate" % item,
        }
    )
    df = df.merge(tmp_df, how="left", on=item)
    embedded_cat_features += [
        "%s_opioid_claim_count" % item,
        "%s_opioid_drug_cost" % item,
        "%s_opioid_day_supply" % item,
        "%s_opioid_bene_count" % item,
        "%s_opioid_prescriber_rate" % item,
    ]

# Drop unembedded categorical variables and some others
df = df[["npi"] + CON_VARS + embedded_cat_features]

# Write features file
df.to_csv(OUT_FILE, index=False)

Feature Selection

Once the features are generated, we can implement the above-mentioned feature selection algorithm. We start by importing some libraries, setting some parameters, and loading the features into a Pandas dataframe.

In [3]:

# Import libs
import numpy as np
import pandas as pd

from qci_client import QciClient

# Define some parameters
FEATURES_FILE = "medicare_features.csv"
REDUCED_DIM = 10

# Read features
df = pd.read_csv(FEATURES_FILE, low_memory=False)

We now collect the feature names and print the total count of features in the dataset,

In [4]:

feature_names = list(set(df.columns) - {"npi"})
orig_dim = len(feature_names)
print(
    "Original dimension is %d; reduced dimension will be %d" % (
        orig_dim,
        REDUCED_DIM,
    )
)

Out [ ]:

Original dimension is 93; reduced dimension will be 10

We now create the objective matrix $C$,

In [5]:

# Generate the objective matrix
X = np.array(df[feature_names])
C = np.abs(np.corrcoef(X, rowvar=False))

# Make the correlation matrix symmetric to machine precision
C = 0.5 * (C + C.transpose())

# Drop the diagonal, which contributes a constant d' under the constraint
objective = C - np.eye(orig_dim)
objective = np.array(objective, dtype=np.float32)
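One caveat (our own note, not part of the original pipeline): np.corrcoef returns NaN for any zero-variance (constant) column, which would silently corrupt the objective. A minimal guard, reusing the variables from the cell above, is to drop such columns first:

# Guard against constant columns, whose correlations are undefined (NaN)
keep = X.std(axis=0) > 0
if not keep.all():
    feature_names = [f for f, k in zip(feature_names, keep) if k]
    X = X[:, keep]
    orig_dim = len(feature_names)
    C = np.abs(np.corrcoef(X, rowvar=False))
    C = 0.5 * (C + C.transpose())
    objective = np.array(C - np.eye(orig_dim), dtype=np.float32)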

Next we create the constraint. Note how it is encoded: the left-hand-side coefficients (a row of ones) are stacked together with the negated right-hand side, so the row $[1, 1, \ldots, 1, -d^\prime]$ represents $\sum_{i=1}^{d} x_{i} = d^\prime$,

In [6]:

# Generate the constraint sum_i x_i = REDUCED_DIM as the single row
# [1, 1, ..., 1, -REDUCED_DIM] (LHS coefficients, then negated RHS)
cons_lhs = np.ones(shape=(orig_dim), dtype=np.float32)
cons_rhs = np.array([-REDUCED_DIM])
constraints = np.hstack([cons_lhs, cons_rhs])
print(constraints.shape)

Out [ ]:

(94,)

We now solve the above quadratic binary problem using QCi's Dirac-1,

In [7]:

token = "your_token"
api_url = "https://api.qci-prod.com"
qci = QciClient(api_token=token, url=api_url)

In [8]:

# Create JSON objects for the problem files
objective_json = {
    "file_name": "objective_tutorial.json",
    "file_config": {
        "objective": {"data": objective, "num_variables": orig_dim},
    }
}
constraint_json = {
    "file_name": "constraints_tutorial.json",
    "file_config": {
        "constraints": {
            "data": constraints,
            # "num_variables": orig_dim,
            # "num_constraints": 1,
        }
    }
}

# Upload the problem files
objective_file_id = qci.upload_file(file=objective_json)["file_id"]
constraint_file_id = qci.upload_file(file=constraint_json)["file_id"]

# Set up the job body
job_params = {
    "device_type": "dirac-1",
    "alpha": 5.0,
    "num_samples": 20,
}
body = qci.build_job_body(
    job_type="sample-constraint",
    job_params=job_params,
    constraints_file_id=constraint_file_id,
    objective_file_id=objective_file_id,
    job_name="tutorial_eqc1",
    job_tags=["tutorial_eqc1"],
)

# Run the job
job_response_json = qci.process_job(job_body=body)
print(job_response_json)

results = job_response_json["results"]
energies = results["energies"]
samples = results["solutions"]
is_feasibles = results["feasibilities"]
print("Energies:", energies)

# Pick a feasible solution with the lowest energy;
# the sample solutions are sorted by energy
sol = None
for i, item in enumerate(samples):
    sol = item
    is_feasible = is_feasibles[i]
    if is_feasible:
        break

if not is_feasible:
    print("Solution is not feasible!")

assert sol is not None, "No feasible solution found!"
assert len(sol) == orig_dim, "Inconsistent solution size!"
assert sum(sol) == REDUCED_DIM, "Solution is not feasible!"

Out [ ]:

2024-05-08 10:28:12 - Dirac allocation balance = 0 s (unmetered)
2024-05-08 10:28:12 - Job submitted: job_id='663bb62cd448b017e54f94bd'
2024-05-08 10:28:12 - QUEUED
2024-05-08 10:28:15 - RUNNING
2024-05-08 10:48:23 - COMPLETED
2024-05-08 10:48:26 - Dirac allocation balance = 0 s (unmetered)
{'job_info': {'job_id': '663bb62cd448b017e54f94bd', 'job_submission': {'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1'], 'problem_config': {'quadratic_linearly_constrained_binary_optimization': {'constraints_file_id': '663bb62c98263204a3657526', 'objective_file_id': '663bb62b98263204a3657524', 'alpha': 5, 'atol': 1e-10}}, 'device_config': {'dirac-1': {'num_samples': 20}}}, 'job_status': {'submitted_at_rfc3339nano': '2024-05-08T17:28:12.687Z', 'queued_at_rfc3339nano': '2024-05-08T17:28:12.689Z', 'running_at_rfc3339nano': '2024-05-08T17:28:13.046Z', 'completed_at_rfc3339nano': '2024-05-08T17:48:21.532Z'}, 'job_result': {'file_id': '663bbae598263204a3657528', 'device_usage_s': 1128}}, 'status': 'COMPLETED', 'results': {'counts': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'energies': [-497.8916331699111, -497.8896800449111, -497.8896800449111, -497.8887034824111, -497.8847972324111, -497.8847972324111, -497.8847972324111, -497.8838206699111, -497.8789378574111, -497.8623362949111, -497.8584300449111, -497.8515941074111, -497.8291331699111, -497.8125316074111, -497.7988597324111, -497.7949534824111, -497.7773753574111, -497.7773753574111, -497.7724925449111, -497.7724925449111], 'feasibilities': [True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True], 'objective_values': [2.108758625341579, 2.109875851077959, 2.109875851077959, 2.1111518519464876, 2.1151239282917222, 2.1151239282917227, 2.115212611388415, 2.1162998459767546, 2.121476054191589, 2.13726658251835, 2.1418038606643672, 2.148616509046405, 2.1707787389168516, 2.1876401392510156, 2.2009708418045193, 2.2050841469317675, 2.2229102615965526, 2.2229102615965526, 2.227802827401319, 2.227861546212807], 'solutions': [[0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0], [1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0], [1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]]}}
Energies: [-497.8916331699111, -497.8896800449111, -497.8896800449111, -497.8887034824111, -497.8847972324111, -497.8847972324111, -497.8847972324111, -497.8838206699111, -497.8789378574111, -497.8623362949111, -497.8584300449111, -497.8515941074111, -497.8291331699111, -497.8125316074111, -497.7988597324111, -497.7949534824111, -497.7773753574111, -497.7773753574111, -497.7724925449111, -497.7724925449111]

Finally, we can print the list of selected variables,

In [9]:

selected_vars = []
for i in range(orig_dim):
    if sol[i] > 0:
        selected_vars.append(feature_names[i])
print(selected_vars)

Out [ ]:

['nppes_provider_city_opioid_drug_cost', 'beneficiary_race_black_count', 'nppes_provider_gender_opioid_drug_cost', 'specialty_description_opioid_drug_cost', 'nppes_provider_mi_opioid_drug_cost', 'nppes_entity_code_opioid_prescriber_rate', 'beneficiary_race_nat_ind_count', 'antipsych_drug_cost_ge65', 'beneficiary_average_risk_score', 'beneficiary_race_asian_pi_count']
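As a quick sanity check (our own addition), we can compare the mean absolute off-diagonal correlation of the selected subset against that of the full feature set, reusing C and sol from the cells above:

idx = [i for i in range(orig_dim) if sol[i] > 0]
C_sel = C[np.ix_(idx, idx)]

def mean_offdiag(M):
    # Mean absolute off-diagonal entry of a square matrix
    n = M.shape[0]
    return (np.abs(M).sum() - np.trace(np.abs(M))) / (n * (n - 1))

print("Mean |corr|, all features:    %.4f" % mean_offdiag(C))
print("Mean |corr|, selected subset: %.4f" % mean_offdiag(C_sel))
# The selected subset should show noticeably lower inter-correlation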

Conclusion

This tutorial is an example of an application of quadratic linearly-constrained binary optimization; a separate tutorial on that formalism covers a number of other applications. Another direction is to examine what can be done with the unconstrained version. Of course, you could also try applying Dirac to one of your own problems.

Like portfolio optimization, QBoost, and dimensionality reduction, this tutorial is a variation on the theme of exploiting the correlation structure of an underlying dataset. Such problems keep arising both because they have important applications and because, given the importance of two-body correlations, they are naturally expressed as QUBOs. Trying one of these tutorials is a natural next step.