FoundryProductsTechnologyCompanyInvestor relationsResource libraryNews
Contact us
Resource library
    Understanding QLCBOs home
    QLCBO formulation
    Portfolio optimization on Dirac
    Quadratic Assignment on Dirac
    Feature Selection on Dirac
    Traveling Salesman on Dirac
    Set Partitioning on Dirac

Couldn’t find what you are looking for? Reach out to technical support.

Contact support
Privacy PolicyCookie PolicyTerms of UseForward Looking StatementsAccessibility Statement
Terms and Conditions of SaleEnd User License Agreement

© 2018-2026 Quantum Computing Inc.

Download

Feature Selection on Dirac

Device: Dirac-1

Introduction

In machine learning problems, we often have to start with a large number of features. We need a feature selection technique that can discover a relatively small subset of the most relevant features. In what follows, we present a tutorial on using QCi's technology to select a set of features by minimizing their inter-correlation. This is important because highly correlated features will, by definition, contain a large amount of redundant information. This approach can be used with any unsupervised machine learning approach, such as anomaly detection and clustering algorithms. Unlike QBoost this is a pre-processing method for other tools, rather than an entire algorithm in itself. The problem in QBoost is to build the best classifier without constraint on the number of weak classifiers. The problem here is rather "which kkk features give the most independent information about these data?".

Importance

Within machine learning, a well known issue is the "curse of dimensionality", a space of data points with coordinates defined by different features. The dimension of this space grows with the number of features. For a human it is difficult to visualize beyond two or three dimensions, and although computers can do better by understanding high dimensional models, added dimensions still makes the problems more difficult to approach numerically. One particular problem which occurs especially when the number of dimensions exceeds the number of samples is that it is very easy to overfit. In other words, it can learn features that are actually just a statistical accident for the samples used, rather than a real underlying structure. Not every dimension is going to be equally useful in understanding the data, and some may contain effectively the same information. In a case where two features contain essentially the same information, then it only makes sense to keep one, even if the information is very important. When more than two are present, finding which to keep can become a complex optimization problem, of exactly the type our Dirac-1 device has been built to solve.

Applications

Feature selection has a number of applications in most areas where machine learning can be applied. We choose a medicare prescription dataset for the example, but many other applications could be studied as well. One specific example is image classification, since there are nearly limitless types of data that can be extracted from an image and deciding which features can be useful is a highly important task. For example, in this work, it was found that only a small fraction of statistical features of an aerial image were relevant for determining if it contained a house, making feature selection very important for successful identification. A very different application is disease prediction based on genetic markers. In this case, feature selection is extremely important because genetic information contains a huge number of features, but it is difficult to obtain a large sample size for these kinds of studies. Naively using many more features than the number of samples is likely to make learning highly prone to overfitting. Another very different application is industrial fault diagnosis. As with the other cases, data from various sensors in an industrial facility (a chemical plant for example) is plentiful. Selecting the correct features in these data to focus analysis on is therefore crucial, particularly when trying to distinguish between faults which are similar, or highly correlated.

Methodology

Let us have a dataset with NNN samples and ddd features, represented by a N×dN \times dN×d matrix FFF. Moreover, let us each column of FFF by fj\bf{f}_{j}fj​ where j∈1,...,dj \in {1,...,d}j∈1,...,d, representing all samples for feature jjj. We can choose a subset of features of size d′d^\primed′ (d′<dd^\prime < dd′<d) such that the inter-correlation of the subset is minimal. We have

min⁡x∑i=1d∑j=1dCijxixj\min_{\bf{x}} \sum_{i=1}^{d} \sum_{j=1}^{d} C_{ij} x_{i} x_{j}minx​∑i=1d​∑j=1d​Cij​xi​xj​

where

Cij=∣corr(fi,fj)∣C_{ij} = |corr(\bf{f_{i}}, \bf{f_{j}})|Cij​=∣corr(fi​,fj​)∣

where corrcorrcorr denotes a correlation function such as the Pearson correlation, and xi∈{0,1}x_{i} \in \{0, 1\}xi​∈{0,1} is a binary variable indicating inclusion or exclusion of feature iii. Obviously, Cii=1C_{ii} = 1Cii​=1. The above minimization problem is subject to a constraint,

∑i=1dxi=d′\sum_{i=1}^{d} x_{i} = d^\prime∑i=1d​xi​=d′

We can exclude the diagonal elements of CCC as they always add up to d′d^\primed′. In the matrix form we have,

min⁡xxT(C−I)x\min_{\bf{x}} \bf{{x}^{T}} (C - I) \bf{{x}}minx​xT(C−I)x

subject to the above constraint. Note that III is a d×dd \times dd×d identity matrix. Note too that we have assumed that the reduced dimension d′d^\primed′ is assumed to be given in the above approach.

Medicare Prescription Data

We implemented this approach using a publically available set of data on prescription of opioids in the United States. The dataset can be found at https://www.cms.gov/data-research/statistics-trends-and-reports/medicare-provider-utilization-payment-data/part-d-prescriber

Clean data

We start by cleaning the dataset.

In [1]:

  • import pandas as pd
  • # Input
  • INP_FILE = "Medicare_Provider_Utilization_and_Payment_Data__Part_D_Prescriber_Summary_Table_CY2014__50001-NNN__ANON.csv"
  • OUT_FILE = "cleaned_medicare_data.csv"
  • CON_VARS = [
  • "total_claim_count",
  • "total_30_day_fill_count",
  • "total_drug_cost",
  • "total_day_supply",
  • "bene_count",
  • "total_claim_count_ge65",
  • "total_30_day_fill_count_ge65",
  • "total_drug_cost_ge65",
  • "total_day_supply_ge65",
  • "bene_count_ge65",
  • "brand_claim_count",
  • "brand_drug_cost",
  • "generic_claim_count",
  • "generic_drug_cost",
  • "other_claim_count",
  • "other_drug_cost",
  • "mapd_claim_count",
  • "mapd_drug_cost",
  • "pdp_claim_count",
  • "pdp_drug_cost",
  • "lis_claim_count",
  • "lis_drug_cost",
  • "nonlis_claim_count",
  • "nonlis_drug_cost",
  • "opioid_claim_count",
  • "opioid_drug_cost",
  • "opioid_day_supply",
  • "opioid_bene_count",
  • "opioid_prescriber_rate",
  • "antibiotic_claim_count",
  • "antibiotic_drug_cost",
  • "antibiotic_bene_count",
  • "hrm_claim_count_ge65",
  • "hrm_drug_cost_ge65",
  • "hrm_bene_count_ge65",
  • "antipsych_claim_count_ge65",
  • "antipsych_drug_cost_ge65",
  • "antipsych_bene_count_ge65",
  • "average_age_of_beneficiaries",
  • "beneficiary_age_less_65_count",
  • "beneficiary_age_65_74_count",
  • "beneficiary_age_75_84_count",
  • "beneficiary_age_greater_84_count",
  • "beneficiary_female_count",
  • "beneficiary_male_count",
  • "beneficiary_race_white_count",
  • "beneficiary_race_black_count",
  • "beneficiary_race_asian_pi_count",
  • "beneficiary_race_hispanic_count",
  • "beneficiary_race_nat_ind_count",
  • "beneficiary_race_other_count",
  • "beneficiary_nondual_count",
  • "beneficiary_dual_count",
  • "beneficiary_average_risk_score",
  • ]
  • VALID_PROVIDER_MI = [
  • "A",
  • "M",
  • "J",
  • "L",
  • "R",
  • "S",
  • "E",
  • "D",
  • "C",
  • "B",
  • "K",
  • "P",
  • "W",
  • "H",
  • "T",
  • "G",
  • "F",
  • "N",
  • "V",
  • "I",
  • "O",
  • "Y",
  • "Z",
  • "U",
  • "Q",
  • "X",
  • ]
  • VALID_GEN = ["F", "M", "Other", "Unknown"]
  • VALID_ENTITIES = ["I", "O"]
  • VALID_DESC_FLAGS = ["S", "T"]
  • VALID_ENROLLS = ["E", "N", "O"]
  • # Some utilities
  • def convert_to_float(x):
  • try:
  • return float(x)
  • except:
  • return None
  • def convert_to_int(x):
  • try:
  • return int(float(x))
  • except:
  • return None
  • # Read data
  • df = pd.read_csv(INP_FILE, on_bad_lines = "skip", low_memory=False)
  • # Clean categorical variables
  • df["nppes_provider_mi"] = df["nppes_provider_mi"].fillna("Unknown")
  • df["nppes_provider_mi"] = df["nppes_provider_mi"].apply(
  • lambda x: x if x in VALID_PROVIDER_MI else "Unknown"
  • )
  • df["nppes_credentials"] = df["nppes_credentials"].fillna("Unknown")
  • df["nppes_credentials"] = df["nppes_credentials"].apply(
  • lambda x: str(x).replace(".", "")
  • )
  • cred_hash = {
  • "MEDICAL DOCTOR": "MD",
  • "NURSE PRACTITIONER": "NP",
  • }
  • df["nppes_credentials"] = df["nppes_credentials"].apply(
  • lambda x: cred_hash[x] if x in cred_hash else x,
  • )
  • df["nppes_provider_gender"] = df["nppes_provider_gender"].fillna("Unknown")
  • df["nppes_provider_gender"] = df["nppes_provider_gender"].apply(
  • lambda x: x if x in VALID_GEN else "Other",
  • )
  • df["nppes_entity_code"] = df["nppes_entity_code"].apply(
  • lambda x: x if x in VALID_ENTITIES else "Unknown",
  • )
  • df["nppes_provider_zip5"] = df["nppes_provider_zip5"].fillna("Unknown")
  • df["nppes_provider_country"] = df["nppes_provider_country"].apply(
  • lambda x: "US" if x == "US" else "Other",
  • )
  • df["description_flag"] = df["description_flag"].apply(
  • lambda x: x if x in VALID_DESC_FLAGS else "Unknown",
  • )
  • df["medicare_prvdr_enroll_status"] = df["medicare_prvdr_enroll_status"].apply(
  • lambda x: x if x in VALID_ENROLLS else "Unknown",
  • )
  • # Treat missing beneficiary count as it cannot be zero
  • df["bene_count"] = df["bene_count"].apply(
  • convert_to_int
  • ).fillna(-1)
  • tmp_df = df.groupby(
  • "specialty_description", as_index=False,
  • )["bene_count"].mean()
  • bene_count_hash = dict(
  • zip(
  • tmp_df["specialty_description"],
  • tmp_df["bene_count"],
  • )
  • )
  • df["bene_count"] = df.apply(
  • lambda x: x["bene_count"] if x[
  • "bene_count"
  • ] > 0 else bene_count_hash[
  • x["specialty_description"]
  • ],
  • axis=1,
  • )
  • # Treat continuous variables
  • for item in CON_VARS:
  • df[item] = df[item].apply(
  • convert_to_float
  • ).fillna(0.0)
  • # Filter out invalid states
  • df = df[
  • ~df["nppes_provider_state"].isin(
  • ["XX", "E", "N", "S"]
  • )
  • ]
  • # Output
  • df.to_csv(OUT_FILE, index=False)

Generate features

We then generate features. The categorical features are encoded using the average value of a few important variables in each category.

In [2]:

  • import pandas as pd
  • # Input
  • INP_FILE = "cleaned_medicare_data.csv"
  • OUT_FILE = "medicare_features.csv"
  • # Set some parameters
  • CAT_VARS = [
  • "nppes_provider_mi",
  • #"nppes_credentials", # This is rather messy, so ignoring it.
  • "nppes_provider_gender",
  • "nppes_entity_code",
  • "nppes_provider_city",
  • "nppes_provider_zip5",
  • #"nppes_provider_country", # Almost all cases are US
  • "specialty_description",
  • "medicare_prvdr_enroll_status",
  • "nppes_provider_state",
  • ]
  • CON_VARS = [
  • "total_claim_count",
  • "total_30_day_fill_count",
  • "total_drug_cost",
  • "total_day_supply",
  • "bene_count",
  • "total_claim_count_ge65",
  • "total_30_day_fill_count_ge65",
  • "total_drug_cost_ge65",
  • "total_day_supply_ge65",
  • "bene_count_ge65",
  • "brand_claim_count",
  • "brand_drug_cost",
  • "generic_claim_count",
  • "generic_drug_cost",
  • "other_claim_count",
  • "other_drug_cost",
  • "mapd_claim_count",
  • "mapd_drug_cost",
  • "pdp_claim_count",
  • "pdp_drug_cost",
  • "lis_claim_count",
  • "lis_drug_cost",
  • "nonlis_claim_count",
  • "nonlis_drug_cost",
  • "opioid_claim_count",
  • "opioid_drug_cost",
  • "opioid_day_supply",
  • "opioid_bene_count",
  • "antibiotic_claim_count",
  • "antibiotic_drug_cost",
  • "antibiotic_bene_count",
  • "hrm_claim_count_ge65",
  • "hrm_drug_cost_ge65",
  • "hrm_bene_count_ge65",
  • "antipsych_claim_count_ge65",
  • "antipsych_drug_cost_ge65",
  • "antipsych_bene_count_ge65",
  • "average_age_of_beneficiaries",
  • "beneficiary_age_less_65_count",
  • "beneficiary_age_65_74_count",
  • "beneficiary_age_75_84_count",
  • "beneficiary_age_greater_84_count",
  • "beneficiary_female_count",
  • "beneficiary_male_count",
  • "beneficiary_race_white_count",
  • "beneficiary_race_black_count",
  • "beneficiary_race_asian_pi_count",
  • "beneficiary_race_hispanic_count",
  • "beneficiary_race_nat_ind_count",
  • "beneficiary_race_other_count",
  • "beneficiary_nondual_count",
  • "beneficiary_dual_count",
  • "beneficiary_average_risk_score",
  • ]
  • # Read and clean data
  • df = pd.read_csv(INP_FILE, low_memory=False)
  • # Embed categorical features
  • embedded_cat_features = []
  • for item in CAT_VARS:
  • tmp_df = df.groupby(item, as_index=False).agg(
  • {
  • "opioid_claim_count": "mean",
  • "opioid_drug_cost": "mean",
  • "opioid_day_supply": "mean",
  • "opioid_bene_count": "mean",
  • "opioid_prescriber_rate": "mean",
  • }
  • ).rename(
  • columns={
  • "opioid_claim_count": "%s_opioid_claim_count" % item,
  • "opioid_drug_cost": "%s_opioid_drug_cost" % item,
  • "opioid_day_supply": "%s_opioid_day_supply" % item,
  • "opioid_bene_count": "%s_opioid_bene_count" % item,
  • "opioid_prescriber_rate": "%s_opioid_prescriber_rate" % item,
  • }
  • )
  • df = df.merge(tmp_df, how="left", on=item)
  • embedded_cat_features += [
  • "%s_opioid_claim_count" % item,
  • "%s_opioid_drug_cost" % item,
  • "%s_opioid_day_supply" % item,
  • "%s_opioid_bene_count" % item,
  • "%s_opioid_prescriber_rate" % item,
  • ]
  • # Drop unembedded categorical variables and some others
  • df = df[["npi"] + CON_VARS + embedded_cat_features]
  • # Write features file
  • df.to_csv(OUT_FILE, index=False)

Feature Selection

Once the features are generated, we can implement the above-mentioned feature selection algorithm. We start by importing some libraries, setting some parameters, and loading the features into a Pandas dataframe.

In [3]:

  • # Import libs
  • import sys
  • import os
  • import time
  • import numpy as np
  • import pandas as pd
  • from qci_client import QciClient
  • # Define some parameters
  • FEATURES_FILE = "medicare_features.csv"
  • REDUCED_DIM = 10
  • # Read features
  • df = pd.read_csv(FEATURES_FILE, low_memory=False)

We now print the feature names and get the total count of features in the dataset,

In [4]:

  • feature_names = list(set(df.columns) - {"npi"})
  • orig_dim = len(feature_names)
  • print(
  • "Original dimension is %d; reduced dimension will be %d" % (
  • orig_dim,
  • REDUCED_DIM,
  • )
  • )

Out [ ]:

Original dimension is 93; reduced dimension will be 10

We should now create the objective matrix CCC,

In [5]:

  • # Generate the objective matrix
  • X = np.array(df[feature_names])
  • C = abs(np.corrcoef(X, rowvar=False))
  • # Make correlation symmetric to machine precision
  • C = 0.5 * (C + C.transpose())
  • objective = C - np.eye(orig_dim)
  • objective = np.array(objective, dtype=np.float32)

And create the constraint matrix,

In [6]:

  • # Generate the constraint
  • cons_lhs = np.ones(shape=(orig_dim), dtype=np.float32)
  • cons_lhs = cons_lhs
  • cons_rhs = np.array([-REDUCED_DIM])
  • constraints = np.hstack([cons_lhs, cons_rhs])
  • print(constraints.shape)

Out [ ]:

(94,)

We now solve the above quadratic binary problem using QCi's Dirac-1,

In [ ]:

  • token = "<your_secret_token>"
  • api_url = "https://api.qci-prod.com"
  • qci = QciClient(api_token=token, url=api_url)

In [8]:

  • # Create json objects
  • objective_json = {
  • "file_name": "objective_tutorial.json",
  • "file_config": {
  • "objective": {"data": objective, "num_variables": orig_dim},
  • }
  • }
  • constraint_json = {
  • "file_name": "constraints_tutorial.json",
  • "file_config": {
  • "constraints": {
  • "data": constraints,
  • #"num_variables": orig_dim,
  • #"num_constraints": 1,
  • }
  • }
  • }
  • # Solve the optimizzation problem
  • #qci = QciClient()
  • objective_file_id = qci.upload_file(file=objective_json)["file_id"]
  • constraint_file_id = qci.upload_file(file=constraint_json)["file_id"]
  • # Setup job json
  • job_params = {
  • "device_type": "dirac-1",
  • "alpha": 5.0,
  • "num_samples": 20,
  • }
  • body = qci.build_job_body(
  • job_type="sample-constraint",
  • job_params=job_params,
  • constraints_file_id=constraint_file_id,
  • objective_file_id=objective_file_id,
  • job_name=f"tutorial_eqc1",
  • job_tags=["tutorial_eqc1"],
  • )
  • # Run the job
  • job_response_json = qci.process_job(job_body=body)
  • print(job_response_json)
  • results = job_response_json["results"]
  • energies = results["energies"]
  • samples = results["solutions"]
  • is_feasibles = results["feasibilities"]
  • if True:
  • print("Energies:", energies)
  • # Pick a feasible solution with lowest energy
  • # The sample solutions are sorted by energy
  • sol = None
  • for i, item in enumerate(samples):
  • sol = item
  • is_feasible = is_feasibles[i]
  • if is_feasible:
  • break
  • if not is_feasible:
  • print("Solution is not feasible!")
  • assert sol is not None, "No feasible solution found!"
  • assert len(sol) == orig_dim, "Inconsistent solution size!"
  • assert sum(sol) == REDUCED_DIM, "Solution is not feasible!"

Out [ ]:

2024-05-08 10:28:12 - Dirac allocation balance = 0 s (unmetered)
2024-05-08 10:28:12 - Job submitted: job_id='663bb62cd448b017e54f94bd'
2024-05-08 10:28:12 - QUEUED
2024-05-08 10:28:15 - RUNNING
2024-05-08 10:48:23 - COMPLETED
2024-05-08 10:48:26 - Dirac allocation balance = 0 s (unmetered)
{'job_info': {'job_id': '663bb62cd448b017e54f94bd', 'job_submission': {'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1'], 'problem_config': {'quadratic_linearly_constrained_binary_optimization': {'constraints_file_id': '663bb62c98263204a3657526', 'objective_file_id': '663bb62b98263204a3657524', 'alpha': 5, 'atol': 1e-10}}, 'device_config': {'dirac-1': {'num_samples': 20}}}, 'job_status': {'submitted_at_rfc3339nano': '2024-05-08T17:28:12.687Z', 'queued_at_rfc3339nano': '2024-05-08T17:28:12.689Z', 'running_at_rfc3339nano': '2024-05-08T17:28:13.046Z', 'completed_at_rfc3339nano': '2024-05-08T17:48:21.532Z'}, 'job_result': {'file_id': '663bbae598263204a3657528', 'device_usage_s': 1128}}, 'status': 'COMPLETED', 'results': {'counts': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'energies': [-497.8916331699111, -497.8896800449111, -497.8896800449111, -497.8887034824111, -497.8847972324111, -497.8847972324111, -497.8847972324111, -497.8838206699111, -497.8789378574111, -497.8623362949111, -497.8584300449111, -497.8515941074111, -497.8291331699111, -497.8125316074111, -497.7988597324111, -497.7949534824111, -497.7773753574111, -497.7773753574111, -497.7724925449111, -497.7724925449111], 'feasibilities': [True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True], 'objective_values': [2.108758625341579, 2.109875851077959, 2.109875851077959, 2.1111518519464876, 2.1151239282917222, 2.1151239282917227, 2.115212611388415, 2.1162998459767546, 2.121476054191589, 2.13726658251835, 2.1418038606643672, 2.148616509046405, 2.1707787389168516, 2.1876401392510156, 2.2009708418045193, 2.2050841469317675, 2.2229102615965526, 2.2229102615965526, 2.227802827401319, 2.227861546212807], 'solutions': [[0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0], [1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0], [1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]]}}
Energies: [-497.8916331699111, -497.8896800449111, -497.8896800449111, -497.8887034824111, -497.8847972324111, -497.8847972324111, -497.8847972324111, -497.8838206699111, -497.8789378574111, -497.8623362949111, -497.8584300449111, -497.8515941074111, -497.8291331699111, -497.8125316074111, -497.7988597324111, -497.7949534824111, -497.7773753574111, -497.7773753574111, -497.7724925449111, -497.7724925449111]

Finally, we can print the list of selected variables,

In [9]:

  • selected_vars = []
  • for i in range(orig_dim):
  • if sol[i] > 0:
  • selected_vars.append(feature_names[i])
  • print(selected_vars)

Out [ ]:

['nppes_provider_city_opioid_drug_cost', 'beneficiary_race_black_count', 'nppes_provider_gender_opioid_drug_cost', 'specialty_description_opioid_drug_cost', 'nppes_provider_mi_opioid_drug_cost', 'nppes_entity_code_opioid_prescriber_rate', 'beneficiary_race_nat_ind_count', 'antipsych_drug_cost_ge65', 'beneficiary_average_risk_score', 'beneficiary_race_asian_pi_count']

Conclusion

This tutorial is an example of an application of quadratic linearly-constrained binary optimization. A tutorial of which is available here and can give a number of other applications of this formalism. Another direction is to examine what can be done with the unconstrained version. Of course, you could also try applying Dirac to one of your own problems.

Like portfolio optimization, QBoost, and dimensionality reduction, this tutorial is a variation on the same theme of taking advantage of the correlation structure of an underlying data set. Such problems keep arising both because they have important applications, and because they are naturally expressed as QUBOs, given the importance of two-body correlations. Trying one of these tutorials is a natural next step.

Previous page
Next page

Content

  • Feature Selection on Dirac
  • Device: Dirac-1
  • Introduction
  • Importance
  • Applications
  • Methodology
  • Medicare Prescription Data
  • Clean data
  • Generate features
  • Feature Selection
  • Conclusion