Dimensionality reduction on Dirac
Device: Dirac-1
Introduction
In machine learning problems, we often have to start with a large number of features. A dimensionality reduction approach can be used to reduce the feature count, especially when the features feed an unsupervised algorithm such as clustering. In what follows, we present a tutorial on using QCi's technology to implement a QUBO-based Singular Value Decomposition (SVD) algorithm.
Methodology
Principal Component Analysis (PCA) is a dimensionality reduction technique which is based on a Singular Value Decomposition (SVD) of the data matrix. Let us have a dataset with $N$ samples and $d$ features, represented by an $N \times d$ matrix $F$. We can decompose $F$ as,

$$F = U \Sigma V^T$$

where $U$ is an $N \times d$ matrix with orthonormal columns, $\Sigma$ is a $d \times d$ diagonal matrix, and $V$ is a $d \times d$ orthogonal matrix with orthonormal columns and rows. Note that the columns of $U$ form an orthonormal basis for the vector space spanned by the columns of $F$. It can be shown that the larger the $i$-th diagonal element $\sigma_i$ of $\Sigma$, the more the contribution of the basis element $u_i$, the $i$-th column of $U$. Thus, one can rank the basis elements based on the value of $\sigma_i$ and choose a subset of the most important basis elements; this yields a lower-dimensional representation of the data matrix $F$. Assuming that $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_d$, $u_1$ is said to be the first principal component of the original matrix $F$.

It can be shown that

$$F^T F \, v_i = \lambda_i v_i, \qquad \sigma_i = \sqrt{\lambda_i}$$

where $\lambda_i$ and $v_i$ denote the $i$-th eigenvalue and normalized eigenvector of the symmetric matrix $F^T F$. Note too that $v_i$ is in fact the $i$-th column of matrix $V$ and that $\sigma_i$ is the $i$-th diagonal element of matrix $\Sigma$. Finding the first principal component of $F$ is then equivalent to finding the eigenvector of $F^T F$ corresponding to its maximum eigenvalue, that is, the eigenvector of $-F^T F$ corresponding to its minimum eigenvalue. An estimation of the first principal component of $F$ can then be obtained by solving a QUBO as,

$$\min_{q} \; -q^T F^T F \, q$$

where $q$ is a $d$-dimensional binary vector. The first principal component of $F$ is,

$$u_1 = \frac{1}{\sigma_1} F v_1$$

where $v_1 = q / \|q\|$ and $\sigma_1 = \sqrt{v_1^T F^T F \, v_1}$. One can then replace each column of $F$ with its component orthogonal to $u_1$, and repeat the above-mentioned process to get the second principal component $u_2$. This can be repeated to get any desired number of principal components of $F$, say $d^\prime$, where $d^\prime < d$. To remove a principal component $u$ from a feature matrix $F$, we have,

$$F \leftarrow F - u u^T F$$

Note that this is an estimation of a full PCA, as the vector $q$ in the above QUBO is binary. This estimation is equivalent to quantizing the directions in feature space such that a potentially large but finite set of directions can be chosen by PCA; these are the directions that correspond to $d$-dimensional binary vectors $q$.
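To make the recipe concrete before applying it to real data, here is a minimal, self-contained sketch that runs the same algorithm on a toy matrix, brute-forcing the QUBO classically in place of Dirac-1. Everything in it (the toy matrix, the brute-force loop, the overlap check) is an illustrative assumption, not part of the tutorial's workflow.

import itertools

import numpy as np

rng = np.random.default_rng(42)
F = rng.normal(size=(50, 6))  # toy data: 50 samples, 6 features

# QUBO matrix: minimizing q^T Q q maximizes q^T F^T F q
Q = -np.matmul(F.transpose(), F)

# Brute-force the QUBO over all nonzero binary vectors q
best_q, best_energy = None, np.inf
for bits in itertools.product([0, 1], repeat=F.shape[1]):
    q = np.array(bits)
    if not q.any():
        continue
    energy = q @ Q @ q
    if energy < best_energy:
        best_q, best_energy = q, energy

# Recover the quantized first principal component as in the text:
# v1 = q / ||q||, sigma1 = sqrt(v1^T F^T F v1), u1 = F v1 / sigma1
v1 = best_q / np.linalg.norm(best_q)
sigma1 = np.sqrt(v1 @ (-Q) @ v1)
u1 = F @ v1 / sigma1

# Compare with the exact first left singular vector; the overlap is
# below 1.0 because q is restricted to binary (quantized) directions
u_exact = np.linalg.svd(F, full_matrices=False)[0][:, 0]
print("overlap:", abs(u1 @ u_exact))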
Medicare Prescription Data
We implemented this approach using a publicly available dataset on opioid prescriptions in the United States. The dataset can be found at https://www.cms.gov/data-research/statistics-trends-and-reports/medicare-provider-utilization-payment-data/part-d-prescriber
Clean data
We start by cleaning the dataset: filling in missing categorical values, imputing missing beneficiary counts with specialty-level means, coercing continuous variables to floats, and dropping records with invalid state codes.
In [1]:
import pandas as pd

# Input
INP_FILE = "Medicare_Provider_Utilization_and_Payment_Data__Part_D_Prescriber_Summary_Table_CY2014__50001-NNN__ANON.csv"
OUT_FILE = "cleaned_medicare_data.csv"

# Continuous variables
CON_VARS = [
    "total_claim_count", "total_30_day_fill_count", "total_drug_cost",
    "total_day_supply", "bene_count", "total_claim_count_ge65",
    "total_30_day_fill_count_ge65", "total_drug_cost_ge65",
    "total_day_supply_ge65", "bene_count_ge65", "brand_claim_count",
    "brand_drug_cost", "generic_claim_count", "generic_drug_cost",
    "other_claim_count", "other_drug_cost", "mapd_claim_count",
    "mapd_drug_cost", "pdp_claim_count", "pdp_drug_cost", "lis_claim_count",
    "lis_drug_cost", "nonlis_claim_count", "nonlis_drug_cost",
    "opioid_claim_count", "opioid_drug_cost", "opioid_day_supply",
    "opioid_bene_count", "opioid_prescriber_rate", "antibiotic_claim_count",
    "antibiotic_drug_cost", "antibiotic_bene_count", "hrm_claim_count_ge65",
    "hrm_drug_cost_ge65", "hrm_bene_count_ge65", "antipsych_claim_count_ge65",
    "antipsych_drug_cost_ge65", "antipsych_bene_count_ge65",
    "average_age_of_beneficiaries", "beneficiary_age_less_65_count",
    "beneficiary_age_65_74_count", "beneficiary_age_75_84_count",
    "beneficiary_age_greater_84_count", "beneficiary_female_count",
    "beneficiary_male_count", "beneficiary_race_white_count",
    "beneficiary_race_black_count", "beneficiary_race_asian_pi_count",
    "beneficiary_race_hispanic_count", "beneficiary_race_nat_ind_count",
    "beneficiary_race_other_count", "beneficiary_nondual_count",
    "beneficiary_dual_count", "beneficiary_average_risk_score",
]

# Valid values of some categorical variables
VALID_PROVIDER_MI = [
    "A", "M", "J", "L", "R", "S", "E", "D", "C", "B", "K", "P", "W",
    "H", "T", "G", "F", "N", "V", "I", "O", "Y", "Z", "U", "Q", "X",
]
VALID_GEN = ["F", "M", "Other", "Unknown"]
VALID_ENTITIES = ["I", "O"]
VALID_DESC_FLAGS = ["S", "T"]
VALID_ENROLLS = ["E", "N", "O"]

# Some utilities
def convert_to_float(x):
    try:
        return float(x)
    except:
        return None


def convert_to_int(x):
    try:
        return int(float(x))
    except:
        return None


# Read data
df = pd.read_csv(INP_FILE, on_bad_lines="skip", low_memory=False)

# Clean categorical variables
df["nppes_provider_mi"] = df["nppes_provider_mi"].fillna("Unknown")
df["nppes_provider_mi"] = df["nppes_provider_mi"].apply(
    lambda x: x if x in VALID_PROVIDER_MI else "Unknown"
)

df["nppes_credentials"] = df["nppes_credentials"].fillna("Unknown")
df["nppes_credentials"] = df["nppes_credentials"].apply(
    lambda x: str(x).replace(".", "")
)
cred_hash = {
    "MEDICAL DOCTOR": "MD",
    "NURSE PRACTITIONER": "NP",
}
df["nppes_credentials"] = df["nppes_credentials"].apply(
    lambda x: cred_hash[x] if x in cred_hash else x,
)

df["nppes_provider_gender"] = df["nppes_provider_gender"].fillna("Unknown")
df["nppes_provider_gender"] = df["nppes_provider_gender"].apply(
    lambda x: x if x in VALID_GEN else "Other",
)

df["nppes_entity_code"] = df["nppes_entity_code"].apply(
    lambda x: x if x in VALID_ENTITIES else "Unknown",
)

df["nppes_provider_zip5"] = df["nppes_provider_zip5"].fillna("Unknown")

df["nppes_provider_country"] = df["nppes_provider_country"].apply(
    lambda x: "US" if x == "US" else "Other",
)

df["description_flag"] = df["description_flag"].apply(
    lambda x: x if x in VALID_DESC_FLAGS else "Unknown",
)

df["medicare_prvdr_enroll_status"] = df["medicare_prvdr_enroll_status"].apply(
    lambda x: x if x in VALID_ENROLLS else "Unknown",
)

# Treat missing beneficiary count as it cannot be zero
df["bene_count"] = df["bene_count"].apply(convert_to_int).fillna(-1)
tmp_df = df.groupby(
    "specialty_description", as_index=False,
)["bene_count"].mean()
bene_count_hash = dict(
    zip(
        tmp_df["specialty_description"],
        tmp_df["bene_count"],
    )
)
df["bene_count"] = df.apply(
    lambda x: x["bene_count"]
    if x["bene_count"] > 0
    else bene_count_hash[x["specialty_description"]],
    axis=1,
)

# Treat continuous variables
for item in CON_VARS:
    df[item] = df[item].apply(convert_to_float).fillna(0.0)

# Filter out invalid states
df = df[~df["nppes_provider_state"].isin(["XX", "E", "N", "S"])]

# Output
df.to_csv(OUT_FILE, index=False)
Generate features
We then generate features. Each categorical feature is encoded using the average value of a few important opioid-related variables within its categories; a toy illustration of this encoding follows the cell below.
In [2]:
import pandas as pd

# Input
INP_FILE = "cleaned_medicare_data.csv"
OUT_FILE = "medicare_features.csv"

# Set some parameters
CAT_VARS = [
    "nppes_provider_mi",
    # "nppes_credentials",  # This is rather messy, so ignoring it.
    "nppes_provider_gender",
    "nppes_entity_code",
    "nppes_provider_city",
    "nppes_provider_zip5",
    # "nppes_provider_country",  # Almost all cases are US
    "specialty_description",
    "medicare_prvdr_enroll_status",
    "nppes_provider_state",
]
CON_VARS = [
    "total_claim_count", "total_30_day_fill_count", "total_drug_cost",
    "total_day_supply", "bene_count", "total_claim_count_ge65",
    "total_30_day_fill_count_ge65", "total_drug_cost_ge65",
    "total_day_supply_ge65", "bene_count_ge65", "brand_claim_count",
    "brand_drug_cost", "generic_claim_count", "generic_drug_cost",
    "other_claim_count", "other_drug_cost", "mapd_claim_count",
    "mapd_drug_cost", "pdp_claim_count", "pdp_drug_cost", "lis_claim_count",
    "lis_drug_cost", "nonlis_claim_count", "nonlis_drug_cost",
    "opioid_claim_count", "opioid_drug_cost", "opioid_day_supply",
    "opioid_bene_count", "antibiotic_claim_count", "antibiotic_drug_cost",
    "antibiotic_bene_count", "hrm_claim_count_ge65", "hrm_drug_cost_ge65",
    "hrm_bene_count_ge65", "antipsych_claim_count_ge65",
    "antipsych_drug_cost_ge65", "antipsych_bene_count_ge65",
    "average_age_of_beneficiaries", "beneficiary_age_less_65_count",
    "beneficiary_age_65_74_count", "beneficiary_age_75_84_count",
    "beneficiary_age_greater_84_count", "beneficiary_female_count",
    "beneficiary_male_count", "beneficiary_race_white_count",
    "beneficiary_race_black_count", "beneficiary_race_asian_pi_count",
    "beneficiary_race_hispanic_count", "beneficiary_race_nat_ind_count",
    "beneficiary_race_other_count", "beneficiary_nondual_count",
    "beneficiary_dual_count", "beneficiary_average_risk_score",
]

# Read and clean data
df = pd.read_csv(INP_FILE, low_memory=False)

# Embed categorical features
embedded_cat_features = []
for item in CAT_VARS:
    tmp_df = df.groupby(item, as_index=False).agg(
        {
            "opioid_claim_count": "mean",
            "opioid_drug_cost": "mean",
            "opioid_day_supply": "mean",
            "opioid_bene_count": "mean",
            "opioid_prescriber_rate": "mean",
        }
    ).rename(
        columns={
            "opioid_claim_count": "%s_opioid_claim_count" % item,
            "opioid_drug_cost": "%s_opioid_drug_cost" % item,
            "opioid_day_supply": "%s_opioid_day_supply" % item,
            "opioid_bene_count": "%s_opioid_bene_count" % item,
            "opioid_prescriber_rate": "%s_opioid_prescriber_rate" % item,
        }
    )
    df = df.merge(tmp_df, how="left", on=item)
    embedded_cat_features += [
        "%s_opioid_claim_count" % item,
        "%s_opioid_drug_cost" % item,
        "%s_opioid_day_supply" % item,
        "%s_opioid_bene_count" % item,
        "%s_opioid_prescriber_rate" % item,
    ]

# Drop unembedded categorical variables and some others
df = df[["npi"] + CON_VARS + embedded_cat_features]

# Write features file
df.to_csv(OUT_FILE, index=False)
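As a toy illustration of the mean-based encoding used above (with made-up column names, not from the Medicare data):

import pandas as pd

# Toy frame: "state" is categorical, "claims" is a numeric variable
toy = pd.DataFrame(
    {
        "state": ["CA", "CA", "NY", "NY", "TX"],
        "claims": [10.0, 20.0, 30.0, 50.0, 40.0],
    }
)

# Encode each state by the mean of "claims" within that state
enc = toy.groupby("state", as_index=False)["claims"].mean()
enc = enc.rename(columns={"claims": "state_mean_claims"})

# Merge the encoded value back; the raw "state" column can then be dropped
print(toy.merge(enc, how="left", on="state"))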
Dimensionality Reduction
Once the features are generated, we can implement the above-mentioned SVD algorithm. We start by importing some libraries, setting some parameters, and loading the features into a Pandas dataframe.
In [3]:
# Import libs
import sys
import os
import time

import numpy as np
import pandas as pd
from qci_client import QciClient

# Define some parameters
FEATURES_FILE = "medicare_features.csv"
REDUCED_DIM = 3

# Read features
df = pd.read_csv(FEATURES_FILE, low_memory=False)
We now collect the feature names, excluding the npi identifier column, and get the total count of features in the dataset,
In [4]:
feature_names = list(set(df.columns) - {"npi"})
orig_dim = len(feature_names)

print(
    "Original dimension is %d; reduced dimension will be %d"
    % (orig_dim, REDUCED_DIM)
)
Out [ ]:
Original dimension is 93; reduced dimension will be 3
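This count is consistent with the feature generation step: 53 continuous variables plus 5 embedded opioid statistics for each of the 8 encoded categorical variables, that is, 53 + 40 = 93 features.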
We now define a function that computes the first principal component of a feature matrix by solving a QUBO on Dirac-1,
In [9]:
def get_first_principal_comp(F):
    """Get the first principal component of F by solving a QUBO on Dirac-1."""
    qubo = -np.matmul(F.transpose(), F)

    # Make sure matrix is symmetric to machine precision
    qubo = 0.5 * (qubo + qubo.transpose())

    # Create json objects; orig_dim is the global feature count set above
    qubo_json = {
        "file_name": "qubo_tutorial.json",
        "file_config": {
            "qubo": {"data": qubo, "num_variables": orig_dim},
        },
    }

    # Solve the optimization problem
    qci = QciClient()

    response_json = qci.upload_file(qubo_json)
    qubo_file_id = response_json["file_id"]

    # Set up job json
    job_params = {
        "sampler_type": "dirac-1",
        "alpha": 1.0,
        "nsamples": 20,
    }
    job_json = qci.build_job_body(
        job_type="sample-qubo",
        job_params=job_params,
        qubo_file_id=qubo_file_id,
        job_name="tutorial_eqc1",
        job_tags=["tutorial_eqc1"],
    )
    print(job_json)

    # Run the job
    job_response_json = qci.process_job(
        job_body=job_json,
        job_type="sample-qubo",
    )
    print(job_response_json)

    results = list(job_response_json["results"]["file_config"].values())[0]
    energies = results["energies"]
    samples = results["solutions"]

    print("Energies:", energies)

    # Normalize the best binary solution to get v0, the eigenvector of
    # F^T F with the largest eigenvalue
    q = np.array(samples[0])

    assert len(q) == orig_dim, "Inconsistent solution size!"

    fct = np.linalg.norm(q)
    if fct > 0:
        fct = 1.0 / fct
    v0 = fct * q
    v0 = v0.reshape((v0.shape[0], 1))

    # lambda0 = v0^T (F^T F) v0 is the largest eigenvalue estimate
    lambda0 = np.matmul(
        np.matmul(v0.transpose(), -qubo),
        v0,
    )[0][0]

    assert lambda0 >= 0, "Unexpected negative eigenvalue!"

    # u0 = F v0 / sqrt(lambda0) is the first principal component
    fct = np.sqrt(lambda0)
    if fct > 0:
        fct = 1.0 / fct
    u0 = fct * np.matmul(F, v0)
    u0 = u0.reshape(-1)

    fct = np.linalg.norm(u0)
    if fct > 0:
        fct = 1.0 / fct
    u0 = fct * u0

    return u0
One can get the first REDUCED_DIM principal components by applying the above function repeatedly, removing each newly found component from the feature matrix before the next iteration,
In [10]:
F = np.array(df[feature_names])

U = []
for i in range(min(REDUCED_DIM, F.shape[1])):
    u = get_first_principal_comp(F)
    U.append(u)

    # Deflate F: remove the component along u from each column
    u = u.reshape((u.shape[0], 1))
    F = F - np.matmul(
        u,
        np.matmul(u.transpose(), F),
    )
Out [ ]:
{'job_submission': {'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '65f336fe38d25ec78cae745e'}}, 'device_config': {'dirac-1': {'num_samples': 20}}, 'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1']}} Dirac allocation balance = 0 s (unmetered) Job submitted job_id='65f336ff9e39850da64a4147'-: 2024/03/14 10:42:23 RUNNING: 2024/03/14 10:42:24 COMPLETED: 2024/03/14 10:47:05 Dirac allocation balance = 0 s (unmetered) {'job_info': {'job_id': '65f336ff9e39850da64a4147', 'job_submission': {'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1'], 'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '65f336fe38d25ec78cae745e'}}, 'device_config': {'dirac-1': {'num_samples': 20}}}, 'job_status': {'submitted_at_rfc3339nano': '2024-03-14T17:42:23.364Z', 'queued_at_rfc3339nano': '2024-03-14T17:42:23.365Z', 'running_at_rfc3339nano': '2024-03-14T17:42:24.401Z', 'completed_at_rfc3339nano': '2024-03-14T17:47:04.693Z'}, 'job_result': {'file_id': '65f3381838d25ec78cae7465', 'device_usage_s': 39}, 'details': {'status': 'COMPLETED'}}, 'results': {'file_id': '65f3381838d25ec78cae7465', 'num_parts': 1, 'num_bytes': 948, 'file_name': 'tutorial_eqc1', 'file_config': {'quadratic_unconstrained_binary_optimization_results': {'counts': [20], 'energies': [-2802848653247512600], 'solutions': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}}}} Energies: [-2802848653247512600] {'job_submission': {'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '65f3381b38d25ec78cae7467'}}, 'device_config': {'dirac-1': {'num_samples': 20}}, 'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1']}} Dirac allocation balance = 0 s (unmetered) Job submitted job_id='65f3381c9e39850da64a4149'-: 2024/03/14 10:47:08 RUNNING: 2024/03/14 10:47:09 COMPLETED: 2024/03/14 10:51:50 Dirac allocation balance = 0 s (unmetered) {'job_info': {'job_id': '65f3381c9e39850da64a4149', 'job_submission': {'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1'], 'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '65f3381b38d25ec78cae7467'}}, 'device_config': {'dirac-1': {'num_samples': 20}}}, 'job_status': {'submitted_at_rfc3339nano': '2024-03-14T17:47:08.581Z', 'queued_at_rfc3339nano': '2024-03-14T17:47:08.582Z', 'running_at_rfc3339nano': '2024-03-14T17:47:08.762Z', 'completed_at_rfc3339nano': '2024-03-14T17:51:48.849Z'}, 'job_result': {'file_id': '65f3393438d25ec78cae746e', 'device_usage_s': 39}, 'details': {'status': 'COMPLETED'}}, 'results': {'file_id': '65f3393438d25ec78cae746e', 'num_parts': 1, 'num_bytes': 1708, 'file_name': 'tutorial_eqc1', 'file_config': {'quadratic_unconstrained_binary_optimization_results': {'counts': [12, 8], 'energies': [-60631112719794180, -60631112719794180], 'solutions': [[1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}}}} Energies: [-60631112719794180, -60631112719794180] {'job_submission': {'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '65f3393838d25ec78cae7470'}}, 'device_config': {'dirac-1': {'num_samples': 20}}, 'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1']}} Dirac allocation balance = 0 s (unmetered) Job submitted job_id='65f339399e39850da64a414b'-: 2024/03/14 10:51:53 QUEUED: 2024/03/14 10:51:54 RUNNING: 2024/03/14 10:56:33 COMPLETED: 2024/03/14 11:01:14 Dirac allocation balance = 0 s (unmetered) {'job_info': {'job_id': '65f339399e39850da64a414b', 'job_submission': {'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1'], 'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '65f3393838d25ec78cae7470'}}, 'device_config': {'dirac-1': {'num_samples': 20}}}, 'job_status': {'submitted_at_rfc3339nano': '2024-03-14T17:51:53.083Z', 'queued_at_rfc3339nano': '2024-03-14T17:51:53.083Z', 'running_at_rfc3339nano': '2024-03-14T17:56:33.274Z', 'completed_at_rfc3339nano': '2024-03-14T18:01:13.942Z'}, 'job_result': {'file_id': '65f33b6938d25ec78cae7472', 'device_usage_s': 39}, 'details': {'status': 'COMPLETED'}}, 'results': {'file_id': '65f33b6938d25ec78cae7472', 'num_parts': 1, 'num_bytes': 3988, 'file_name': 'tutorial_eqc1', 'file_config': {'quadratic_unconstrained_binary_optimization_results': {'counts': [11, 5, 2, 1, 1], 'energies': [-31525704197734390, -31525704197734390, -31525699902767096, -31525699902767096, -31525652658126840], 'solutions': [[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]]}}}} Energies: [-31525704197734390, -31525704197734390, -31525699902767096, -31525699902767096, -31525652658126840]
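As a quick sanity check (not part of the original tutorial), we can verify that the extracted components are close to orthonormal; the deflation step removes the component along each u from every column of F, so later components should be orthogonal to earlier ones up to the quality of the QUBO solutions.

# Gram matrix of the extracted components; it should be close to the
# 3 x 3 identity if the components are orthonormal
G = np.matmul(np.array(U), np.array(U).transpose())
print(np.round(G, 3))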
We can save the results and print the shape of U,
In [11]:
U = np.array(U)
print(U.shape)

np.save("U.npy", U)
Out [ ]:
(3, 1022499)
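Each of the three rows of U is one principal component: a unit vector over the 1,022,499 samples (one per provider) in the dataset.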
For comparison, we can compute the same number of components with a classical PCA from scikit-learn, L2-normalizing the resulting component scores,
In [12]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

F = np.array(df[feature_names])

pca_classical = PCA(n_components=REDUCED_DIM)
U_classical = pca_classical.fit_transform(F)
U_classical = normalize(U_classical, axis=0, norm="l2")
and compare the results to those of the above-mentioned approach,
In [13]:
U = np.load("U.npy")
# U = U_classical.transpose()

print(U.shape)
print(U_classical.shape)

print(abs(np.diag(np.matmul(U, U_classical))))
Out [ ]:
(3, 1022499) (1022499, 3) [0.9101745 0.93294887 0.82830476]
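The diagonal entries are the inner products between corresponding QUBO-based and classical components; the absolute value removes the sign ambiguity inherent in singular vectors. Overlaps of roughly 0.91, 0.93, and 0.83 indicate that the binary-quantized components are well aligned with their classical counterparts, consistent with the direction-quantization argument made in the methodology section.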