Dimensionality reduction on Dirac
Device: Dirac-1
Introduction
In machine learning problems, we often have to start with a large number of features. A dimensionality reduction approach can be used to reduce the feature count, especially when the features feed an unsupervised algorithm such as clustering. In what follows, we present a tutorial on using QCi's technology to implement a QUBO-based Singular Value Decomposition (SVD) algorithm.
Methodology
Principal Component Analysis (PCA) is a dimensionality reduction technique which is based on a Singular Value Decomposition (SVD) of the data matrix. Let us have a dataset with $N$ samples and $d$ features, represented by an $N \times d$ matrix $F$. We can decompose $F$ as,

$$F = U \Sigma V^T$$

where $U$ is an $N \times d$ matrix with orthonormal columns, $\Sigma$ is a $d \times d$ diagonal matrix, and $V$ is a $d \times d$ orthogonal matrix with orthonormal columns and rows. Note that the columns of $U$ form an orthonormal basis for the vector space spanned by the columns of $F$. It can be shown that the larger the $i$-th diagonal element $\sigma_i$ of $\Sigma$, the more the contribution of the basis element $u_i$, the $i$-th column of $U$. Thus, one can rank the basis elements based on the value of $\sigma_i$ and choose a subset of the most important basis elements; this yields a lower-dimensional representation of the data matrix $F$. Assuming that $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_d$, $u_1$ is said to be the first principal component of the original matrix $F$.

It can be shown that

$$F^T F \, v_i = \lambda_i v_i, \qquad \sigma_i = \sqrt{\lambda_i}$$

where $\lambda_i$ and $v_i$ denote the $i$-th eigenvalue and normalized eigenvector of the symmetric matrix $F^T F$. Note too that $v_i$ is in fact the $i$-th column of matrix $V$ and that $\sigma_i$ is the $i$-th diagonal element of matrix $\Sigma$. Finding the first principal component of $F$ is then equivalent to finding the eigenvector of $F^T F$ corresponding to its maximum eigenvalue, that is, the eigenvector of $-F^T F$ corresponding to its minimum eigenvalue. An estimation of the first principal component of $F$ can then be obtained by solving a QUBO as,

$$\min_{q} \; -q^T F^T F \, q$$

where $q$ is a $d$-dimensional binary vector. The first principal component of $F$ is,

$$u_1 = \frac{1}{\sigma_1} F v_1$$

where $v_1 = q / \|q\|$ and $\sigma_1 = \sqrt{v_1^T F^T F \, v_1}$. One can then replace each column of $F$ with its component orthogonal to $u_1$, and repeat the above-mentioned process to get the second principal component $u_2$. This can be repeated to get any desired number of principal components of $F$, say $d^\prime$, where $d^\prime < d$. To remove a principal component $u$ from a feature matrix $F$, we have,

$$F \leftarrow F - u u^T F$$

Note that this is an estimation of a full PCA, as the vector $q$ in the above QUBO is binary. This estimation is equivalent to quantizing the directions in feature space such that a potentially large but finite set of directions can be chosen by PCA; these are the directions that correspond to $d$-dimensional binary vectors $q$.
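To make the recipe concrete before applying it to real data, here is a minimal, self-contained sketch that runs the same algorithm on a toy matrix, brute-forcing the QUBO classically in place of Dirac-1. Everything in it (the toy matrix, the brute-force loop, the overlap check) is an illustrative assumption, not part of the tutorial's workflow.

import itertools

import numpy as np

rng = np.random.default_rng(42)
F = rng.normal(size=(50, 6))  # toy data: 50 samples, 6 features

# QUBO matrix: minimizing q^T Q q maximizes q^T F^T F q
Q = -np.matmul(F.transpose(), F)

# Brute-force the QUBO over all nonzero binary vectors q
best_q, best_energy = None, np.inf
for bits in itertools.product([0, 1], repeat=F.shape[1]):
    q = np.array(bits)
    if not q.any():
        continue
    energy = q @ Q @ q
    if energy < best_energy:
        best_q, best_energy = q, energy

# Recover the quantized first principal component as in the text:
# v1 = q / ||q||, sigma1 = sqrt(v1^T F^T F v1), u1 = F v1 / sigma1
v1 = best_q / np.linalg.norm(best_q)
sigma1 = np.sqrt(v1 @ (-Q) @ v1)
u1 = F @ v1 / sigma1

# Compare with the exact first left singular vector; the overlap is
# below 1.0 because q is restricted to binary (quantized) directions
u_exact = np.linalg.svd(F, full_matrices=False)[0][:, 0]
print("overlap:", abs(u1 @ u_exact))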
Medicare Prescription Data
We implemented this approach using a publicly available dataset on opioid prescriptions in the United States. The dataset can be found at https://www.cms.gov/data-research/statistics-trends-and-reports/medicare-provider-utilization-payment-data/part-d-prescriber
Clean data
We start by cleaning the dataset: filling in missing categorical values, imputing missing beneficiary counts with specialty-level means, coercing continuous variables to floats, and dropping records with invalid state codes.
In [1]:
import pandas as pd

# Input
INP_FILE = "Medicare_Provider_Utilization_and_Payment_Data__Part_D_Prescriber_Summary_Table_CY2014__50001-NNN__ANON.csv"
OUT_FILE = "cleaned_medicare_data.csv"

# Continuous variables
CON_VARS = [
    "total_claim_count", "total_30_day_fill_count", "total_drug_cost",
    "total_day_supply", "bene_count", "total_claim_count_ge65",
    "total_30_day_fill_count_ge65", "total_drug_cost_ge65",
    "total_day_supply_ge65", "bene_count_ge65", "brand_claim_count",
    "brand_drug_cost", "generic_claim_count", "generic_drug_cost",
    "other_claim_count", "other_drug_cost", "mapd_claim_count",
    "mapd_drug_cost", "pdp_claim_count", "pdp_drug_cost", "lis_claim_count",
    "lis_drug_cost", "nonlis_claim_count", "nonlis_drug_cost",
    "opioid_claim_count", "opioid_drug_cost", "opioid_day_supply",
    "opioid_bene_count", "opioid_prescriber_rate", "antibiotic_claim_count",
    "antibiotic_drug_cost", "antibiotic_bene_count", "hrm_claim_count_ge65",
    "hrm_drug_cost_ge65", "hrm_bene_count_ge65", "antipsych_claim_count_ge65",
    "antipsych_drug_cost_ge65", "antipsych_bene_count_ge65",
    "average_age_of_beneficiaries", "beneficiary_age_less_65_count",
    "beneficiary_age_65_74_count", "beneficiary_age_75_84_count",
    "beneficiary_age_greater_84_count", "beneficiary_female_count",
    "beneficiary_male_count", "beneficiary_race_white_count",
    "beneficiary_race_black_count", "beneficiary_race_asian_pi_count",
    "beneficiary_race_hispanic_count", "beneficiary_race_nat_ind_count",
    "beneficiary_race_other_count", "beneficiary_nondual_count",
    "beneficiary_dual_count", "beneficiary_average_risk_score",
]

# Valid values of some categorical variables
VALID_PROVIDER_MI = [
    "A", "M", "J", "L", "R", "S", "E", "D", "C", "B", "K", "P", "W",
    "H", "T", "G", "F", "N", "V", "I", "O", "Y", "Z", "U", "Q", "X",
]
VALID_GEN = ["F", "M", "Other", "Unknown"]
VALID_ENTITIES = ["I", "O"]
VALID_DESC_FLAGS = ["S", "T"]
VALID_ENROLLS = ["E", "N", "O"]

# Some utilities
def convert_to_float(x):
    try:
        return float(x)
    except:
        return None


def convert_to_int(x):
    try:
        return int(float(x))
    except:
        return None


# Read data
df = pd.read_csv(INP_FILE, on_bad_lines="skip", low_memory=False)

# Clean categorical variables
df["nppes_provider_mi"] = df["nppes_provider_mi"].fillna("Unknown")
df["nppes_provider_mi"] = df["nppes_provider_mi"].apply(
    lambda x: x if x in VALID_PROVIDER_MI else "Unknown"
)

df["nppes_credentials"] = df["nppes_credentials"].fillna("Unknown")
df["nppes_credentials"] = df["nppes_credentials"].apply(
    lambda x: str(x).replace(".", "")
)
cred_hash = {
    "MEDICAL DOCTOR": "MD",
    "NURSE PRACTITIONER": "NP",
}
df["nppes_credentials"] = df["nppes_credentials"].apply(
    lambda x: cred_hash[x] if x in cred_hash else x,
)

df["nppes_provider_gender"] = df["nppes_provider_gender"].fillna("Unknown")
df["nppes_provider_gender"] = df["nppes_provider_gender"].apply(
    lambda x: x if x in VALID_GEN else "Other",
)

df["nppes_entity_code"] = df["nppes_entity_code"].apply(
    lambda x: x if x in VALID_ENTITIES else "Unknown",
)

df["nppes_provider_zip5"] = df["nppes_provider_zip5"].fillna("Unknown")

df["nppes_provider_country"] = df["nppes_provider_country"].apply(
    lambda x: "US" if x == "US" else "Other",
)

df["description_flag"] = df["description_flag"].apply(
    lambda x: x if x in VALID_DESC_FLAGS else "Unknown",
)

df["medicare_prvdr_enroll_status"] = df["medicare_prvdr_enroll_status"].apply(
    lambda x: x if x in VALID_ENROLLS else "Unknown",
)

# Treat missing beneficiary count as it cannot be zero
df["bene_count"] = df["bene_count"].apply(convert_to_int).fillna(-1)
tmp_df = df.groupby(
    "specialty_description", as_index=False,
)["bene_count"].mean()
bene_count_hash = dict(
    zip(
        tmp_df["specialty_description"],
        tmp_df["bene_count"],
    )
)
df["bene_count"] = df.apply(
    lambda x: x["bene_count"]
    if x["bene_count"] > 0
    else bene_count_hash[x["specialty_description"]],
    axis=1,
)

# Treat continuous variables
for item in CON_VARS:
    df[item] = df[item].apply(convert_to_float).fillna(0.0)

# Filter out invalid states
df = df[~df["nppes_provider_state"].isin(["XX", "E", "N", "S"])]

# Output
df.to_csv(OUT_FILE, index=False)
Generate features
We then generate features. Each categorical feature is encoded using the average value of a few important opioid-related variables within its categories; a toy illustration of this encoding follows the cell below.
In [2]:
import pandas as pd

# Input
INP_FILE = "cleaned_medicare_data.csv"
OUT_FILE = "medicare_features.csv"

# Set some parameters
CAT_VARS = [
    "nppes_provider_mi",
    # "nppes_credentials",  # This is rather messy, so ignoring it.
    "nppes_provider_gender",
    "nppes_entity_code",
    "nppes_provider_city",
    "nppes_provider_zip5",
    # "nppes_provider_country",  # Almost all cases are US
    "specialty_description",
    "medicare_prvdr_enroll_status",
    "nppes_provider_state",
]
CON_VARS = [
    "total_claim_count", "total_30_day_fill_count", "total_drug_cost",
    "total_day_supply", "bene_count", "total_claim_count_ge65",
    "total_30_day_fill_count_ge65", "total_drug_cost_ge65",
    "total_day_supply_ge65", "bene_count_ge65", "brand_claim_count",
    "brand_drug_cost", "generic_claim_count", "generic_drug_cost",
    "other_claim_count", "other_drug_cost", "mapd_claim_count",
    "mapd_drug_cost", "pdp_claim_count", "pdp_drug_cost", "lis_claim_count",
    "lis_drug_cost", "nonlis_claim_count", "nonlis_drug_cost",
    "opioid_claim_count", "opioid_drug_cost", "opioid_day_supply",
    "opioid_bene_count", "antibiotic_claim_count", "antibiotic_drug_cost",
    "antibiotic_bene_count", "hrm_claim_count_ge65", "hrm_drug_cost_ge65",
    "hrm_bene_count_ge65", "antipsych_claim_count_ge65",
    "antipsych_drug_cost_ge65", "antipsych_bene_count_ge65",
    "average_age_of_beneficiaries", "beneficiary_age_less_65_count",
    "beneficiary_age_65_74_count", "beneficiary_age_75_84_count",
    "beneficiary_age_greater_84_count", "beneficiary_female_count",
    "beneficiary_male_count", "beneficiary_race_white_count",
    "beneficiary_race_black_count", "beneficiary_race_asian_pi_count",
    "beneficiary_race_hispanic_count", "beneficiary_race_nat_ind_count",
    "beneficiary_race_other_count", "beneficiary_nondual_count",
    "beneficiary_dual_count", "beneficiary_average_risk_score",
]

# Read and clean data
df = pd.read_csv(INP_FILE, low_memory=False)

# Embed categorical features
embedded_cat_features = []
for item in CAT_VARS:
    tmp_df = df.groupby(item, as_index=False).agg(
        {
            "opioid_claim_count": "mean",
            "opioid_drug_cost": "mean",
            "opioid_day_supply": "mean",
            "opioid_bene_count": "mean",
            "opioid_prescriber_rate": "mean",
        }
    ).rename(
        columns={
            "opioid_claim_count": "%s_opioid_claim_count" % item,
            "opioid_drug_cost": "%s_opioid_drug_cost" % item,
            "opioid_day_supply": "%s_opioid_day_supply" % item,
            "opioid_bene_count": "%s_opioid_bene_count" % item,
            "opioid_prescriber_rate": "%s_opioid_prescriber_rate" % item,
        }
    )
    df = df.merge(tmp_df, how="left", on=item)
    embedded_cat_features += [
        "%s_opioid_claim_count" % item,
        "%s_opioid_drug_cost" % item,
        "%s_opioid_day_supply" % item,
        "%s_opioid_bene_count" % item,
        "%s_opioid_prescriber_rate" % item,
    ]

# Drop unembedded categorical variables and some others
df = df[["npi"] + CON_VARS + embedded_cat_features]

# Write features file
df.to_csv(OUT_FILE, index=False)
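As a toy illustration of the mean-based encoding used above (with made-up column names, not from the Medicare data):

import pandas as pd

# Toy frame: "state" is categorical, "claims" is a numeric variable
toy = pd.DataFrame(
    {
        "state": ["CA", "CA", "NY", "NY", "TX"],
        "claims": [10.0, 20.0, 30.0, 50.0, 40.0],
    }
)

# Encode each state by the mean of "claims" within that state
enc = toy.groupby("state", as_index=False)["claims"].mean()
enc = enc.rename(columns={"claims": "state_mean_claims"})

# Merge the encoded value back; the raw "state" column can then be dropped
print(toy.merge(enc, how="left", on="state"))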
Dimensionality Reduction
Once the features are generated, we can implement the above-mentioned SVD algorithm. We start by importing some libraries, setting some parameters, and loading the features into a Pandas dataframe.
In [3]:
# Import libs
import sys
import os
import time

import numpy as np
import pandas as pd
from qci_client import QciClient

# Define some parameters
FEATURES_FILE = "medicare_features.csv"
REDUCED_DIM = 3

# Read features
df = pd.read_csv(FEATURES_FILE, low_memory=False)
We now collect the feature names, excluding the npi identifier column, and get the total count of features in the dataset,
In [4]:
feature_names = list(set(df.columns) - {"npi"})
orig_dim = len(feature_names)

print(
    "Original dimension is %d; reduced dimension will be %d"
    % (orig_dim, REDUCED_DIM)
)
Out [ ]:
Original dimension is 93; reduced dimension will be 3
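This count is consistent with the feature generation step: 53 continuous variables plus 5 embedded opioid statistics for each of the 8 encoded categorical variables, that is, 53 + 40 = 93 features.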
We now define a function that computes the first principal component of a feature matrix by solving a QUBO on Dirac-1,
In [9]:
def get_first_principal_comp(F):
    """Get the first principal component of F by solving a QUBO on Dirac-1."""
    qubo = -np.matmul(F.transpose(), F)

    # Make sure matrix is symmetric to machine precision
    qubo = 0.5 * (qubo + qubo.transpose())

    # Create json objects; orig_dim is the global feature count set above
    qubo_json = {
        "file_name": "qubo_tutorial.json",
        "file_config": {
            "qubo": {"data": qubo, "num_variables": orig_dim},
        },
    }

    # Solve the optimization problem
    qci = QciClient()

    response_json = qci.upload_file(qubo_json)
    qubo_file_id = response_json["file_id"]

    # Set up job json
    job_params = {
        "sampler_type": "dirac-1",
        "alpha": 1.0,
        "nsamples": 20,
    }
    job_json = qci.build_job_body(
        job_type="sample-qubo",
        job_params=job_params,
        qubo_file_id=qubo_file_id,
        job_name="tutorial_eqc1",
        job_tags=["tutorial_eqc1"],
    )
    print(job_json)

    # Run the job
    job_response_json = qci.process_job(
        job_body=job_json,
        job_type="sample-qubo",
    )
    print(job_response_json)

    results = list(job_response_json["results"]["file_config"].values())[0]
    energies = results["energies"]
    samples = results["solutions"]

    print("Energies:", energies)

    # Normalize the best binary solution to get v0, the eigenvector of
    # F^T F with the largest eigenvalue
    q = np.array(samples[0])

    assert len(q) == orig_dim, "Inconsistent solution size!"

    fct = np.linalg.norm(q)
    if fct > 0:
        fct = 1.0 / fct
    v0 = fct * q
    v0 = v0.reshape((v0.shape[0], 1))

    # lambda0 = v0^T (F^T F) v0 is the largest eigenvalue estimate
    lambda0 = np.matmul(
        np.matmul(v0.transpose(), -qubo),
        v0,
    )[0][0]

    assert lambda0 >= 0, "Unexpected negative eigenvalue!"

    # u0 = F v0 / sqrt(lambda0) is the first principal component
    fct = np.sqrt(lambda0)
    if fct > 0:
        fct = 1.0 / fct
    u0 = fct * np.matmul(F, v0)
    u0 = u0.reshape(-1)

    fct = np.linalg.norm(u0)
    if fct > 0:
        fct = 1.0 / fct
    u0 = fct * u0

    return u0
One can get the first REDUCED_DIM principal components by applying the above function repeatedly, removing each newly found component from the feature matrix before the next iteration,
In [10]:
F = np.array(df[feature_names])

U = []
for i in range(min(REDUCED_DIM, F.shape[1])):
    u = get_first_principal_comp(F)
    U.append(u)

    # Deflate F: remove the component along u from each column
    u = u.reshape((u.shape[0], 1))
    F = F - np.matmul(
        u,
        np.matmul(u.transpose(), F),
    )
Out [ ]:
{'job_submission': {'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '65f336fe38d25ec78cae745e'}}, 'device_config': {'dirac-1': {'num_samples': 20}}, 'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1']}} Dirac allocation balance = 0 s (unmetered) Job submitted job_id='65f336ff9e39850da64a4147'-: 2024/03/14 10:42:23 RUNNING: 2024/03/14 10:42:24 COMPLETED: 2024/03/14 10:47:05 Dirac allocation balance = 0 s (unmetered) {'job_info': {'job_id': '65f336ff9e39850da64a4147', 'job_submission': {'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1'], 'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '65f336fe38d25ec78cae745e'}}, 'device_config': {'dirac-1': {'num_samples': 20}}}, 'job_status': {'submitted_at_rfc3339nano': '2024-03-14T17:42:23.364Z', 'queued_at_rfc3339nano': '2024-03-14T17:42:23.365Z', 'running_at_rfc3339nano': '2024-03-14T17:42:24.401Z', 'completed_at_rfc3339nano': '2024-03-14T17:47:04.693Z'}, 'job_result': {'file_id': '65f3381838d25ec78cae7465', 'device_usage_s': 39}, 'details': {'status': 'COMPLETED'}}, 'results': {'file_id': '65f3381838d25ec78cae7465', 'num_parts': 1, 'num_bytes': 948, 'file_name': 'tutorial_eqc1', 'file_config': {'quadratic_unconstrained_binary_optimization_results': {'counts': [20], 'energies': [-2802848653247512600], 'solutions': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}}}} Energies: [-2802848653247512600] {'job_submission': {'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '65f3381b38d25ec78cae7467'}}, 'device_config': {'dirac-1': {'num_samples': 20}}, 'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1']}} Dirac allocation balance = 0 s (unmetered) Job submitted job_id='65f3381c9e39850da64a4149'-: 2024/03/14 10:47:08 RUNNING: 2024/03/14 10:47:09 COMPLETED: 2024/03/14 10:51:50 Dirac allocation balance = 0 s (unmetered) {'job_info': {'job_id': '65f3381c9e39850da64a4149', 'job_submission': {'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1'], 'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '65f3381b38d25ec78cae7467'}}, 'device_config': {'dirac-1': {'num_samples': 20}}}, 'job_status': {'submitted_at_rfc3339nano': '2024-03-14T17:47:08.581Z', 'queued_at_rfc3339nano': '2024-03-14T17:47:08.582Z', 'running_at_rfc3339nano': '2024-03-14T17:47:08.762Z', 'completed_at_rfc3339nano': '2024-03-14T17:51:48.849Z'}, 'job_result': {'file_id': '65f3393438d25ec78cae746e', 'device_usage_s': 39}, 'details': {'status': 'COMPLETED'}}, 'results': {'file_id': '65f3393438d25ec78cae746e', 'num_parts': 1, 'num_bytes': 1708, 'file_name': 'tutorial_eqc1', 'file_config': {'quadratic_unconstrained_binary_optimization_results': {'counts': [12, 8], 'energies': [-60631112719794180, -60631112719794180], 'solutions': [[1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}}}} Energies: [-60631112719794180, -60631112719794180] {'job_submission': {'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '65f3393838d25ec78cae7470'}}, 'device_config': {'dirac-1': {'num_samples': 20}}, 'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1']}} Dirac allocation balance = 0 s (unmetered) Job submitted job_id='65f339399e39850da64a414b'-: 2024/03/14 10:51:53 QUEUED: 2024/03/14 10:51:54 RUNNING: 2024/03/14 10:56:33 COMPLETED: 2024/03/14 11:01:14 Dirac allocation balance = 0 s (unmetered) {'job_info': {'job_id': '65f339399e39850da64a414b', 'job_submission': {'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1'], 'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '65f3393838d25ec78cae7470'}}, 'device_config': {'dirac-1': {'num_samples': 20}}}, 'job_status': {'submitted_at_rfc3339nano': '2024-03-14T17:51:53.083Z', 'queued_at_rfc3339nano': '2024-03-14T17:51:53.083Z', 'running_at_rfc3339nano': '2024-03-14T17:56:33.274Z', 'completed_at_rfc3339nano': '2024-03-14T18:01:13.942Z'}, 'job_result': {'file_id': '65f33b6938d25ec78cae7472', 'device_usage_s': 39}, 'details': {'status': 'COMPLETED'}}, 'results': {'file_id': '65f33b6938d25ec78cae7472', 'num_parts': 1, 'num_bytes': 3988, 'file_name': 'tutorial_eqc1', 'file_config': {'quadratic_unconstrained_binary_optimization_results': {'counts': [11, 5, 2, 1, 1], 'energies': [-31525704197734390, -31525704197734390, -31525699902767096, -31525699902767096, -31525652658126840], 'solutions': [[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]]}}}} Energies: [-31525704197734390, -31525704197734390, -31525699902767096, -31525699902767096, -31525652658126840]
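As a quick sanity check (not part of the original tutorial), we can verify that the extracted components are close to orthonormal; the deflation step removes the component along each u from every column of F, so later components should be orthogonal to earlier ones up to the quality of the QUBO solutions.

# Gram matrix of the extracted components; it should be close to the
# 3 x 3 identity if the components are orthonormal
G = np.matmul(np.array(U), np.array(U).transpose())
print(np.round(G, 3))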
We can save the results and print the shape of U,
In [11]:
U = np.array(U)
print(U.shape)

np.save("U.npy", U)
Out [ ]:
(3, 1022499)
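Each of the three rows of U is one principal component: a unit vector over the 1,022,499 samples (one per provider) in the dataset.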
For comparison, we can compute the same number of components with a classical PCA from scikit-learn, L2-normalizing the resulting component scores,
In [12]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

F = np.array(df[feature_names])

pca_classical = PCA(n_components=REDUCED_DIM)
U_classical = pca_classical.fit_transform(F)
U_classical = normalize(U_classical, axis=0, norm="l2")
and compare the results to those of the above-mentioned approach,
In [13]:
U = np.load("U.npy")
# U = U_classical.transpose()

print(U.shape)
print(U_classical.shape)

print(abs(np.diag(np.matmul(U, U_classical))))
Out [ ]:
(3, 1022499) (1022499, 3) [0.9101745 0.93294887 0.82830476]
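The diagonal entries are the inner products between corresponding QUBO-based and classical components; the absolute value removes the sign ambiguity inherent in singular vectors. Overlaps of roughly 0.91, 0.93, and 0.83 indicate that the binary-quantized components are well aligned with their classical counterparts, consistent with the direction-quantization argument made in the methodology section.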