Dimensionality reduction on Dirac

Device: Dirac-1

Introduction

Machine learning problems often start with a large number of features. A dimensionality reduction approach can be used to reduce this feature count, which is especially helpful when the features feed an unsupervised algorithm such as clustering. In what follows, we present a tutorial on using QCi's technology to implement a QUBO-based Singular Value Decomposition (SVD) algorithm.

Methodology

Principal Component Analysis (PCA) is a dimensionality reduction technique based on a Singular Value Decomposition (SVD) of the data matrix. Suppose we have a dataset with $N$ samples and $d$ features, represented by an $N \times d$ matrix $F$. We can decompose $F$ as,

$$F = U D V^T$$

where $U$ is an $N \times d$ matrix with orthonormal columns, $D$ is a diagonal square $d \times d$ matrix, and $V$ is an orthogonal matrix with orthonormal columns and rows. Note that the columns of $U$ form an orthonormal basis $\{{\bf u_k}\}_{k \in \{1, 2,..., d\}}$ for the vector space spanned by the columns of $F$. It can be shown that the larger the $k$-th diagonal element of $D$, the greater the contribution of the basis element ${\bf u_k}$. Thus, one can rank the basis elements ${\bf u_k}$ by the value of $D_{kk}$ and choose a subset of the most important ones; this yields a lower-dimensional representation of the data matrix $F$. Assuming that $D_{00} > D_{11} > ... > D_{dd}$, ${\bf u_0}$ is said to be the first principal component of the original matrix $F$.

It can be shown that

$${\bf u_k} = \frac{1}{\sqrt{\lambda_k}} F {\bf v_k}$$

where $\lambda_k$ and ${\bf v_k}$ denote the $k$-th eigenvalue and normalized eigenvector of the symmetric $d \times d$ matrix $G := F^T F$. Note too that ${\bf v_k}$ is in fact the $k$-th column of matrix $V$, and that $\sqrt{\lambda_k}$ is the $k$-th diagonal element of matrix $D$. Finding the first principal component of $F$ is then equivalent to finding the eigenvector of $G$ corresponding to its maximum eigenvalue, that is, the eigenvector of $-G$ corresponding to its minimum eigenvalue. An estimate of the first principal component of $F$ can then be obtained by solving the QUBO,

$$\min_{\bf q} -{\bf q}^{T} G {\bf q}$$

where ${\bf q}$ is a $d$-dimensional binary vector. The first principal component of $F$ is,

$${\bf u_0} = \frac{1}{\sqrt{\lambda_0}} F {\bf q}$$

where $\lambda_0 = {\bf v_0}^T G {\bf v_0}$ and ${\bf v_0} = \frac{{\bf q}}{||{\bf q}||}$. One can then replace each column of $F$ with its component orthogonal to ${\bf u_0}$ and repeat the above process to get the second principal component ${\bf u_1}$. This can be repeated to get any desired number of principal components of $F$, say ${\bf u_0}, {\bf u_1},..., {\bf u_{d^\prime}}$, where $d^\prime < d$. To remove a principal component ${\bf u}$ from a feature matrix $F$, we have,

$$F^{new} = F - {\bf u} {\bf u}^{T} F$$

Note that this is an approximation to a full PCA, since the vector ${\bf q}$ in the above QUBO is binary. The approximation is equivalent to quantizing the directions in feature space, so that PCA can only choose from a potentially large but finite set of directions, namely those corresponding to $d$-dimensional binary vectors ${\bf q}$.
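
To make the procedure concrete, here is a minimal NumPy sketch that solves the same QUBO by brute-force enumeration on a small random matrix (a classical stand-in for the Dirac solver, for illustration only), recovers ${\bf u_0}$, and performs one deflation step:

import numpy as np

# Toy illustration (classical brute force, not the Dirac device): find the
# binary vector q minimizing -q^T G q, convert it to a unit direction, and
# form the first principal component estimate.
rng = np.random.default_rng(0)
F = rng.normal(size=(200, 6))           # N = 200 samples, d = 6 features
G = F.T @ F                             # symmetric d x d matrix

d = G.shape[0]
best_q, best_val = None, np.inf
for bits in range(1, 2 ** d):           # enumerate all nonzero binary vectors
    q = np.array([(bits >> k) & 1 for k in range(d)], dtype=float)
    val = -q @ G @ q
    if val < best_val:
        best_q, best_val = q, val

v0 = best_q / np.linalg.norm(best_q)    # v0 = q / ||q||
lambda0 = v0 @ G @ v0                   # estimate of the top eigenvalue
u0 = (F @ v0) / np.sqrt(lambda0)        # estimate of the first principal component

# Deflation: remove u0 from F before extracting the next component
F_new = F - np.outer(u0, u0 @ F)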

Medicare Prescription Data

We implemented this approach using a publicly available dataset on opioid prescriptions in the United States. The dataset can be found at https://www.cms.gov/data-research/statistics-trends-and-reports/medicare-provider-utilization-payment-data/part-d-prescriber

Clean data

We start by cleaning the dataset.

In [1]:

import pandas as pd
# Input
INP_FILE = "Medicare_Provider_Utilization_and_Payment_Data__Part_D_Prescriber_Summary_Table_CY2014__50001-NNN__ANON.csv"
OUT_FILE = "cleaned_medicare_data.csv"
CON_VARS = [
"total_claim_count",
"total_30_day_fill_count",
"total_drug_cost",
"total_day_supply",
"bene_count",
"total_claim_count_ge65",
"total_30_day_fill_count_ge65",
"total_drug_cost_ge65",
"total_day_supply_ge65",
"bene_count_ge65",
"brand_claim_count",
"brand_drug_cost",
"generic_claim_count",
"generic_drug_cost",
"other_claim_count",
"other_drug_cost",
"mapd_claim_count",
"mapd_drug_cost",
"pdp_claim_count",
"pdp_drug_cost",
"lis_claim_count",
"lis_drug_cost",
"nonlis_claim_count",
"nonlis_drug_cost",
"opioid_claim_count",
"opioid_drug_cost",
"opioid_day_supply",
"opioid_bene_count",
"opioid_prescriber_rate",
"antibiotic_claim_count",
"antibiotic_drug_cost",
"antibiotic_bene_count",
"hrm_claim_count_ge65",
"hrm_drug_cost_ge65",
"hrm_bene_count_ge65",
"antipsych_claim_count_ge65",
"antipsych_drug_cost_ge65",
"antipsych_bene_count_ge65",
"average_age_of_beneficiaries",
"beneficiary_age_less_65_count",
"beneficiary_age_65_74_count",
"beneficiary_age_75_84_count",
"beneficiary_age_greater_84_count",
"beneficiary_female_count",
"beneficiary_male_count",
"beneficiary_race_white_count",
"beneficiary_race_black_count",
"beneficiary_race_asian_pi_count",
"beneficiary_race_hispanic_count",
"beneficiary_race_nat_ind_count",
"beneficiary_race_other_count",
"beneficiary_nondual_count",
"beneficiary_dual_count",
"beneficiary_average_risk_score",
]
VALID_PROVIDER_MI = [
"A",
"M",
"J",
"L",
"R",
"S",
"E",
"D",
"C",
"B",
"K",
"P",
"W",
"H",
"T",
"G",
"F",
"N",
"V",
"I",
"O",
"Y",
"Z",
"U",
"Q",
"X",
]
VALID_GEN = ["F", "M", "Other", "Unknown"]
VALID_ENTITIES = ["I", "O"]
VALID_DESC_FLAGS = ["S", "T"]
VALID_ENROLLS = ["E", "N", "O"]
# Some utilities
def convert_to_float(x):
    try:
        return float(x)
    except:
        return None
def convert_to_int(x):
    try:
        return int(float(x))
    except:
        return None
# Read data
df = pd.read_csv(INP_FILE, on_bad_lines = "skip", low_memory=False)
# Clean categorical variables
df["nppes_provider_mi"] = df["nppes_provider_mi"].fillna("Unknown")
df["nppes_provider_mi"] = df["nppes_provider_mi"].apply(
lambda x: x if x in VALID_PROVIDER_MI else "Unknown"
)
df["nppes_credentials"] = df["nppes_credentials"].fillna("Unknown")
df["nppes_credentials"] = df["nppes_credentials"].apply(
lambda x: str(x).replace(".", "")
)
cred_hash = {
"MEDICAL DOCTOR": "MD",
"NURSE PRACTITIONER": "NP",
}
df["nppes_credentials"] = df["nppes_credentials"].apply(
lambda x: cred_hash[x] if x in cred_hash else x,
)
df["nppes_provider_gender"] = df["nppes_provider_gender"].fillna("Unknown")
df["nppes_provider_gender"] = df["nppes_provider_gender"].apply(
lambda x: x if x in VALID_GEN else "Other",
)
df["nppes_entity_code"] = df["nppes_entity_code"].apply(
lambda x: x if x in VALID_ENTITIES else "Unknown",
)
df["nppes_provider_zip5"] = df["nppes_provider_zip5"].fillna("Unknown")
df["nppes_provider_country"] = df["nppes_provider_country"].apply(
lambda x: "US" if x == "US" else "Other",
)
df["description_flag"] = df["description_flag"].apply(
lambda x: x if x in VALID_DESC_FLAGS else "Unknown",
)
df["medicare_prvdr_enroll_status"] = df["medicare_prvdr_enroll_status"].apply(
lambda x: x if x in VALID_ENROLLS else "Unknown",
)
# Impute missing beneficiary counts (which cannot be zero) with the specialty-level mean
df["bene_count"] = df["bene_count"].apply(
convert_to_int
).fillna(-1)
tmp_df = df.groupby(
"specialty_description", as_index=False,
)["bene_count"].mean()
bene_count_hash = dict(
zip(
tmp_df["specialty_description"],
tmp_df["bene_count"],
)
)
df["bene_count"] = df.apply(
lambda x: x["bene_count"] if x[
"bene_count"
] > 0 else bene_count_hash[
x["specialty_description"]
],
axis=1,
)
# Treat continuous variables
for item in CON_VARS:
    df[item] = df[item].apply(
        convert_to_float
    ).fillna(0.0)
# Filter out invalid states
df = df[
~df["nppes_provider_state"].isin(
["XX", "E", "N", "S"]
)
]
# Output
df.to_csv(OUT_FILE, index=False)

Generate features

We then generate features. The categorical features are encoded using the average value of a few important variables in each category.
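
As a toy illustration of this encoding (the column names below are hypothetical, not the real Medicare schema), each category is replaced by the within-category mean of a numeric variable:

import pandas as pd

# Hypothetical toy data to illustrate mean-encoding of a categorical column
toy = pd.DataFrame({
    "specialty": ["Dentist", "Dentist", "Cardiology", "Cardiology", "Cardiology"],
    "opioid_claim_count": [10, 14, 3, 5, 4],
})

# Mean of the numeric variable within each category ...
means = toy.groupby("specialty", as_index=False)["opioid_claim_count"].mean()
means = means.rename(columns={"opioid_claim_count": "specialty_opioid_claim_count"})

# ... merged back, giving a numeric feature in place of the categorical one
toy = toy.merge(means, how="left", on="specialty")
print(toy)

The actual feature-generation code follows.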

In [2]:

import pandas as pd
# Input
INP_FILE = "cleaned_medicare_data.csv"
OUT_FILE = "medicare_features.csv"
# Set some parameters
CAT_VARS = [
"nppes_provider_mi",
#"nppes_credentials", # This is rather messy, so ignoring it.
"nppes_provider_gender",
"nppes_entity_code",
"nppes_provider_city",
"nppes_provider_zip5",
#"nppes_provider_country", # Almost all cases are US
"specialty_description",
"medicare_prvdr_enroll_status",
"nppes_provider_state",
]
CON_VARS = [
"total_claim_count",
"total_30_day_fill_count",
"total_drug_cost",
"total_day_supply",
"bene_count",
"total_claim_count_ge65",
"total_30_day_fill_count_ge65",
"total_drug_cost_ge65",
"total_day_supply_ge65",
"bene_count_ge65",
"brand_claim_count",
"brand_drug_cost",
"generic_claim_count",
"generic_drug_cost",
"other_claim_count",
"other_drug_cost",
"mapd_claim_count",
"mapd_drug_cost",
"pdp_claim_count",
"pdp_drug_cost",
"lis_claim_count",
"lis_drug_cost",
"nonlis_claim_count",
"nonlis_drug_cost",
"opioid_claim_count",
"opioid_drug_cost",
"opioid_day_supply",
"opioid_bene_count",
"antibiotic_claim_count",
"antibiotic_drug_cost",
"antibiotic_bene_count",
"hrm_claim_count_ge65",
"hrm_drug_cost_ge65",
"hrm_bene_count_ge65",
"antipsych_claim_count_ge65",
"antipsych_drug_cost_ge65",
"antipsych_bene_count_ge65",
"average_age_of_beneficiaries",
"beneficiary_age_less_65_count",
"beneficiary_age_65_74_count",
"beneficiary_age_75_84_count",
"beneficiary_age_greater_84_count",
"beneficiary_female_count",
"beneficiary_male_count",
"beneficiary_race_white_count",
"beneficiary_race_black_count",
"beneficiary_race_asian_pi_count",
"beneficiary_race_hispanic_count",
"beneficiary_race_nat_ind_count",
"beneficiary_race_other_count",
"beneficiary_nondual_count",
"beneficiary_dual_count",
"beneficiary_average_risk_score",
]
# Read and clean data
df = pd.read_csv(INP_FILE, low_memory=False)
# Embed categorical features
embedded_cat_features = []
for item in CAT_VARS:
    tmp_df = df.groupby(item, as_index=False).agg(
        {
            "opioid_claim_count": "mean",
            "opioid_drug_cost": "mean",
            "opioid_day_supply": "mean",
            "opioid_bene_count": "mean",
            "opioid_prescriber_rate": "mean",
        }
    ).rename(
        columns={
            "opioid_claim_count": "%s_opioid_claim_count" % item,
            "opioid_drug_cost": "%s_opioid_drug_cost" % item,
            "opioid_day_supply": "%s_opioid_day_supply" % item,
            "opioid_bene_count": "%s_opioid_bene_count" % item,
            "opioid_prescriber_rate": "%s_opioid_prescriber_rate" % item,
        }
    )
    df = df.merge(tmp_df, how="left", on=item)
    embedded_cat_features += [
        "%s_opioid_claim_count" % item,
        "%s_opioid_drug_cost" % item,
        "%s_opioid_day_supply" % item,
        "%s_opioid_bene_count" % item,
        "%s_opioid_prescriber_rate" % item,
    ]
# Drop unembedded categorical variables and some others
df = df[["npi"] + CON_VARS + embedded_cat_features]
# Write features file
df.to_csv(OUT_FILE, index=False)

Dimensionality Reduction

Once the features are generated, we can implement the above-mentioned SVD algorithm. We start by importing some libraries, setting some parameters, and loading the features into a Pandas dataframe.

In [3]:

# Import libs
import sys
import os
import time
import numpy as np
import pandas as pd
from qci_client import QciClient
# Define some parameters
FEATURES_FILE = "medicare_features.csv"
REDUCED_DIM = 3
# Read features
df = pd.read_csv(FEATURES_FILE, low_memory=False)

We now collect the feature names and print the original and reduced dimensions,

In [4]:

feature_names = list(set(df.columns) - {"npi"})
orig_dim = len(feature_names)
print(
"Original dimension is %d; reduced dimension will be %d" % (
orig_dim,
REDUCED_DIM,
)
)

Out [ ]:

Original dimension is 93; reduced dimension will be 3

We now define a function that estimates the first principal component of a feature matrix by solving a QUBO on Dirac-1,

In [9]:

def get_first_principal_comp(F):
    qubo = -np.matmul(F.transpose(), F)
    # Make sure the matrix is symmetric to machine precision
    qubo = 0.5 * (qubo + qubo.transpose())
    # Create the QUBO file object
    qubo_json = {
        "file_name": "qubo_tutorial.json",
        "file_config": {
            "qubo": {"data": qubo, "num_variables": orig_dim},
        }
    }
    # Solve the optimization problem
    qci = QciClient()
    response_json = qci.upload_file(qubo_json)
    qubo_file_id = response_json["file_id"]
    # Set up the job body
    job_params = {
        "sampler_type": "dirac-1",
        "alpha": 1.0,
        "nsamples": 20,
    }
    job_json = qci.build_job_body(
        job_type="sample-qubo",
        job_params=job_params,
        qubo_file_id=qubo_file_id,
        job_name="tutorial_eqc1",
        job_tags=["tutorial_eqc1"],
    )
    print(job_json)
    # Run the job
    job_response_json = qci.process_job(
        job_body=job_json, job_type="sample-qubo",
    )
    print(job_response_json)
    results = list(job_response_json["results"]["file_config"].values())[0]
    energies = results["energies"]
    samples = results["solutions"]
    print("Energies:", energies)
    # Take the lowest-energy binary solution as q
    q = np.array(samples[0])
    assert len(q) == orig_dim, "Inconsistent solution size!"
    # v0 = q / ||q||
    fct = np.linalg.norm(q)
    if fct > 0:
        fct = 1.0 / fct
    v0 = fct * q
    v0 = v0.reshape((v0.shape[0], 1))
    # lambda0 = v0^T G v0, where G = -qubo
    lambda0 = np.matmul(
        np.matmul(v0.transpose(), -qubo),
        v0
    )[0][0]
    assert lambda0 >= 0, "Unexpected negative eigenvalue!"
    # u0 = F v0 / sqrt(lambda0), then normalized
    fct = np.sqrt(lambda0)
    if fct > 0:
        fct = 1.0 / fct
    u0 = fct * np.matmul(F, v0)
    u0 = u0.reshape(-1)
    fct = np.linalg.norm(u0)
    if fct > 0:
        fct = 1.0 / fct
    u0 = fct * u0
    return u0
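
For debugging, it can be useful to have a purely classical counterpart of this function. The sketch below (an illustrative helper, not part of the tutorial's pipeline) computes the exact, unquantized first principal component with `np.linalg.eigh`, so its output can be compared against the QUBO-based estimate:

import numpy as np

def get_first_principal_comp_classical(F):
    """Exact (unquantized) counterpart of get_first_principal_comp."""
    G = np.matmul(F.transpose(), F)
    eig_vals, eig_vecs = np.linalg.eigh(G)  # eigenvalues in ascending order
    v0 = eig_vecs[:, -1]                    # eigenvector of the largest eigenvalue
    u0 = np.matmul(F, v0)
    return u0 / np.linalg.norm(u0)

A value of `abs(np.dot(u_qubo, u_classical))` close to 1 indicates that the binary approximation aligns well with the exact first principal component.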

One can get the first REDUCED_DIM components by applying the above function iteratively, deflating the feature matrix after each component is extracted,

In [10]:

F = np.array(df[feature_names])
U = []
for i in range(min(REDUCED_DIM, F.shape[1])):
    u = get_first_principal_comp(F)
    U.append(u)
    # Deflate: remove the extracted component from F
    u = u.reshape((u.shape[0], 1))
    F = F - np.matmul(
        u,
        np.matmul(u.transpose(), F),
    )

Out [ ]:

{'job_submission': {'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '65f336fe38d25ec78cae745e'}}, 'device_config': {'dirac-1': {'num_samples': 20}}, 'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1']}}
Dirac allocation balance = 0 s (unmetered)
Job submitted job_id='65f336ff9e39850da64a4147'-: 2024/03/14 10:42:23
RUNNING: 2024/03/14 10:42:24
COMPLETED: 2024/03/14 10:47:05
Dirac allocation balance = 0 s (unmetered)
{'job_info': {'job_id': '65f336ff9e39850da64a4147', 'job_submission': {'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1'], 'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '65f336fe38d25ec78cae745e'}}, 'device_config': {'dirac-1': {'num_samples': 20}}}, 'job_status': {'submitted_at_rfc3339nano': '2024-03-14T17:42:23.364Z', 'queued_at_rfc3339nano': '2024-03-14T17:42:23.365Z', 'running_at_rfc3339nano': '2024-03-14T17:42:24.401Z', 'completed_at_rfc3339nano': '2024-03-14T17:47:04.693Z'}, 'job_result': {'file_id': '65f3381838d25ec78cae7465', 'device_usage_s': 39}, 'details': {'status': 'COMPLETED'}}, 'results': {'file_id': '65f3381838d25ec78cae7465', 'num_parts': 1, 'num_bytes': 948, 'file_name': 'tutorial_eqc1', 'file_config': {'quadratic_unconstrained_binary_optimization_results': {'counts': [20], 'energies': [-2802848653247512600], 'solutions': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}}}}
Energies: [-2802848653247512600]
{'job_submission': {'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '65f3381b38d25ec78cae7467'}}, 'device_config': {'dirac-1': {'num_samples': 20}}, 'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1']}}
Dirac allocation balance = 0 s (unmetered)
Job submitted job_id='65f3381c9e39850da64a4149'-: 2024/03/14 10:47:08
RUNNING: 2024/03/14 10:47:09
COMPLETED: 2024/03/14 10:51:50
Dirac allocation balance = 0 s (unmetered)
{'job_info': {'job_id': '65f3381c9e39850da64a4149', 'job_submission': {'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1'], 'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '65f3381b38d25ec78cae7467'}}, 'device_config': {'dirac-1': {'num_samples': 20}}}, 'job_status': {'submitted_at_rfc3339nano': '2024-03-14T17:47:08.581Z', 'queued_at_rfc3339nano': '2024-03-14T17:47:08.582Z', 'running_at_rfc3339nano': '2024-03-14T17:47:08.762Z', 'completed_at_rfc3339nano': '2024-03-14T17:51:48.849Z'}, 'job_result': {'file_id': '65f3393438d25ec78cae746e', 'device_usage_s': 39}, 'details': {'status': 'COMPLETED'}}, 'results': {'file_id': '65f3393438d25ec78cae746e', 'num_parts': 1, 'num_bytes': 1708, 'file_name': 'tutorial_eqc1', 'file_config': {'quadratic_unconstrained_binary_optimization_results': {'counts': [12, 8], 'energies': [-60631112719794180, -60631112719794180], 'solutions': [[1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}}}}
Energies: [-60631112719794180, -60631112719794180]
{'job_submission': {'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '65f3393838d25ec78cae7470'}}, 'device_config': {'dirac-1': {'num_samples': 20}}, 'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1']}}
Dirac allocation balance = 0 s (unmetered)
Job submitted job_id='65f339399e39850da64a414b'-: 2024/03/14 10:51:53
QUEUED: 2024/03/14 10:51:54
RUNNING: 2024/03/14 10:56:33
COMPLETED: 2024/03/14 11:01:14
Dirac allocation balance = 0 s (unmetered)
{'job_info': {'job_id': '65f339399e39850da64a414b', 'job_submission': {'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1'], 'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '65f3393838d25ec78cae7470'}}, 'device_config': {'dirac-1': {'num_samples': 20}}}, 'job_status': {'submitted_at_rfc3339nano': '2024-03-14T17:51:53.083Z', 'queued_at_rfc3339nano': '2024-03-14T17:51:53.083Z', 'running_at_rfc3339nano': '2024-03-14T17:56:33.274Z', 'completed_at_rfc3339nano': '2024-03-14T18:01:13.942Z'}, 'job_result': {'file_id': '65f33b6938d25ec78cae7472', 'device_usage_s': 39}, 'details': {'status': 'COMPLETED'}}, 'results': {'file_id': '65f33b6938d25ec78cae7472', 'num_parts': 1, 'num_bytes': 3988, 'file_name': 'tutorial_eqc1', 'file_config': {'quadratic_unconstrained_binary_optimization_results': {'counts': [11, 5, 2, 1, 1], 'energies': [-31525704197734390, -31525704197734390, -31525699902767096, -31525699902767096, -31525652658126840], 'solutions': [[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]]}}}}
Energies: [-31525704197734390, -31525704197734390, -31525699902767096, -31525699902767096, -31525652658126840]

We can save the results and print the shape of the resulting component matrix,

In [11]:

U = np.array(U)
print(U.shape)
np.save("U.npy", U)

Out [ ]:

(3, 1022499)
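
As a quick sanity check (not shown in the original output), the rows of `U` should be close to orthonormal, since each component is normalized and extracted from a matrix deflated with respect to the previous components:

# The 3 x 3 Gram matrix of the extracted components should be near the identity
print(np.round(np.matmul(U, U.transpose()), 3))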

For comparison, we can compute a classical PCA,

In [12]:

from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize
F = np.array(df[feature_names])
pca_classical = PCA(n_components=REDUCED_DIM)
U_classical = pca_classical.fit_transform(F)
U_classical = normalize(U_classical, axis=0, norm="l2")

and compare the results to those of the above-mentioned approach; since the rows of U and the columns of U_classical are unit vectors, the absolute values of the diagonal of their product measure how well each QUBO-based component aligns with its classical counterpart,

In [13]:

U = np.load("U.npy")
#U = U_classical.transpose()
print(U.shape)
print(U_classical.shape)
print(
abs(
np.diag(
np.matmul(U, U_classical)
)
)
)

Out [ ]:

(3, 1022499)
(1022499, 3)
[0.9101745  0.93294887 0.82830476]