Principal Component Regression and an Application
Abstract
Multicollinearity, a problem frequently encountered in multiple linear regression (MLR), inflates the variance of the parameter estimates and thereby reduces the reliability and interpretability of the model. The aim of this thesis is to examine Principal Component Regression (PCR), an alternative remedy for multicollinearity, from a theoretical standpoint, to demonstrate its practical effectiveness in a biostatistical application, and to compare its results with those of the standard Ordinary Least Squares (OLS) method. C-Reactive Protein (CRP) level was modeled on 28 biochemical and hematological parameters using an open-source clinical dataset (n=1003). After the data were standardized, multicollinearity diagnostics (Variance Inflation Factor, VIF), OLS regression, Principal Component Analysis (PCA), and PCR were carried out in IBM SPSS Statistics 26.0. In the OLS model, VIF values were far above acceptable limits (some >1000), indicating severe multicollinearity. PCA on the correlation matrix led to the selection of the first 6 principal components (PCs), those with eigenvalues greater than 1.5, which together explained 60.8% of the total variance in the original variables. In the PCR model built on these 6 PCs, all VIF values equaled 1, showing that multicollinearity had been completely eliminated. The PCR model (R²=0.587) explained a larger share of the variance in CRP than the OLS model (R²=0.402), but its adjusted R² was lower (0.340 vs. 0.385). In conclusion, PCR yielded a more stable and valid model than OLS in the presence of multicollinearity and provided dimensionality reduction, at the cost of information loss and more difficult interpretation. PCR is therefore recommended as a valuable alternative for biostatistical data analyses in which multicollinearity is common.
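The workflow summarized in the abstract (standardize, diagnose multicollinearity with VIF, run PCA on the correlation matrix, keep components whose eigenvalue exceeds a threshold, regress on the retained component scores) can be sketched as follows. This is a minimal illustration, not the thesis's procedure: the thesis used IBM SPSS Statistics 26.0, whereas this sketch substitutes Python with NumPy, runs on synthetic data generated here, and uses hypothetical helper names (`vif`, `pcr_fit`). It does not reproduce the thesis's dataset or results.

```python
import numpy as np


def vif(X):
    """Variance inflation factors: diagonal of the inverse correlation matrix."""
    R = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(R))


def pcr_fit(X, y, eig_threshold=1.5):
    """Regress y on the principal components whose eigenvalue exceeds eig_threshold."""
    # Standardize predictors so that PCA operates on the correlation matrix.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(Z, rowvar=False)

    # Eigen-decomposition of the symmetric correlation matrix, largest eigenvalue first.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Retain components by the eigenvalue rule and compute their scores.
    keep = eigvals > eig_threshold
    scores = Z @ eigvecs[:, keep]

    # Ordinary least squares of y on an intercept plus the retained scores.
    A = np.column_stack([np.ones(len(y)), scores])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    centered = y - y.mean()
    r2 = 1.0 - (resid @ resid) / (centered @ centered)

    n, k = scores.shape
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return {
        "n_components": int(keep.sum()),
        "explained_share": float(eigvals[keep].sum() / eigvals.sum()),
        "scores": scores,
        "coef": coef,
        "r2": float(r2),
        "adj_r2": float(adj_r2),
    }


# Synthetic, deliberately collinear data (an assumption for illustration only):
# six predictors built as noisy copies of two latent factors.
rng = np.random.default_rng(0)
n = 500
latent = rng.normal(size=(n, 2))
X = np.column_stack(
    [latent[:, j] + 0.05 * rng.normal(size=n) for j in (0, 0, 0, 1, 1, 1)]
)
y = 2.0 * latent[:, 0] - latent[:, 1] + rng.normal(size=n)

fit = pcr_fit(X, y)
print("max VIF of raw predictors:", vif(X).max())
print("VIF of component scores:", vif(fit["scores"]))
print("components kept:", fit["n_components"])
print("R2:", fit["r2"], "adjusted R2:", fit["adj_r2"])
```

Because the component scores are mutually uncorrelated by construction, their VIF values equal 1 up to floating-point error, which is the property the abstract reports for the PCR model. The threshold argument defaults to 1.5 only to mirror the eigenvalue rule stated in the abstract.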
Keywords
Biostatistics
End Page
56