How to Spot Investment Opportunities in China’s A-shares with Regression Analysis

In July this year, I was studying Prof. Damodaran's valuation course. One of the models it covers, somewhat simpler than DCF, is multiple linear regression, which has also been shown to be usable in the Chinese A-share market. The idea is to take a basket of comparable stocks (for example, the constituents of a bank ETF), collect several indicators for each stock, and find the linear relationship between one dependent variable and several independent variables. For example, suppose the dependent variable is PB and the independent variables are PE, ROE, and the bank's non-performing loan (NPL) ratio. This gives us a regression equation.

For example, PB = a + x*PE + y*ROE + z*NPL ratio. Substituting an individual stock's PE, ROE, and NPL ratio into this equation gives a predicted PB. If the predicted PB is higher than the current PB, the stock looks undervalued relative to its peers; if the predicted PB is lower, the stock looks overvalued.
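As a minimal sketch of this comparison, the snippet below plugs one stock's indicators into a regression equation and compares the result with the market PB, reading a market PB below the predicted PB as "undervalued". The coefficients, the input values, and the helper names `predict_pb` and `valuation_signal` are made up for illustration; they are not fitted results.

```python
# Hypothetical coefficients for PB = a + x*PE + y*ROE + z*NPL (not fitted values)
COEFS = {"const": 0.50, "PE": 0.02, "ROE": 0.03, "NPL": -0.20}


def predict_pb(pe, roe, npl, coefs=COEFS):
    """Return the PB implied by the regression equation."""
    return coefs["const"] + coefs["PE"] * pe + coefs["ROE"] * roe + coefs["NPL"] * npl


def valuation_signal(actual_pb, predicted_pb):
    """Compare the market PB with the PB the regression implies."""
    if actual_pb < predicted_pb:
        return "undervalued"
    if actual_pb > predicted_pb:
        return "overvalued"
    return "fairly valued"


predicted = predict_pb(pe=6.0, roe=10.0, npl=1.2)
print(f"predicted PB = {predicted:.2f}")
print(valuation_signal(actual_pb=0.60, predicted_pb=predicted))
```

The signal is only as good as the regression behind it, so it is best read as one input among several.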

Below I combine Python code with Prof. Damodaran's course material to walk through this method in detail. You are also welcome to discuss it with me.

Of course, many AI tools can now run a linear regression quickly and directly: give DeepSeek or ChatGPT the stock codes and the indicators, and the AI will calculate the regression equation for you.

Sample selection

To ensure the quality of the basket of stocks, Damodaran selected only banks with a market capitalization above $100 million for which the Tier 1 capital adequacy ratio was available. This left 36 banks.

For the Chinese A-share market, I adjusted the criteria accordingly:

Market capitalization is greater than 100 billion RMB.
Core Tier 1 capital adequacy data is available.

The final result is 21 banks.
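The two screening rules can be sketched with pandas as below. The bank names, figures, and column names are placeholders invented for the example, not the real universe; only the 100 billion RMB threshold comes from the text.

```python
import pandas as pd

# Hypothetical universe: names and figures are placeholders, not real data
universe = pd.DataFrame({
    "bank": ["Bank A", "Bank B", "Bank C", "Bank D"],
    "market_cap_bn_cny": [250.0, 80.0, 1200.0, 130.0],
    "core_tier1_ratio": [9.5, 10.1, None, 11.2],  # None = not disclosed
})

MIN_MARKET_CAP_BN = 100.0  # threshold from the text: 100 billion RMB

# Keep banks that pass both rules: large enough and ratio available
selected = universe[
    (universe["market_cap_bn_cny"] > MIN_MARKET_CAP_BN)
    & universe["core_tier1_ratio"].notna()
].reset_index(drop=True)

print(selected["bank"].tolist())
```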

Processing data

After getting the basket of bank stocks, the next step is to collect the data for the various indicators. Before that, we need to decide which indicators to look for.

Assuming the dependent variable in our study is PB (price-to-book ratio), we need to select suitable independent variables. Many factors could matter, so we first list all the candidate indicators and then use statistical analysis to pick the ones best suited to the linear regression.

When choosing indicators, Prof. Damodaran gives 3 important types:

Indicators strongly correlated with the dependent variable: ROE, the ability to earn money on capital (a high ROE usually goes with a high PB), and the payout ratio (high dividends attract more investors, raising PB)
Growth indicators: historical growth, expected EPS growth
Risk metrics (the higher the risk, the lower the PB): beta, standard deviation, Tier 1 capital adequacy ratio, non-performing loan (NPL) ratio

The list below shows the metrics I chose initially. The next step was finding the data; I used the figures reported for 2024.

At first I tried to pull the data with some Python libraries, but none of them were up to date and some were inaccurate, so I spent half a day compiling the data from financial reports and Dongfang Finance.

CURRENT_PRICE	market_cap (billion CNY)	BETA	PB	TIER1_RATIO	NPL_RATIO	DIVIDEND_YIELD(%)	PAYOUT_RATIO	EPS	BOOK_VALUE_PER_SHARE	2024 cash dividend per 10 shares (RMB)	Net assets per share (RMB)

Beware of highly correlated predictors and of overfitting. When choosing independent variables, you will usually keep only those that are highly correlated with the dependent variable.

Compute the mean, standard deviation, min/max, and 25th/median/75th percentiles for the data.

Look for outliers, such as a lone negative value.

If the data has too much variance, or the measures are on very different scales, it needs to be standardized or log-transformed.
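The three checks above can be sketched in a few lines of pandas. The example uses the cash dividend column from this post's dataset, because the market capitalization figures are not listed here. The 1.5 × IQR outlier rule is one common convention that I am assuming for illustration; it is not necessarily the rule used for the original analysis.

```python
import numpy as np
import pandas as pd

# Cash dividend per 10 shares (RMB), values from the dataset used in this post
cash_dividend = pd.Series(
    [6.08, 9, 4.1, 3.05, 1.3, 20, 5.21, 2.8, 2.11, 10.6,
     4.4, 5, 2.4, 3.79, 3.08, 1.89, 3.7, 2.42, 2.62, 3.547],
    name="Cash_Dividend",
)

# Mean, standard deviation, min/max and quartiles in one call
print(cash_dividend.describe())

# Flag outliers with the 1.5 * IQR rule (an assumed convention)
q1, q3 = cash_dividend.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = cash_dividend[
    (cash_dividend < q1 - 1.5 * iqr) | (cash_dividend > q3 + 1.5 * iqr)
]
print(outliers)

# Log transform compresses the right tail (valid only for positive values)
log_div = np.log(cash_dividend)

# Z-score standardization: mean 0, sample standard deviation 1
z_div = (log_div - log_div.mean()) / log_div.std()
print(z_div.round(2).tolist())
```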

The chart below shows the market capitalization after I log-transformed and standardized it.

Screening Variables

Next we measure correlation. An Excel spreadsheet is enough to compute Pearson's correlation coefficient between the variables.

The r value is the Pearson correlation coefficient:

$|r| > 0.8$ → highly correlated
$0.5 < |r| \leq 0.8$ → medium correlation
$|r| \approx 0$ → noise
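The same coefficient can be computed in pandas instead of Excel. The sketch below uses the PB, ROE, and NPL ratio columns from this post's dataset and labels each r with the thresholds above; the helper name `strength` and its "weak" catch-all bucket are my own additions.

```python
import pandas as pd

# PB, ROE and NPL ratio for the 20 banks, copied from the dataset in this post
df = pd.DataFrame({
    "PB": [0.58, 0.89, 0.64, 0.46, 0.42, 1.15, 0.97, 0.95, 0.82, 0.68,
           0.57, 0.68, 0.83, 0.63, 0.76, 0.54, 0.77, 0.7, 0.69, 0.71],
    "ROE": [10.08, 13.59, 6.28, 8.84, 5.18, 14.49, 13.59, 16, 12.97, 9.89,
            8.65, 10.01, 10.46, 9.08, 9.88, 7.93, 10.69, 9.5, 9.84, 9.79],
    "NPL_Ratio": [1.06, 0.76, 1.36, 1.61, 1.47, 0.94, 0.89, 0.76, 0.83, 1.07,
                  1.31, 1.18, 1.3, 1.31, 1.34, 1.25, 1.34, 1.25, 0.9, 1.16],
})


def strength(r):
    """Label |r| with the thresholds above; 'weak' is an added catch-all."""
    magnitude = abs(r)
    if magnitude > 0.8:
        return "high"
    if magnitude > 0.5:
        return "medium"
    return "weak"


# Pearson r of each candidate variable with PB
r_with_pb = df.corr(method="pearson")["PB"].drop("PB")
for name, r in r_with_pb.items():
    print(f"{name}: r = {r:+.3f} ({strength(r)})")
```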

More highly correlated variables are not always better, because they can introduce multicollinearity.

Multicollinearity: if the correlation coefficient between two independent variables is extremely high (|r| ≥ 0.8), the regression coefficients may be distorted.
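A simple way to apply this rule is to scan every pair of candidate predictors and flag the pairs above the threshold. The sketch below does that on toy data; the column names `x1`, `x2`, `x3`, their values, and the helper `collinear_pairs` are invented for illustration.

```python
import itertools

import pandas as pd


def collinear_pairs(X, threshold=0.8):
    """Return (var_a, var_b, r) for every predictor pair with |r| >= threshold."""
    corr = X.corr()
    pairs = []
    for a, b in itertools.combinations(X.columns, 2):
        r = corr.loc[a, b]
        if abs(r) >= threshold:
            pairs.append((a, b, round(float(r), 3)))
    return pairs


# Toy predictors: x2 is almost a multiple of x1, x3 is unrelated (made-up numbers)
X = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.1, 3.9, 6.2, 7.8, 10.1, 12.0],
    "x3": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
})

print(collinear_pairs(X))
```

When a pair is flagged, a common response is to keep only one of the two variables in any given regression.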

We also have to consider the significance, or p-value, which will be explained in the next step.

The table below shows the 6 variables I chose

A high positive correlation could be a value driver, a high negative correlation could be a risk factor

	ROE	NPL_RATIO	Cash_Dividend	Beta	PE	Tier 1 ratio
r value (with PB)	0.87	-0.397	0.567	-0.87	0.397	0.299

Scatter plot

The initial regression is fitted with OLS (ordinary least squares).

R^2 (the coefficient of determination) measures the model's ability to explain the dependent variable.

Significance of the regression coefficients: tests whether the effect of an independent variable on the dependent variable is real rather than due to chance.
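To make R^2 and the t-value concrete, here is a NumPy-only sketch that fits PB on ROE and the NPL ratio and computes both by hand. Restricting the fit to these two predictors is my simplification for the example. statsmodels reports the same quantities, plus p-values, in its summary.

```python
import numpy as np

# PB, ROE and NPL ratio for the 20 banks, copied from the dataset in this post
pb = np.array([0.58, 0.89, 0.64, 0.46, 0.42, 1.15, 0.97, 0.95, 0.82, 0.68,
               0.57, 0.68, 0.83, 0.63, 0.76, 0.54, 0.77, 0.7, 0.69, 0.71])
roe = np.array([10.08, 13.59, 6.28, 8.84, 5.18, 14.49, 13.59, 16, 12.97, 9.89,
                8.65, 10.01, 10.46, 9.08, 9.88, 7.93, 10.69, 9.5, 9.84, 9.79])
npl = np.array([1.06, 0.76, 1.36, 1.61, 1.47, 0.94, 0.89, 0.76, 0.83, 1.07,
                1.31, 1.18, 1.3, 1.31, 1.34, 1.25, 1.34, 1.25, 0.9, 1.16])

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(roe), roe, npl])
n, k = X.shape  # k counts the intercept as well

# OLS coefficients: the values that minimize the sum of squared residuals
beta, *_ = np.linalg.lstsq(X, pb, rcond=None)
resid = pb - X @ beta

# R^2 = 1 - SS_res / SS_tot
ss_res = resid @ resid
ss_tot = ((pb - pb.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot

# t-value = coefficient / standard error of the coefficient
sigma2 = ss_res / (n - k)
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_values = beta / se

for name, b, t in zip(["const", "ROE", "NPL_Ratio"], beta, t_values):
    print(f"{name}: coef = {b:+.4f}, t = {t:+.2f}")
print(f"R^2 = {r2:.3f}")
```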

import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler

# data input
data = {
"Bank": [
"Ping An Bank", "Bank of Ningbo", "Shanghai Pudong Development Bank", "Hua Xia Bank", "Minsheng Bank", "China Merchants Bank", "Bank of Jiangsu", "Bank of Hangzhou", "Bank of Nanjing",
"Industrial Bank", "Bank of Beijing", "Bank of Shanghai", "Agricultural Bank of China", "Bank of Communications", "Industrial and Commercial Bank of China", "China Everbright Bank", "China Construction Bank",
"Bank of China", "Postal Savings Bank of China", "China CITIC Bank"
],
"PB": [
0.58, 0.89, 0.64, 0.46, 0.42, 1.15, 0.97, 0.95, 0.82, 0.68, 0.57, 0.68, 0.83, 0.63,
0.76, 0.54, 0.77, 0.7, 0.69, 0.71 
], 
"Cash_Dividend": [ 
6.08, 9, 4.1, 3.05, 1.3, 20, 5.21, 2.8, 2.11, 10.6, 4.4, 5, 2.4, 3.79, 3.08, 1.89, 
3.7, 2.42, 2.62, 3.547 
], 
"ROE": [ 
10.08, 13.59, 6.28, 8.84, 5.18, 14.49, 13.59, 16, 12.97, 9.89, 8.65, 10.01, 10.46, 
9.08, 9.88, 7.93, 10.69, 9.5, 9.84, 9.79 
], 
"NPL_Ratio": [ 
1.06, 0.76, 1.36, 1.61, 1.47, 0.94, 0.89, 0.76, 0.83, 1.07, 1.31, 1.18, 1.3, 1.31, 
1.34, 1.25, 1.34, 1.25, 0.9, 1.16 
], 
"BETA": [ 
0.94, 1.19, 0.73, 0.85, 0.73, 0.83, 0.53, 0.65, 0.57, 0.8, 
0.72, 0.69, 0.33, 0.53, 0.34, 0.73, 0.37, 0.41, 0.56, 0.7 
], 
"PE": [ 
5.91, 7.1, 10.44, 4.59, 3.0, 3.6, 7.77, 9.1, 4.87, 7.05, 
4.62, 7.03, 8.26, 6.75, 7.47, 3.47, 7.55, 7.51, 5.75, 6.61 
],
"Tire1": [
9.12, 9.84, 10.04, 9.77, 9.36, 14.86, 9.12, 8.85, 9.36, 9.75,
8.95, 10.35, 11.2, 10.24, 14.1, 9.82, 14.48, 12.2, 9.56, 9.72
]
}

# Construct DataFrame
df = pd.DataFrame(data)

# Features and target variables
X = df[["ROE", "NPL_Ratio", "Cash_Dividend", "BETA", "PE", "Tire1"]]
y = df["PB"]

# Fitting the original data
X_const = sm.add_constant(X)
model = sm.OLS(y, X_const).fit()

# Standardized data fitting (keeping column names); only X is scaled, y stays in PB units
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)
X_scaled_const = sm.add_constant(X_scaled_df)
model_std = sm.OLS(y, X_scaled_const).fit()

# Summary results table
summary_df = pd.DataFrame({
"Unstandardized B": model.params,
"Std. Error": model.bse,
"Standardized Beta": model_std.params,
"t-value": model.tvalues,
"p-value": model.pvalues
})

# Print results
print(summary_df.round(4))

The larger the absolute t-value and the smaller the p-value, the higher the significance.

So Tire1 (the Tier 1 capital ratio) is not significant.

Low significance does not necessarily mean a variable is useless: it can be caused by a small sample, noise, or collinearity, and business meaning is sometimes more important. I choose the best-performing model after fitting, weighing coefficient significance, MSE, and R^2 together, to avoid the overfitting or distortion that premature screening can cause.

Tweak the regression

Try different variables to see how well they fit.

Remove outliers symmetrically: if you delete a very high value, also try deleting a very low value, so that the sample stays balanced.
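One way to sketch symmetric removal is to drop the same number of rows from both ends of a column. The helper `trim_symmetric` and its parameter `k` are hypothetical names; the sample uses the ROE values of the first six banks in this post's dataset.

```python
import pandas as pd


def trim_symmetric(df, column, k=1):
    """Drop the k lowest and k highest rows of `column`, keeping original order."""
    ordered = df.sort_values(column)
    return ordered.iloc[k:len(ordered) - k].sort_index()


# ROE values of the first six banks in this post's dataset
sample = pd.DataFrame({"ROE": [10.08, 13.59, 6.28, 8.84, 5.18, 14.49]})

trimmed = trim_symmetric(sample, "ROE", k=1)
print(trimmed["ROE"].tolist())
```

With only about 20 banks in the sample, each removed row costs a noticeable share of the data, so a small `k` is the safer choice.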

This code generates every regression equation with 2 or 3 independent variables drawn from the 6 candidates I screened, and calculates R^2 for each.

# Import required libraries
import pandas as pd
import statsmodels.api as sm
import itertools

# Build the dataset
data = {
"PB": [0.58, 0.89, 0.64, 0.46, 0.42, 1.15, 0.97, 0.95, 0.82, 0.68, 0.57, 0.68, 0.83, 0.63, 0.76, 0.54, 0.77, 0.7, 0.69, 0.71],
"ROE": [10.08, 13.59, 6.28, 8.84, 5.18, 14.49, 13.59, 16, 12.97, 9.89, 8.65, 10.01, 10.46, 9.08, 9.88, 7.93, 10.69, 9.5, 9.84, 9.79], 
"NPL_Ratio": [1.06, 0.76, 1.36, 1.61, 1.47, 0.94, 0.89, 0.76, 0.83, 1.07, 1.31, 1.18, 1.3, 1.31, 1.34, 1.25, 1.34, 1.25, 0.9, 1.16], 
"Cash_Dividend": [6.08, 9, 4.1, 3.05, 1.3, 20, 5.21, 2.8, 2.11, 10.6, 4.4, 5, 2.4, 3.79, 3.08, 1.89, 3.7, 2.42, 2.62, 3.547], 
"BETA": [0.94, 1.19, 0.73, 0.85, 0.73, 0.83, 0.53, 0.65, 0.57, 0.8, 0.72, 0.69, 0.33, 0.53, 0.34, 0.73, 0.37, 0.41, 0.56, 0.7], 
"PE": [5.91, 7.1, 10.44, 4.59, 3.0, 3.6, 7.77, 9.1, 4.87, 7.05, 4.62, 7.03, 8.26, 6.75, 7.47, 3.47, 7.55, 7.51, 5.75, 6.61],
"Tire1": [9.12, 9.84, 10.04, 9.77, 9.36, 14.86, 9.12, 8.85, 9.36, 9.75, 8.95, 10.35, 11.2, 10.24, 14.1, 9.82, 14.48, 12.2, 9.56, 9.72]
}
df = pd.DataFrame(data)

# All variable names
variables = ["ROE", "NPL_Ratio", "Cash_Dividend", "BETA", "PE", "Tire1"]

# Store regression results
results = []

# Iterate over all combinations of 2 or 3 variables
for r in [2, 3]:
    for combo in itertools.combinations(variables, r):
        X = df[list(combo)]
        y = df["PB"]
        X_const = sm.add_constant(X)
        model = sm.OLS(y, X_const).fit()

        # Construct regression equation string (intercept first, then signed terms)
        equation_terms = [f"PB = {model.params['const']:.4f}"]
        for var in combo:
            coef = model.params[var]
            term = f"{coef:+.4f}*{var}"
            equation_terms.append(term)
        equation = " ".join(equation_terms)

        results.append({
            "Variables": ", ".join(combo),
            "R_squared": model.rsquared,
            "Adjusted_R2": model.rsquared_adj,
            "Equation": equation
        })

# Convert the results to a DataFrame and sort them
results_df = pd.DataFrame(results).sort_values(by="R_squared", ascending=False).reset_index(drop=True)

# Output the final results
pd.set_option("display.max_colwidth", None)
print("\nR² ranking and corresponding equations for all 2- and 3-variable regressions:")
print(results_df[["Variables", "R_squared", "Adjusted_R2", "Equation"]].to_string(index=False))

A high R^2 doesn't necessarily mean the model is good; it could be overfitting (especially if there are many variables).
For multiple regression, look at the adjusted R^2:
$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$
where n is the number of samples and p is the number of independent variables. It penalizes useless variables and keeps R^2 from rising just because more variables were added.
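A small worked example shows the penalty at work. The function below implements the standard adjusted R^2 formula with n and p as defined above; the R^2 of 0.80 is an arbitrary illustrative value, while n = 20 matches the number of banks in the code.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p independent variables fitted on n samples."""
    if n - p - 1 <= 0:
        raise ValueError("need more samples than parameters (n > p + 1)")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Same raw R^2, different numbers of variables
print(round(adjusted_r2(0.80, n=20, p=3), 4))
print(round(adjusted_r2(0.80, n=20, p=6), 4))
```

With the same raw R^2, the 6-variable model scores lower than the 3-variable one, which is why the adjusted figure is the fairer basis for comparing the combinations.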

Test

I used 2023 data as a test set to see how well the model generalizes.

I then calculated R^2 and MSE (the mean squared difference between the model's predicted values and the true values) on the test set.
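Both metrics are short enough to write out by hand, which makes clear what the library functions compute. In the sketch below the actual PB values are the 2023 test-set values from this post, while the predictions are made-up numbers used only to exercise the formulas.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def r2(y_true, y_pred):
    """1 - SS_res / SS_tot, measured against the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)


# Actual PB from the 2023 test set; predictions are hypothetical
y_true = [0.53, 1.07, 0.87, 0.51, 0.41]
y_pred = [0.60, 0.95, 0.90, 0.55, 0.50]

print(f"MSE = {mse(y_true, y_pred):.5f}")
print(f"R^2 = {r2(y_true, y_pred):.3f}")
```

On a test set R^2 can be negative, which means the model predicts worse than simply using the mean of the test values.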

import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import r2_score, mean_squared_error

# Test set 2023 annual report
test_data = { 
"PB": [0.53, 1.07,0.87,0.51,0.41], 
"ROE": [11.27,15.08,16.22,11.56,10.8], 
"NPL_Ratio": [1.06,0.76,0.95,1.37,1.18], 
"Cash_Dividend": [7.19,6,17.75,1.97,3.56], 
"BETA": [1.261,1.306,1.341,0.364,0.625], 
"PE": [5,7.79,6.09,4.61,4.41],
"Tire1": [9.22,9.64,13.73,13.15,8.74]
}
test_df = pd.DataFrame(test_data)

# Used to store evaluation results
evaluation_results = []

# Iterate over each model
for idx, row in results_df.iterrows():
    variable_list = row["Variables"].split(", ")

    # Fit the model (re-select variables from the training set for regression)
    X_train = df[variable_list]
    y_train = df["PB"]
    X_train_const = sm.add_constant(X_train)
    model = sm.OLS(y_train, X_train_const).fit()

    # Build the test set (has_constant="add" always adds the intercept column)
    X_test = test_df[variable_list]
    y_test = test_df["PB"]
    X_test_const = sm.add_constant(X_test, has_constant="add")

    # Model prediction
    y_pred = model.predict(X_test_const)

    # Score evaluation
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)

    # Save results
    evaluation_results.append({
        "Variables": row["Variables"],
        "Equation": row["Equation"],
        "Test_R2": r2,
        "Test_MSE": mse
    })

# Output the results of all models on the test set
eval_df = pd.DataFrame(evaluation_results).sort_values(by="Test_R2", ascending=False).reset_index(drop=True)
pd.set_option("display.max_colwidth", None)
print(eval_df.to_string(index=False))

The three regression equations obtained

PB = 0.7421 - 0.5770*NPL_Ratio + 0.0157*PE + 0.0518*Tire1
PB = 1.1716 - 0.4960*NPL_Ratio + 0.0192*PE
PB = 0.8856 - 0.3719*NPL_Ratio + 0.0182*Cash_Dividend + 0.0276*PE

npl_ratio = float(input("please input NPL_Ratio: "))
pe = float(input("please input PE: "))
tire1 = float(input("please input Tire1: "))
pb1 = 0.7421 - 0.5770 * npl_ratio + 0.0157 * pe + 0.0518 * tire1
print(f"the value of estimated PB : {pb1:.4f}")

Enter the latest values of the independent variables to get the predicted PB.

Spotting Investment Opportunities with Predicted Values

Using the regression equations, I estimated the PB of several banks.

For example, for CITIC Bank one model gave a PB of 0.85 and another gave 0.73, while the actual value is 0.62.

I read this as a sign that the stock may be overvalued, so one should be vigilant or consider selling (of course, this is only one indicator and can only be used as a reference).

I ran the model in mid-July; as I write this on August 10, the stock has indeed fallen.

The second example is from the logistics industry.

The day before, the model indicated that Shentong Express was undervalued; the next day the stock price rose, reaching the PB calculated by the regression equation.

Finally, we can also look for patterns, such as whether the companies judged to be overvalued cluster in certain types (for example, high beta), which also helps our investment analysis.

ref: https://pages.stern.nyu.edu/~adamodar/pdfiles/eqnotes/webcasts/multipleanalysis/multipleanalysis.pdf

RolzieInvestment