# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import norm
from scipy.stats import expon
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Load the dataset
weather_df = pd.read_csv('weather_data.csv')
weather_df['Date'] = pd.to_datetime(weather_df['Date'])
Probability Theory and Random Processes
Gaining insights on weather data with Data Science and Machine Learning
In this blog post I will discuss a few examples of probability in machine learning. If you are new to probability, I recommend one of the great textbooks that cover the topic and are available for free online, such as Think Bayes by Allen Downey and Bayes Rules! by Alicia A. Johnson, Miles Q. Ott, and Mine Dogucu.
Classification algorithms can estimate \(n \times k\) class membership probabilities for a dataset, where n is the number of data points in the dataset and k is the number of classes in the training dataset. Similarly, the Gaussian Mixtures clustering algorithm can generate \(n \times k\) cluster label probabilities.
Most classification algorithms can estimate the probability that a data point belongs to each class, and for Logistic Regression and Naive Bayes this is the natural output. In the same way, Gaussian Mixtures models can estimate the cluster membership probability of a data point.
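To make the \(n \times k\) shape concrete, here is a minimal sketch using scikit-learn. The data is synthetic (generated with `make_blobs`, which is my assumption for illustration and not part of the weather dataset), with n = 300 points and k = 3 groups.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data (assumption): n = 300 points in k = 3 groups
X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Classifier: one row per data point, one column per class
clf = LogisticRegression(max_iter=1000).fit(X, y)
class_probs = clf.predict_proba(X)

# Clustering: one row per data point, one column per mixture component
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
cluster_probs = gmm.predict_proba(X)

print(class_probs.shape, cluster_probs.shape)
print(class_probs[0].round(3), "row sum:", class_probs[0].sum().round(6))
```

Each row of either array is a probability distribution over the k classes or clusters, so it sums to 1.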
\(\Huge P(A\vert B)={\frac {P(B\vert A)P(A)}{P(B)}}\)
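As a small worked example of Bayes' theorem, take A = "it rains" and B = "the sky is cloudy". The three input probabilities below are made-up numbers chosen only for illustration; they are not estimated from the weather dataset.

```python
# Hypothetical inputs (assumptions for illustration only)
p_rain = 0.2                  # P(A): prior probability of rain
p_cloudy_given_rain = 0.9     # P(B|A)
p_cloudy_given_dry = 0.3      # P(B|not A)

# Law of total probability gives P(B)
p_cloudy = p_cloudy_given_rain * p_rain + p_cloudy_given_dry * (1 - p_rain)

# Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)
p_rain_given_cloudy = p_cloudy_given_rain * p_rain / p_cloudy

print(f"P(cloudy) = {p_cloudy:.2f}")
print(f"P(rain | cloudy) = {p_rain_given_cloudy:.3f}")
```

With these inputs, observing clouds raises the probability of rain from the prior of 0.2 to roughly 0.43.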
Understanding Weather Forecasting with the help of Probability theory and Random Processes
Introduction
Machine learning plays a pivotal role in understanding and predicting various natural phenomena, and weather forecasting is no exception. To harness the power of machine learning for weather data analysis, it is essential to have a solid foundation in random processes and probability theory. These fundamental concepts are the building blocks that enable us to model the inherent uncertainty and variability present in weather data.
Weather data, such as temperature, humidity, wind speed, and precipitation, exhibit random behavior due to the complex interplay of atmospheric processes. Random processes are mathematical models used to describe the evolution of these variables over time. These processes capture the idea that weather conditions are not deterministic but rather stochastic, influenced by a multitude of factors, including geographical location, time of year, and local phenomena.
Probability theory, on the other hand, provides the framework to quantify and reason about uncertainty in weather data. It allows us to assign probabilities to different weather outcomes and make informed predictions based on observed data. For example, we can calculate the probability of rain on a given day or estimate the likelihood of extreme weather events, such as hurricanes or heatwaves, occurring in a specific region.
Machine learning techniques, such as regression, classification, and time series analysis, heavily rely on probabilistic and random process models to extract meaningful insights from weather data. By incorporating these techniques, we can build predictive models that not only provide accurate weather forecasts but also account for uncertainty, enabling better decision-making in various applications like agriculture, transportation, and disaster management.
In this context, the weather dataset you are using serves as a valuable source of information for exploring and applying these concepts. By understanding random processes and probability theory, you can leverage machine learning to unlock the potential hidden within weather data, improving the accuracy and reliability of weather forecasts and facilitating data-driven decision-making in various sectors that rely on weather information.
Data Loading and Basic Visualization
Exploratory Data Analysis
#Exploratory Data Analysis
#Histograms and KDE (Kernel Density Estimation) plots for Temperature and Humidity.
sns.histplot(weather_df['Temperature'], kde=True, color='blue', label='Temperature')
sns.histplot(weather_df['Humidity'], kde=True, color='green', label='Humidity', alpha=0.5)
plt.legend()
plt.show()
# Pair Plot to visualize all variables together.
sns.pairplot(weather_df, hue='Weather_Condition')
plt.show()
Probability Distributions
#Normal Distribution Fit for Temperature.
sns.histplot(weather_df['Temperature'], kde=False, color='blue', label='Temperature')
# Fitting a normal distribution and plotting it
mean, std = norm.fit(weather_df['Temperature'])
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mean, std)
plt.plot(x, p * max(weather_df['Temperature'].value_counts()), 'k', linewidth=2)
title = "Fit results: mean = %.2f, std = %.2f" % (mean, std)
plt.title(title)
plt.show()
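The fit above rescales the density by the most frequent value count, which will not generally line up with the bar heights of a count histogram. A common alternative is to multiply the density by the number of observations times the bin width, which gives the expected count per bin. The sketch below shows that arithmetic with the standard library only; the temperatures are synthetic (an assumption, not the real `weather_df['Temperature']`), and the estimates are the same maximum-likelihood quantities that `norm.fit` returns (sample mean and population standard deviation).

```python
import math
import random
import statistics

random.seed(0)
# Synthetic stand-in for the temperature column (assumed values)
temps = [random.gauss(20.0, 5.0) for _ in range(5000)]

# Maximum-likelihood estimates for a normal distribution
mu = statistics.fmean(temps)
sigma = statistics.pstdev(temps)

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Count histogram with equal-width bins
num_bins = 30
lo, hi = min(temps), max(temps)
width = (hi - lo) / num_bins
counts = [0] * num_bins
for t in temps:
    idx = min(int((t - lo) / width), num_bins - 1)  # clamp the maximum into the last bin
    counts[idx] += 1

# Expected count per bin = pdf(bin centre) * n * bin width
centres = [lo + (i + 0.5) * width for i in range(num_bins)]
expected = [normal_pdf(c, mu, sigma) * len(temps) * width for c in centres]

print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"observed peak = {max(counts)}, expected peak = {max(expected):.1f}")
```

In seaborn, passing `stat='density'` to `sns.histplot` achieves the same alignment from the other direction: the histogram is normalized, so the unscaled density can be drawn on top of it.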
#Exponential Distribution Fit for Humidity.
from scipy.stats import expon
# Plotting histogram
sns.histplot(weather_df['Humidity'], kde=False, color='green', label='Humidity')
# Fitting an exponential distribution and plotting it
params = expon.fit(weather_df['Humidity'])
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = expon.pdf(x, *params)
plt.plot(x, p * max(weather_df['Humidity'].value_counts()), 'k', linewidth=2)
plt.show()
Time Series Analysis
Temperature Trend over Time.
#Checking the temperature trend against time
weather_df.set_index('Date')['Temperature'].plot()
plt.title("Temperature Trend Over Time")
plt.ylabel("Temperature")
plt.show()
Probability Theory in Action
#Conditional Probability: Probability of High Humidity given Rainy Weather.
high_humidity = weather_df['Humidity'] > 80
rainy_days = weather_df['Weather_Condition'] == 'Rainy'
prob_high_humidity_given_rain = np.mean(high_humidity[rainy_days])
print(f"Probability of High Humidity given Rainy Weather: {prob_high_humidity_given_rain}")
Probability of High Humidity given Rainy Weather: 0.27586206896551724
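Taking the mean of the high-humidity indicator over the rainy rows is the same as applying the definition \(P(A \vert B) = P(A \cap B) / P(B)\). The sketch below checks this on a handful of hypothetical records (made up for illustration, not taken from the dataset).

```python
# Hypothetical observations: (humidity in %, weather condition)
records = [
    (85, "Rainy"), (60, "Rainy"), (90, "Rainy"), (70, "Rainy"),
    (82, "Sunny"), (40, "Sunny"), (55, "Cloudy"), (88, "Cloudy"),
]
n = len(records)

# Definition: P(high humidity | rainy) = P(high humidity and rainy) / P(rainy)
p_rainy = sum(1 for h, w in records if w == "Rainy") / n
p_joint = sum(1 for h, w in records if w == "Rainy" and h > 80) / n
p_cond = p_joint / p_rainy

# Shortcut: fraction of high-humidity days among the rainy days only
rainy_flags = [h > 80 for h, w in records if w == "Rainy"]
p_direct = sum(rainy_flags) / len(rainy_flags)

print(p_cond, p_direct)
```

Both routes give the same number, because dividing by \(P(B)\) is exactly what restricting the sample to the rainy rows does.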
#Joint Distribution: Temperature and Humidity.
sns.jointplot(data=weather_df, x='Temperature', y='Humidity', kind='kde', color='red')
plt.show()
Correlation Analysis
#Correlation Heatmap
# Selecting only numerical columns for correlation analysis
numerical_weather_df = weather_df.select_dtypes(include=[np.number])
# Plotting the correlation heatmap
sns.heatmap(numerical_weather_df.corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()
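Each cell of the heatmap is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. As a rough sketch of what that number means, the function below computes it by hand; the helper name `pearson_r` and the two short lists are hypothetical and only serve as an illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: temperature falls linearly as humidity rises
humidity = [30, 45, 60, 75, 90]
temperature = [28, 25, 22, 19, 16]

r = pearson_r(humidity, temperature)
print(f"r = {r:.3f}")
```

A value near -1 or +1 indicates a strong linear relationship, while a value near 0 indicates no linear relationship (the variables may still be related in a non-linear way).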
Linear Regression for Temperature Prediction
#Model Training and Evaluation.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
# Preparing data for linear regression
X = weather_df[['Humidity']]
y = weather_df['Temperature']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Training the model
model = LinearRegression()
model.fit(X_train, y_train)
# Making predictions
y_pred = model.predict(X_test)
# Plotting actual vs predicted values
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.scatter(X_test, y_pred, color='red', label='Predicted')
plt.xlabel('Humidity')
plt.ylabel('Temperature')
plt.title('Actual vs Predicted Temperature')
plt.legend()
plt.show()
# Model evaluation
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
Mean Squared Error: 24.010797366937965
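The mean squared error is expressed in squared units, so it is often easier to read after taking the square root. The sketch below spells out the MSE formula on a few hypothetical values and then converts the reported MSE into a root mean squared error. I am assuming the temperature is recorded in °C, as the Monte Carlo printout later in the post suggests.

```python
import math

def mse_by_hand(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical values, only to show the formula at work
example_mse = mse_by_hand([20, 22, 25], [21, 20, 25])
print(f"example MSE: {example_mse:.3f}")

# RMSE for the MSE reported above
reported_mse = 24.010797366937965
rmse = math.sqrt(reported_mse)
print(f"RMSE: {rmse:.2f}")
```

An RMSE of about 4.9 means that predictions from humidity alone are typically off by roughly five degrees, under the unit assumption above.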
Markov Chain for Weather Condition Transitions
#Let's estimate a simple Markov chain to model the transitions between different weather conditions.
import pandas as pd
# Calculating transition probabilities (assumes the rows are in chronological order)
weather_conditions = weather_df['Weather_Condition'].unique()
# Start from a matrix of zero counts, one row and one column per condition
transition_matrix = pd.DataFrame(0.0, index=weather_conditions, columns=weather_conditions)
# Count each pair of consecutive observations (previous -> current)
for (prev, curr) in zip(weather_df['Weather_Condition'], weather_df['Weather_Condition'][1:]):
    transition_matrix.at[prev, curr] += 1
# Normalizing the rows to sum to 1
transition_matrix = transition_matrix.div(transition_matrix.sum(axis=1), axis=0)
# Display the transition matrix
print(transition_matrix)
           Windy     Snowy    Cloudy     Foggy     Sunny     Rainy
Windy   0.155116  0.165017  0.161716  0.194719  0.181518  0.141914
Snowy   0.196667  0.166667  0.153333  0.166667  0.156667  0.160000
Cloudy  0.144543  0.182891  0.171091  0.165192  0.188791  0.147493
Foggy   0.199346  0.124183  0.199346  0.147059  0.140523  0.189542
Sunny   0.167247  0.167247  0.222997  0.167247  0.128920  0.146341
Rainy   0.131488  0.179931  0.211073  0.166090  0.141869  0.169550
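This code estimates the probability of transitioning from one weather condition to the next by counting consecutive pairs of observations and normalizing each row. Each row is the conditional distribution of the next condition given the current one, which makes the matrix a basic first-order Markov chain.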
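Once the transition matrix is estimated, it can be used to generate a sequence of weather conditions and to approximate the long-run share of each condition. The sketch below does both with the standard library, using the probabilities copied from the printed matrix. The function names `simulate` and `stationary`, the starting state, and the number of steps are my own choices for illustration.

```python
import random

states = ["Windy", "Snowy", "Cloudy", "Foggy", "Sunny", "Rainy"]
# Rows: current condition, columns: next condition (values from the printed matrix)
P = [
    [0.155116, 0.165017, 0.161716, 0.194719, 0.181518, 0.141914],
    [0.196667, 0.166667, 0.153333, 0.166667, 0.156667, 0.160000],
    [0.144543, 0.182891, 0.171091, 0.165192, 0.188791, 0.147493],
    [0.199346, 0.124183, 0.199346, 0.147059, 0.140523, 0.189542],
    [0.167247, 0.167247, 0.222997, 0.167247, 0.128920, 0.146341],
    [0.131488, 0.179931, 0.211073, 0.166090, 0.141869, 0.169550],
]

def simulate(start, steps, rng):
    """Walk the chain: draw each next state from the row of the current state."""
    indices = list(range(len(states)))
    current = states.index(start)
    path = [start]
    for _ in range(steps):
        current = rng.choices(indices, weights=P[current])[0]
        path.append(states[current])
    return path

def stationary(matrix, iterations=200):
    """Approximate the long-run distribution by repeatedly applying the matrix."""
    size = len(matrix)
    dist = [1.0 / size] * size
    for _ in range(iterations):
        dist = [sum(dist[i] * matrix[i][j] for i in range(size)) for j in range(size)]
        total = sum(dist)
        dist = [d / total for d in dist]  # guard against rounding in the printed values
    return dist

rng = random.Random(0)
path = simulate("Sunny", 7, rng)
print(" -> ".join(path))

pi = stationary(P)
for state, share in zip(states, pi):
    print(f"{state:7s} {share:.3f}")
```

Because every entry of the estimated matrix is fairly close to 1/6, the long-run shares also come out close to uniform, which suggests that in this dataset the current condition carries little information about the next one.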
Monte Carlo Simulation for Temperature Extremes
#Use Monte Carlo simulation to estimate the probability of extreme temperature events.
np.random.seed(0)
num_simulations = 10000
extreme_temp_count = 0
extreme_temp_threshold = 30 # Define what you consider as extreme temperature
for _ in range(num_simulations):
    simulated_temp = np.random.choice(weather_df['Temperature'])
    if simulated_temp > extreme_temp_threshold:
        extreme_temp_count += 1
probability_of_extreme_temp = extreme_temp_count / num_simulations
print(f"Probability of Extreme Temperature (> {extreme_temp_threshold}°C): {probability_of_extreme_temp}")
Probability of Extreme Temperature (> 30°C): 0.0226
This Monte Carlo simulation randomly samples temperatures from the dataset and calculates the probability of encountering temperatures above a certain threshold.
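Because each draw is an independent sample from the observed temperatures, the estimate converges to the fraction of rows above the threshold, and its sampling noise follows the usual binomial formula. The sketch below computes the standard error and an approximate 95% interval for the reported estimate; the normal approximation and the 1.96 multiplier are standard choices that I am adding here, not something taken from the original analysis.

```python
import math

p_hat = 0.0226          # estimate reported above
num_simulations = 10_000

# Standard error of a proportion estimated from independent draws
standard_error = math.sqrt(p_hat * (1 - p_hat) / num_simulations)

# Approximate 95% interval using the normal approximation
z = 1.96
lower = p_hat - z * standard_error
upper = p_hat + z * standard_error

print(f"standard error = {standard_error:.4f}")
print(f"approx. 95% interval = ({lower:.4f}, {upper:.4f})")
```

This interval only reflects the randomness of the simulation itself. It says nothing about how well the dataset represents the true climate.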
Obtain the logistic function mathematically
Step 1. Write out the linear regression equation
\(\Huge y=\beta_0+\beta_1 x_1+...+\beta_n x_n\)
Step 2. The logistic regression equation is the same as above, except that the output is the log odds
\(\Huge log(odds)=\beta_0+\beta_1 x_1+...+\beta_n x_n\)
Step 3. Exponentiate both sides of the logistic regression equation to get odds
\(\Huge odds=e^{\beta_0+\beta_1 x_1+...+\beta_n x_n}\)
Step 4. Write out the probability equation
\(\Huge p=\frac{odds}{1+odds}\)
Step 5. Plug odds (from step 3) into the probability equation
\(\Huge p=\frac{e^{\beta_0+\beta_1 x_1+...+\beta_n x_n}}{1+e^{\beta_0+\beta_1 x_1+...+\beta_n x_n}}\)
Step 6. Divide the numerator and denominator by the odds (from step 3)
\(\Huge p=\frac{1}{1+e^{-(\beta_0+\beta_1 x_1+...+\beta_n x_n)}}\)
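The derivation can be checked numerically: the odds-based form from step 5 and the logistic form from step 6 should return the same probability for any input. In the sketch below, the coefficients and feature values are hypothetical and chosen only for illustration.

```python
import math

def linear_predictor(betas, xs):
    """beta_0 + beta_1 * x_1 + ... + beta_n * x_n (the log odds from step 2)."""
    return betas[0] + sum(b * x for b, x in zip(betas[1:], xs))

def prob_from_odds(log_odds):
    """Step 5: p = odds / (1 + odds), with odds = exp(log odds)."""
    odds = math.exp(log_odds)
    return odds / (1 + odds)

def logistic(log_odds):
    """Step 6: p = 1 / (1 + exp(-log odds))."""
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical coefficients and features
betas = [-1.0, 0.5, 2.0]
xs = [2.0, 0.25]

z = linear_predictor(betas, xs)
print(f"log odds = {z}")
print(f"step 5 form: {prob_from_odds(z):.6f}")
print(f"step 6 form: {logistic(z):.6f}")
```

A log odds of 0 corresponds to odds of 1 and therefore to a probability of exactly 0.5, which is a quick sanity check for either form.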
Conclusion
This analysis applied probability theory and random processes to the weather dataset through distribution fitting, conditional probability, correlation analysis, linear regression, a Markov chain of weather conditions, and Monte Carlo simulation. Together, these probabilistic models provide insights into weather patterns and temperature predictions, and they show how uncertainty in weather data can be quantified rather than ignored.