1) Import relevant libraries

In [443]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split

2) Load data from local drive

In [4]:
df = pd.read_excel('Rainfall.xlsx')
In [5]:
df
Out[5]:
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RainTomorrow
0 2016-01-01 SydneyAirport 18.6 26.5 0.0 11.0 12.3 NE 41.0 W ... 61 48.0 1016.4 1013.6 6.0 4 22.3 25.4 No No
1 2016-01-02 SydneyAirport 18.1 25.3 0.0 6.8 3.3 E 31.0 S ... 68 57.0 1013.4 1012.5 7.0 7 21.1 23.8 No No
2 2016-01-03 SydneyAirport 19.9 24.3 0.0 8.0 0.0 SE 56.0 SSE ... 65 75.0 1015.2 1014.5 8.0 8 22.2 22.5 No Yes
3 2016-01-04 SydneyAirport 20.2 25.4 3.6 6.4 0.7 ESE 48.0 E ... 85 94.0 1016.9 1016.5 8.0 8 20.7 18.7 Yes Yes
4 2016-01-05 SydneyAirport 17.6 21.0 51.2 0.0 0.0 ESE 54.0 ESE ... 83 91.0 1016.9 1014.9 8.0 8 19.4 19.9 Yes Yes
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
361 2016-12-27 SydneyAirport 21.3 26.8 0.0 11.2 6.0 NE 46.0 S ... 81 75.0 1014.7 1013.4 7.0 7 23.0 25.2 No No
362 2016-12-28 SydneyAirport 21.2 30.5 0.0 7.8 9.4 NNE 65.0 NE ... 66 46.0 1013.0 1009.3 6.0 6 24.5 28.9 No No
363 2016-12-29 SydneyAirport 22.4 38.2 0.0 10.4 4.8 NE 33.0 WNW ... 50 23.0 1007.0 1003.9 7.0 6 29.0 37.4 No No
364 2016-12-30 SydneyAirport 23.2 35.6 0.0 9.2 0.5 NNE 43.0 WSW ... 78 38.0 1003.5 1001.3 8.0 7 25.4 31.9 No No
365 2016-12-31 SydneyAirport 22.6 26.3 0.2 8.2 4.8 S 39.0 S ... 83 75.0 1003.9 1003.8 6.0 7 24.2 25.5 No No

366 rows × 23 columns

3) Explore dataset

Summary statistics

In [6]:
df.describe()
Out[6]:
MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustSpeed WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm
count 366.000000 366.000000 366.000000 361.000000 362.000000 365.000000 365.000000 365.000000 366.000000 365.000000 365.000000 365.000000 365.000000 366.000000 366.000000 366.000000
mean 15.314754 24.121858 3.457923 5.864266 7.578453 47.158904 18.446575 25.345205 62.081967 51.457534 1017.131233 1014.617260 4.238356 4.221311 19.095082 22.409290
std 4.601572 5.069821 10.512465 2.926939 3.781429 14.019294 8.895642 9.130992 16.845270 19.489966 7.259913 7.223554 2.755745 2.657156 4.512051 4.850321
min 5.400000 11.600000 0.000000 0.000000 0.000000 20.000000 0.000000 0.000000 18.000000 8.000000 998.100000 993.600000 0.000000 0.000000 8.000000 11.100000
25% 11.900000 20.400000 0.000000 3.400000 4.725000 37.000000 13.000000 19.000000 50.000000 37.000000 1012.600000 1009.900000 1.000000 1.000000 15.800000 19.000000
50% 15.450000 23.800000 0.000000 5.600000 8.700000 46.000000 17.000000 24.000000 62.000000 50.000000 1017.500000 1015.400000 5.000000 4.000000 19.400000 22.100000
75% 19.175000 27.000000 0.750000 8.000000 10.400000 56.000000 24.000000 31.000000 74.000000 62.000000 1021.600000 1019.000000 7.000000 7.000000 22.300000 25.400000
max 27.500000 40.600000 95.600000 15.800000 13.500000 120.000000 50.000000 61.000000 98.000000 97.000000 1038.800000 1035.400000 8.000000 8.000000 34.200000 37.400000

Keep relevant columns only (the numeric weather variables, plus 'RainToday' and the target 'RainTomorrow')

In [404]:
df1 = df[['MinTemp','MaxTemp','Evaporation',
         'Sunshine','WindGustSpeed','WindSpeed9am','WindSpeed3pm',
         'Humidity9am','Humidity3pm','Pressure9am','Pressure3pm','Cloud9am',
          'Cloud3pm','Temp9am','Temp3pm','RainToday','RainTomorrow']]
In [405]:
df1
Out[405]:
MinTemp MaxTemp Evaporation Sunshine WindGustSpeed WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RainTomorrow
0 18.6 26.5 11.0 12.3 41.0 9.0 28.0 61 48.0 1016.4 1013.6 6.0 4 22.3 25.4 No No
1 18.1 25.3 6.8 3.3 31.0 6.0 19.0 68 57.0 1013.4 1012.5 7.0 7 21.1 23.8 No No
2 19.9 24.3 8.0 0.0 56.0 26.0 30.0 65 75.0 1015.2 1014.5 8.0 8 22.2 22.5 No Yes
3 20.2 25.4 6.4 0.7 48.0 13.0 20.0 85 94.0 1016.9 1016.5 8.0 8 20.7 18.7 Yes Yes
4 17.6 21.0 0.0 0.0 54.0 24.0 31.0 83 91.0 1016.9 1014.9 8.0 8 19.4 19.9 Yes Yes
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
361 21.3 26.8 11.2 6.0 46.0 20.0 24.0 81 75.0 1014.7 1013.4 7.0 7 23.0 25.2 No No
362 21.2 30.5 7.8 9.4 65.0 15.0 43.0 66 46.0 1013.0 1009.3 6.0 6 24.5 28.9 No No
363 22.4 38.2 10.4 4.8 33.0 13.0 22.0 50 23.0 1007.0 1003.9 7.0 6 29.0 37.4 No No
364 23.2 35.6 9.2 0.5 43.0 7.0 11.0 78 38.0 1003.5 1001.3 8.0 7 25.4 31.9 No No
365 22.6 26.3 8.2 4.8 39.0 22.0 22.0 83 75.0 1003.9 1003.8 6.0 7 24.2 25.5 No No

366 rows × 17 columns

Distribution plot using pandas

In [249]:
df1.hist(figsize= (30,20))
Out[249]:
[Figure: 4 x 4 grid of histogram axes, one histogram per numeric column of df1]

Distribution plot using pyplot

In [521]:
column_name = df1.columns

# one figure per column; enumerate replaces the manual counter
for n, e in enumerate(column_name, start=1):
    plt.figure(n)
    plt.hist(df1[e].dropna(), bins=30)  # drop missing values before binning
    plt.legend([e])
    plt.show()

Distribution plot using seaborn

In [522]:
n = 0
for e in df1.columns:
    if e == 'RainTomorrow' or e == 'RainToday':
        print(e + '\tis categorical data')
    else:
        print(e)
        n = n + 1
        plt.figure(n)
        # histplot(kde=True) replaces distplot, which is deprecated since seaborn 0.11
        sns.histplot(df1[e].dropna(), kde=True)
        plt.legend([e])
        plt.show()

Matrix of scatter plot using seaborn - color by 'RainTomorrow' variable

In [126]:
sns.pairplot(df1.iloc[:,[0,1,2,3,4,5,6,7,16]], hue='RainTomorrow')
Out[126]:
<seaborn.axisgrid.PairGrid at 0x20d319759c8>

comment:

there seems to be some separation between the two classes ('Yes' or 'No') for the variables 'Sunshine' and 'Humidity9am'.

In [124]:
sns.pairplot(df1.iloc[:,[8,9,10,11,12,13,14,16]], hue='RainTomorrow')
Out[124]:
<seaborn.axisgrid.PairGrid at 0x20d2ef7f7c8>

comment:

there seems to be some separation between the two classes ('Yes' or 'No') for the variables 'Cloud9am' and 'Cloud3pm'.
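The visual separation noted in the comments can also be checked numerically by comparing per-class means. The sketch below is only an illustration: the frame `toy` (a name introduced here) holds six rows copied from the table displayed earlier (rows 0 to 4 and row 362), which is far too few to draw conclusions from. The same two calls would apply to the full df1.

```python
import pandas as pd

# Six rows copied from the displayed table (rows 0-4 and 362); illustration only
toy = pd.DataFrame({
    "Sunshine":     [12.3, 3.3, 0.0, 0.7, 0.0, 9.4],
    "Cloud3pm":     [4, 7, 8, 8, 8, 6],
    "RainTomorrow": ["No", "No", "Yes", "Yes", "Yes", "No"],
})

# Mean of every numeric column within each class of the target
class_means = toy.groupby("RainTomorrow").mean()
print(class_means)

# Difference between the class means ('Yes' minus 'No')
gap = class_means.loc["Yes"] - class_means.loc["No"]
print(gap)
```

A large gap relative to the spread of a variable suggests that the variable helps to tell the two classes apart.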

Correlation matrix

In [130]:
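# corr() needs numeric input; recent pandas versions raise an error on the 'Yes'/'No' columns
sns.heatmap(df1.select_dtypes(include='number').corr(), annot=False)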
Out[130]:
<matplotlib.axes._subplots.AxesSubplot at 0x20d2af339c8>
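A heatmap without annotations is hard to read off precisely, so it can help to list the correlation of each variable with the target directly. This is a minimal sketch: `toy` and the helper column `target` are names introduced here, and the six rows are copied from the table displayed earlier (rows 0 to 4 and row 362), so the numbers only illustrate the mechanics.

```python
import pandas as pd

# Six rows copied from the displayed table (rows 0-4 and 362); illustration only
toy = pd.DataFrame({
    "Humidity3pm":  [48.0, 57.0, 75.0, 94.0, 91.0, 46.0],
    "Sunshine":     [12.3, 3.3, 0.0, 0.7, 0.0, 9.4],
    "RainTomorrow": ["No", "No", "Yes", "Yes", "Yes", "No"],
})

# Encode the target as 0/1 so that it can take part in the correlation
toy["target"] = (toy["RainTomorrow"] == "Yes").astype(int)

# Pearson correlation of each numeric column with the target, sorted ascending
corr_with_target = (
    toy.select_dtypes(include="number")
       .corr()["target"]
       .drop("target")
       .sort_values()
)
print(corr_with_target)
```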

Check for missing values

In [406]:
x = df1.isnull().sum()
x.plot(kind='bar', title='Missing values in each variable')
x
Out[406]:
MinTemp          0
MaxTemp          0
Evaporation      5
Sunshine         4
WindGustSpeed    1
WindSpeed9am     1
WindSpeed3pm     1
Humidity9am      0
Humidity3pm      1
Pressure9am      1
Pressure3pm      1
Cloud9am         1
Cloud3pm         0
Temp9am          0
Temp3pm          0
RainToday        0
RainTomorrow     0
dtype: int64

4) Preprocessing

Mean imputation

In [407]:
# work on an explicit copy so the assignment below does not raise SettingWithCopyWarning
df1 = df1.copy()

for e in df1.columns:
    if df1[e].isnull().sum() > 0:
        # replace every missing entry in the column with the column mean
        df1.loc[df1[e].isnull(), e] = df1[e].mean()

alternatively, fill a single column with fillna(value), as shown below

In [409]:
# df1['Evaporation'] = df1['Evaporation'].fillna(np.mean(df1['Evaporation']))
In [408]:
df1.isnull().sum()
Out[408]:
MinTemp          0
MaxTemp          0
Evaporation      0
Sunshine         0
WindGustSpeed    0
WindSpeed9am     0
WindSpeed3pm     0
Humidity9am      0
Humidity3pm      0
Pressure9am      0
Pressure3pm      0
Cloud9am         0
Cloud3pm         0
Temp9am          0
Temp3pm          0
RainToday        0
RainTomorrow     0
dtype: int64

comment:

no more missing values now - REJOICE!
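Mean imputation for all columns can also be written as a single call, because fillna accepts a Series of per-column fill values. The sketch below uses a small frame `toy` (a name introduced here); the numbers are taken from the displayed table, but the positions of the missing values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame; the NaN positions are invented for illustration
toy = pd.DataFrame({
    "Evaporation": [11.0, np.nan, 8.0, 6.4],
    "Sunshine":    [12.3, 3.3, np.nan, 0.7],
    "RainToday":   ["No", "No", "No", "Yes"],
})

# Column means of the numeric columns only; NaN entries are skipped by default
col_means = toy.mean(numeric_only=True)

# fillna with a Series fills each column with the value stored under its label;
# columns without an entry (here 'RainToday') are left untouched
filled = toy.fillna(col_means)
print(filled)
```

Note that computing the means on the full dataset before the train/test split lets information from the test rows leak into training; computing them on the training rows only is the stricter option.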

5) Modelling

Encoding

In [426]:
# encode 'No' as 0 and anything else ('Yes') as 1;
# assign returns a new DataFrame, which avoids SettingWithCopyWarning
df1 = df1.assign(
    RainToday_encode=(df1['RainToday'] != 'No').astype(int),
    RainTomorrow_encode=(df1['RainTomorrow'] != 'No').astype(int),
)
In [430]:
df1
Out[430]:
MinTemp MaxTemp Evaporation Sunshine WindGustSpeed WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RainTomorrow RainToday_encode RainTomorrow_encode
0 18.6 26.5 11.0 12.3 41.0 9.0 28.0 61 48.0 1016.4 1013.6 6.0 4 22.3 25.4 No No 0 0
1 18.1 25.3 6.8 3.3 31.0 6.0 19.0 68 57.0 1013.4 1012.5 7.0 7 21.1 23.8 No No 0 0
2 19.9 24.3 8.0 0.0 56.0 26.0 30.0 65 75.0 1015.2 1014.5 8.0 8 22.2 22.5 No Yes 0 1
3 20.2 25.4 6.4 0.7 48.0 13.0 20.0 85 94.0 1016.9 1016.5 8.0 8 20.7 18.7 Yes Yes 1 1
4 17.6 21.0 0.0 0.0 54.0 24.0 31.0 83 91.0 1016.9 1014.9 8.0 8 19.4 19.9 Yes Yes 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
361 21.3 26.8 11.2 6.0 46.0 20.0 24.0 81 75.0 1014.7 1013.4 7.0 7 23.0 25.2 No No 0 0
362 21.2 30.5 7.8 9.4 65.0 15.0 43.0 66 46.0 1013.0 1009.3 6.0 6 24.5 28.9 No No 0 0
363 22.4 38.2 10.4 4.8 33.0 13.0 22.0 50 23.0 1007.0 1003.9 7.0 6 29.0 37.4 No No 0 0
364 23.2 35.6 9.2 0.5 43.0 7.0 11.0 78 38.0 1003.5 1001.3 8.0 7 25.4 31.9 No No 0 0
365 22.6 26.3 8.2 4.8 39.0 22.0 22.0 83 75.0 1003.9 1003.8 6.0 7 24.2 25.5 No No 0 0

366 rows × 19 columns
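The encoding above treats every value other than 'No' as 1, so a typo or an unexpected label would silently count as rain. A stricter variant, sketched here with an explicit dictionary and a small frame `toy` (both names are introduced here), makes such values visible instead. The five labels are the 'RainToday' values of rows 0 to 4 shown above.

```python
import pandas as pd

# 'RainToday' values of rows 0-4 from the displayed table
toy = pd.DataFrame({"RainToday": ["No", "No", "No", "Yes", "Yes"]})

# Explicit mapping: labels missing from the dictionary become NaN
mapping = {"No": 0, "Yes": 1}
encoded = toy["RainToday"].map(mapping)

# Fail loudly if any label was not covered by the mapping
if encoded.isnull().any():
    raise ValueError("unexpected label in 'RainToday'")

toy["RainToday_encode"] = encoded.astype(int)
print(toy)
```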

Drop original column that has been encoded (drop [RainToday] and [RainTomorrow])

In [ ]:
df1 = df1.drop(['RainToday', 'RainTomorrow'], axis = 1)
In [450]:
df1
Out[450]:
MinTemp MaxTemp Evaporation Sunshine WindGustSpeed WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday_encode RainTomorrow_encode
0 18.6 26.5 11.0 12.3 41.0 9.0 28.0 61 48.0 1016.4 1013.6 6.0 4 22.3 25.4 0 0
1 18.1 25.3 6.8 3.3 31.0 6.0 19.0 68 57.0 1013.4 1012.5 7.0 7 21.1 23.8 0 0
2 19.9 24.3 8.0 0.0 56.0 26.0 30.0 65 75.0 1015.2 1014.5 8.0 8 22.2 22.5 0 1
3 20.2 25.4 6.4 0.7 48.0 13.0 20.0 85 94.0 1016.9 1016.5 8.0 8 20.7 18.7 1 1
4 17.6 21.0 0.0 0.0 54.0 24.0 31.0 83 91.0 1016.9 1014.9 8.0 8 19.4 19.9 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
361 21.3 26.8 11.2 6.0 46.0 20.0 24.0 81 75.0 1014.7 1013.4 7.0 7 23.0 25.2 0 0
362 21.2 30.5 7.8 9.4 65.0 15.0 43.0 66 46.0 1013.0 1009.3 6.0 6 24.5 28.9 0 0
363 22.4 38.2 10.4 4.8 33.0 13.0 22.0 50 23.0 1007.0 1003.9 7.0 6 29.0 37.4 0 0
364 23.2 35.6 9.2 0.5 43.0 7.0 11.0 78 38.0 1003.5 1001.3 8.0 7 25.4 31.9 0 0
365 22.6 26.3 8.2 4.8 39.0 22.0 22.0 83 75.0 1003.9 1003.8 6.0 7 24.2 25.5 0 0

366 rows × 17 columns

Define 'y' (target variable) and 'X' (independent variables)

In [507]:
y = df1[['RainTomorrow_encode']] # target attributes 
X = df1.iloc[:, 0:16] # input attributes
In [508]:
y.head()
Out[508]:
RainTomorrow_encode
0 0
1 0
2 1
3 1
4 1
In [509]:
X.head()
Out[509]:
MinTemp MaxTemp Evaporation Sunshine WindGustSpeed WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday_encode
0 18.6 26.5 11.0 12.3 41.0 9.0 28.0 61 48.0 1016.4 1013.6 6.0 4 22.3 25.4 0
1 18.1 25.3 6.8 3.3 31.0 6.0 19.0 68 57.0 1013.4 1012.5 7.0 7 21.1 23.8 0
2 19.9 24.3 8.0 0.0 56.0 26.0 30.0 65 75.0 1015.2 1014.5 8.0 8 22.2 22.5 0
3 20.2 25.4 6.4 0.7 48.0 13.0 20.0 85 94.0 1016.9 1016.5 8.0 8 20.7 18.7 1
4 17.6 21.0 0.0 0.0 54.0 24.0 31.0 83 91.0 1016.9 1014.9 8.0 8 19.4 19.9 1

Split data into Training and Test set

In [510]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0, stratify=y)
In [511]:
X_train.head()
Out[511]:
MinTemp MaxTemp Evaporation Sunshine WindGustSpeed WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday_encode
7 15.8 25.8 6.6 12.100000 31.0 13.0 22.0 50 47.0 1020.0 1019.0 4.238356 1 22.1 24.5 0
181 5.6 17.8 1.8 7.578453 41.0 17.0 19.0 60 26.0 1021.1 1014.9 1.000000 7 9.7 15.9 0
198 10.0 20.1 1.8 3.200000 24.0 13.0 15.0 91 66.0 1031.2 1026.9 7.000000 7 14.3 19.0 0
305 11.8 23.4 10.6 12.100000 39.0 9.0 24.0 32 33.0 1014.4 1010.0 4.000000 1 16.7 21.3 0
240 8.3 19.3 3.8 10.700000 33.0 22.0 13.0 43 45.0 1022.8 1020.9 1.000000 2 13.4 17.4 0
In [456]:
y_train.head()
Out[456]:
RainTomorrow_encode
7 0
181 0
198 0
305 0
240 0
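The stratify=y argument asks train_test_split to keep the class proportions roughly equal in both parts, which matters when one class is much rarer than the other. The sketch below checks this on made-up labels (77 zeros and 23 ones, chosen only to mimic an imbalanced target); the single column `feature` is a placeholder.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 77 zeros and 23 ones
y = pd.Series([0] * 77 + [1] * 23, name="RainTomorrow_encode")
X = pd.DataFrame({"feature": np.arange(len(y))})  # placeholder feature

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# For 0/1 labels, the mean is the share of the positive class in each part
print("share of class 1 in train:", y_train.mean())
print("share of class 1 in test: ", y_test.mean())
```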

Building SVM model

In [467]:
from sklearn import svm

m = svm.SVC()
m.fit(X_train, np.ravel(y_train))
Out[467]:
SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

comment:

default SVM model uses RBF kernel

Prediction on Test dataset

In [468]:
m.predict(X_test)
Out[468]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      dtype=int64)
In [469]:
m.score(X_test, y_test)
Out[469]:
0.7727272727272727
In [473]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, m.predict(X_test), labels=[0,1])
Out[473]:
array([[85,  0],
       [25,  0]], dtype=int64)
In [476]:
m.predict(X_train)
Out[476]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64)
In [486]:
from sklearn.metrics import classification_report

predictions =(m.predict(X_test)).astype("int32")

print(classification_report(y_test, predictions))
              precision    recall  f1-score   support

           0       0.77      1.00      0.87        85
           1       0.00      0.00      0.00        25

    accuracy                           0.77       110
   macro avg       0.39      0.50      0.44       110
weighted avg       0.60      0.77      0.67       110

C:\Users\Bravo\Anaconda3_2020_02\lib\site-packages\sklearn\metrics\_classification.py:1272: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

comment:

overall model accuracy is 77%, which is exactly the share of 'no rain' days in the test set (85 of 110)

recall for event 0 (no rain tomorrow) is 100% - but only because the model predicts that it WILL NOT rain for every single day

recall for event 1 (rain tomorrow) is 0% - the model never predicts rain, so it is useless for forecasting rain!
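One possible reason for an RBF SVM collapsing to a single class is that the features are on very different scales (pressure is around 1000, cloud cover is between 0 and 8), because the RBF kernel is based on distances. This has not been verified on the rainfall data. The sketch below only shows the mechanics of standardizing inside a pipeline, on synthetic data generated here; the names `informative`, `noise` and `scaled_svm` are introduced for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (not the rainfall dataset): one informative feature on a
# small scale and one pure-noise feature on a pressure-like scale
rng = np.random.default_rng(0)
n = 200
y = (rng.random(n) < 0.25).astype(int)          # roughly 25% positives
informative = rng.normal(loc=6.0 * y, scale=1.0)
noise = rng.normal(loc=1015.0, scale=50.0, size=n)
X = np.column_stack([informative, noise])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# The scaler is fitted on the training part only and reused when scoring
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scaled_svm.fit(X_train, y_train)

scaled_acc = scaled_svm.score(X_test, y_test)
print("accuracy with scaling:", scaled_acc)
```

Putting the scaler inside the pipeline keeps the test rows out of the scaling statistics. Whether scaling changes the result on the real data would need to be checked by rerunning the model.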

Try building SVM model with linear kernel

In [483]:
m1 = svm.SVC(kernel = 'linear')
m1.fit(X_train, np.ravel(y_train))
Out[483]:
SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
In [484]:
m1.score(X_test, y_test)
Out[484]:
0.8454545454545455
In [490]:
confusion_matrix(y_test, m1.predict(X_test), labels=[0,1])
Out[490]:
array([[79,  6],
       [11, 14]], dtype=int64)
In [488]:
predictions =(m1.predict(X_test)).astype("int32")

print(classification_report(y_test, predictions))
              precision    recall  f1-score   support

           0       0.88      0.93      0.90        85
           1       0.70      0.56      0.62        25

    accuracy                           0.85       110
   macro avg       0.79      0.74      0.76       110
weighted avg       0.84      0.85      0.84       110

comment:

overall model accuracy has increased to 85%

recall for event 0 (no rain tomorrow) is 93% - the model is still better at predicting that it WILL NOT rain tomorrow

recall for event 1 (rain tomorrow) is 56% - the model still misses 11 of the 25 rainy days, but this is a clear improvement over the RBF model, which caught none!
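The figures in the classification report can be reproduced by hand from the confusion matrix of the linear model shown above, which makes clear what each metric measures. The variable names below are introduced for illustration.

```python
import numpy as np

# Confusion matrix of the linear-kernel model (rows: actual, columns: predicted)
cm = np.array([[79, 6],
               [11, 14]])

# For labels [0, 1], ravel() yields TN, FP, FN, TP in this order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # 93 of 110 days classified correctly
recall_0 = tn / (tn + fp)         # 79 of 85 dry days recognised
recall_1 = tp / (tp + fn)         # 14 of 25 rainy days recognised
precision_1 = tp / (tp + fp)      # 14 of 20 rain predictions were right

print(round(accuracy, 2), round(recall_0, 2),
      round(recall_1, 2), round(precision_1, 2))
```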