Question: How does the presence of murders in various Texas cities, in particular El Paso, relate to other crime statistics?

In [1]:
# Max Smith, March 20, 2016
# mcsmith12@earlham.edu
import numpy as np
import pandas as pd
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
%matplotlib inline
In [2]:
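# Read the CSV, using the first column (city names) as the index.
# pd.read_csv replaces DataFrame.from_csv, which newer pandas no longer provides.
dta = pd.read_csv('elpaso.csv', index_col=0)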

# Now get rid of the bad rows
dta = dta.drop(['TEXAS', 'Offenses Known to Law Enforcement', 'by City, 2013', 'City'])

# Do the same for the bad columns
del dta['Unnamed: 4']
del dta['Unnamed: 5']

# Rename the columns to what they actually are.
# Keep in mind that these names will be
# referenced later using dmatrices.
dta = dta.rename(columns = {'Unnamed: 1': 'Population',
                        'Unnamed: 2': 'Violent_Crime',
                        'Unnamed: 3': 'Murder',
                        'Unnamed: 6': 'Robbery',
                        'Unnamed: 7': 'Agg_Assault',
                        'Unnamed: 8': 'Property_Crime',
                        'Unnamed: 9': 'Burglary',
                        'Unnamed: 10': 'Larceny_Theft',
                        'Unnamed: 11': 'Motor_Veh_Theft',
                        'Unnamed: 12': 'Arson'
                        }
            )

# Turn all the objects in the table into floats.
# pd.to_numeric replaces the deprecated convert_objects;
# entries that cannot be parsed become NaN.
dta = dta.apply(pd.to_numeric, errors='coerce').astype(float)
dta.dtypes
Out[2]:
Population         float64
Violent_Crime      float64
Murder             float64
Robbery            float64
Agg_Assault        float64
Property_Crime     float64
Burglary           float64
Larceny_Theft      float64
Motor_Veh_Theft    float64
Arson              float64
dtype: object

Note that I did this because I imported the data from an Excel spreadsheet that was saved as a CSV file. The CSV entries were typed as objects before I converted them to float64.
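
As a minimal sketch of this conversion, the example below builds a tiny made-up table of strings and converts it with pd.to_numeric, the documented replacement for the deprecated convert_objects. The city names and values are hypothetical and are not taken from elpaso.csv.

```python
import pandas as pd

# Hypothetical stand-in for the raw CSV: every entry is a string,
# so pandas does not treat the columns as numeric.
raw = pd.DataFrame(
    {'Population': ['1000', '250', 'n/a'],
     'Murder': ['2', '0', '1']},
    index=['City A', 'City B', 'City C'],
)

# Convert column by column; entries that cannot be parsed become NaN.
# astype(float) forces float64 even for columns that parse as integers.
converted = raw.apply(pd.to_numeric, errors='coerce').astype(float)

print(converted.dtypes)
```

The 'n/a' entry ends up as NaN rather than raising an error, which is worth checking for before fitting a model.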

In [3]:
# Check the data
# dta

I will add another column named Murder_bool, which will be 1 if the Murder column is greater than 0 and 0 otherwise.

In [4]:
# I want to 'booleanize' the murder values, just
# to check for the presence of murders, not the sheer number.
dta['Murder_bool'] = (dta.Murder > 0).astype(int)

# Check for the new column to make sure it 
# is working correctly.
dta.Murder_bool
Out[4]:
El Paso Stats
Abernathy           0
Abilene             1
Addison             1
Alamo               0
Alamo Heights       0
Alice               0
Allen               0
Alton               1
Alvarado            0
Alvin               1
Amarillo            1
Andrews             0
Angleton            1
Anna                0
Anson               0
Anthony             0
Aransas Pass        0
Arcola              0
Argyle              0
Arlington           1
Arp                 0
Atlanta             0
Aubrey              0
Austin              1
Azle                0
Baird               0
Balch Springs       1
Balcones Heights    0
Ballinger           0
Bangs               0
                   ..
Weslaco             1
West                0
West Columbia       0
West Orange         0
Westover Hills      0
West Tawakoni       0
Westworth           0
Wharton             0
Whitehouse          0
White Oak           0
Whitesboro          0
White Settlement    0
Whitewright         0
Whitney             0
Wichita Falls       1
Willow Park         0
Wills Point         0
Wilmer              0
Windcrest           0
Wink                0
Winnsboro           0
Winters             0
Wolfforth           0
Woodbranch          0
Woodville           0
Woodway             0
Wortham             0
Wylie               1
Yoakum              0
Yorktown            0
Name: Murder_bool, dtype: int64
In [5]:
# Check out some useful stats on the cities with
# murders (Murder_bool = 1) and those without (Murder_bool = 0).
dta.groupby('Murder_bool').mean()
Out[5]:
               Population  Violent_Crime    Murder     Robbery  Agg_Assault  Property_Crime    Burglary  Larceny_Theft  Motor_Veh_Theft      Arson
Murder_bool
0             8066.574279      20.485588  0.000000    3.095344    15.221729      213.164080   43.702882     159.046563        10.414634   0.824053
1            97374.692810     501.163399  5.509804  168.503268   291.712418     3943.281046  809.888889    2816.261438       317.130719  18.117647
In [6]:
# How does this relate to El Paso in particular?
dta.loc['El Paso']
Out[6]:
Population         679700.0
Violent_Crime        2522.0
Murder                 10.0
Robbery               457.0
Agg_Assault          1879.0
Property_Crime      15558.0
Burglary             1771.0
Larceny_Theft       12993.0
Motor_Veh_Theft       794.0
Arson                  73.0
Murder_bool             1.0
Name: El Paso, dtype: float64

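At first glance, El Paso has more of everything than the average Texas city with at least one murder, so by raw counts it does not look any safer than other Texas cities. Keep in mind, though, that El Paso's population (679,700) is about seven times that group's mean (about 97,375), so raw counts alone are not a fair comparison.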

In [7]:
# Here I use dmatrices to set up logistic regression.
# All the entries to the right of the tilde are inputs
# while Murder_bool is what I am attempting to model.
y, X = dmatrices('Murder_bool ~ Population + Violent_Crime + Robbery + Agg_Assault + Property_Crime \
                 + Burglary + Larceny_Theft + Motor_Veh_Theft + Arson',
                  dta, return_type="dataframe")

# The following command turns a 1-D df into an array.
y = np.ravel(y)
In [8]:
model = LogisticRegression()
model = model.fit(X, y)
# Check the accuracy
print(model.score(X, y), y.mean())
0.84219269103 0.25415282392

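Note that y.mean() is the fraction of cities with at least one murder (about 25%), so a null model that always predicts "no murder" would already be right about 75% of the time. The model's accuracy of 84% is roughly 10 percentage points above that baseline, which suggests that the inputs carry some information about the presence of murders. This accuracy is measured on the same data the model was fit to, so it is likely optimistic.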
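
To make the comparison with the null model concrete, here is a small sketch that recomputes the baseline from the two numbers printed by the cell above. The variable names are my own and are not part of the notebook.

```python
# Values printed by print(model.score(X, y), y.mean()) above
model_accuracy = 0.84219269103
murder_fraction = 0.25415282392  # share of cities with Murder_bool == 1

# A null model predicts the majority class for every city,
# so its accuracy is the share of the larger class.
null_accuracy = max(murder_fraction, 1 - murder_fraction)

# How far the fitted model is above that baseline
improvement = model_accuracy - null_accuracy

print(round(null_accuracy, 4), round(improvement, 4))
```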

In [9]:
# Now let's look at the coefficients.
pd.DataFrame(list(zip(X.columns, np.transpose(model.coef_))))
# Note that in Python 3, zip returns an
# iterator, so pd.DataFrame could not
# use it directly. To get around this,
# I had to turn the zip into a list.
Out[9]:
                 0                     1
0        Intercept      [-1.05550340161]
1       Population  [-1.12389682932e-06]
2    Violent_Crime      [0.112688390049]
3          Robbery    [-0.0817692329595]
4      Agg_Assault     [-0.121192357767]
5   Property_Crime   [0.000610325531079]
6         Burglary    [0.00530687779442]
7    Larceny_Theft  [-0.000557323729591]
8  Motor_Veh_Theft   [-0.00413922853344]
9            Arson    [-0.0324502574337]

Intermediate Conclusions:

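The population coefficient is close to 0 (about -1.1e-06 per resident). Because the inputs are not standardized and population is measured in far larger units than the crime counts, this alone does not show that the presence of murder is independent of population size. The group means above show that cities with murders are, on average, about twelve times larger than cities without.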

The presence of murder is positively associated with violent crime: we expect that more violent crime will correspond to a higher chance of at least one murder.

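The coefficient for robbery is potentially surprising: holding the other inputs fixed, more robbery corresponds to a lower predicted chance of murder. This is an association in the fitted model, not evidence that robbery reduces murder.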

The two strongest coefficients are those for Violent_Crime and Agg_Assault. The more violent crime, the greater the chance of murder. The more aggravated assault, however, the less likely a murder is to occur.

Compared to the average Texas city with at least one murder, El Paso has higher raw counts in every category, so on these numbers alone it does not look any safer than other Texas cities.
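
One common way to read a logistic regression coefficient is to exponentiate it, which gives the multiplicative change in the odds for a one-unit increase in that input with the others held fixed. The sketch below does this for three coefficients copied from Out[9]. It is only an illustration: the fitted values also depend on scikit-learn's default regularization and on the inputs being unscaled raw counts.

```python
import math

# Coefficients copied from Out[9]; each is a change in log-odds
# per one additional reported offense, holding the other inputs fixed.
coefficients = {
    'Violent_Crime': 0.112688390049,
    'Robbery': -0.0817692329595,
    'Agg_Assault': -0.121192357767,
}

# exp(coefficient) is the multiplicative change in the odds
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f'{name}: odds multiplied by {ratio:.3f} per extra offense')
```

A ratio above 1 means the predicted odds of at least one murder go up with that input, and a ratio below 1 means they go down.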

Thoughts:

Since the population coefficient was close to zero, I will take population out of the model as a separate input. It may also be important to scale every count by population, so I'll try that below.

In [10]:
dta['Violent_Crime_per_Capita'] = dta.Violent_Crime/dta.Population
dta['Murder_per_capita'] = dta.Murder/dta.Population
dta['Robbery_per_Capita'] = dta.Robbery/dta.Population
dta['Agg_Assualt_per_Capita'] = dta.Agg_Assault/dta.Population
dta['Property_Crime_per_Capita'] = dta.Property_Crime/dta.Population
dta['Burglary_per_Capita'] = dta.Burglary/dta.Population
dta['Larceny_Theft_per_Capita'] = dta.Larceny_Theft/dta.Population
dta['Motor_Veh_Theft_per_Capita'] = dta.Motor_Veh_Theft/dta.Population
dta['Arson_per_Capita'] = dta.Arson/dta.Population
In [11]:
# Drop the raw counts now that the per-capita columns exist
dta = dta.drop(columns=['Population', 'Violent_Crime', 'Murder', 'Robbery',
                        'Agg_Assault', 'Property_Crime', 'Burglary',
                        'Larceny_Theft', 'Motor_Veh_Theft', 'Arson'])
In [12]:
# Prep for logistic regression again.
y, X = dmatrices('Murder_bool ~ Violent_Crime_per_Capita + Robbery_per_Capita + Agg_Assualt_per_Capita\
                + Property_Crime_per_Capita + Burglary_per_Capita + Larceny_Theft_per_Capita\
                + Motor_Veh_Theft_per_Capita + Arson_per_Capita',
                  dta, return_type="dataframe")

# The following command turns a 1-D df into an array.
y = np.ravel(y)

model = LogisticRegression()
model = model.fit(X, y)
# Check the accuracy
print(model.score(X, y), y.mean())
0.74584717608 0.25415282392
In [13]:
pd.DataFrame(list(zip(X.columns, np.transpose(model.coef_))))
Out[13]:
                            0                     1
0                   Intercept     [-0.552787110652]
1    Violent_Crime_per_Capita      [0.101829443719]
2          Robbery_per_Capita     [0.0413803070789]
3      Agg_Assualt_per_Capita      [0.041410847484]
4   Property_Crime_per_Capita        [0.7469159364]
5         Burglary_per_Capita      [0.077856763464]
6    Larceny_Theft_per_Capita      [0.605172501344]
7  Motor_Veh_Theft_per_Capita     [0.0638866715927]
8            Arson_per_Capita  [-0.000219760519256]

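Now every coefficient is positive except the one for Arson, which is close to zero. This is much closer to what we expected. Note, however, that the accuracy of about 74.6% equals the null baseline (1 - 0.254), so on accuracy alone the per-capita model does no better than always predicting "no murder".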

In [14]:
dta.groupby('Murder_bool').mean()
Out[14]:
             Violent_Crime_per_Capita  Murder_per_capita  Robbery_per_Capita  Agg_Assualt_per_Capita  Property_Crime_per_Capita  Burglary_per_Capita  Larceny_Theft_per_Capita  Motor_Veh_Theft_per_Capita  Arson_per_Capita
Murder_bool
0                            0.002663            0.00000            0.000339                0.002049                   0.025765             0.006098                  0.018447                    0.001221          0.000114
1                            0.003619            0.00011            0.000720                0.002450                   0.033007             0.006874                  0.024320                    0.001813          0.000113
In [15]:
dta.loc['El Paso']
Out[15]:
Murder_bool                   1.000000
Violent_Crime_per_Capita      0.003710
Murder_per_capita             0.000015
Robbery_per_Capita            0.000672
Agg_Assualt_per_Capita        0.002764
Property_Crime_per_Capita     0.022890
Burglary_per_Capita           0.002606
Larceny_Theft_per_Capita      0.019116
Motor_Veh_Theft_per_Capita    0.001168
Arson_per_Capita              0.000107
Name: El Paso, dtype: float64

Here we see that El Paso's murder rate per capita (0.000015) is much lower than the average for cities that had murders (0.00011). Robbery, property crime, burglary, larceny, motor vehicle theft, and arson are also lower. Violent crime overall and aggravated assault, however, are slightly higher than the group average.
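
A quick way to check each of these comparisons is to divide El Paso's per-capita value by the group mean for cities with murders. The sketch below does that with the numbers copied from Out[14] and Out[15]. The shortened dictionary keys are my own labels, not the column names in dta.

```python
# Per-capita values copied from Out[14] (Murder_bool = 1 row)
group_mean = {
    'Violent_Crime': 0.003619, 'Murder': 0.00011, 'Robbery': 0.000720,
    'Agg_Assault': 0.002450, 'Property_Crime': 0.033007,
    'Burglary': 0.006874, 'Larceny_Theft': 0.024320,
    'Motor_Veh_Theft': 0.001813, 'Arson': 0.000113,
}

# Per-capita values copied from Out[15] (El Paso row)
el_paso = {
    'Violent_Crime': 0.003710, 'Murder': 0.000015, 'Robbery': 0.000672,
    'Agg_Assault': 0.002764, 'Property_Crime': 0.022890,
    'Burglary': 0.002606, 'Larceny_Theft': 0.019116,
    'Motor_Veh_Theft': 0.001168, 'Arson': 0.000107,
}

# A ratio below 1 means El Paso is below the group average
ratios = {name: el_paso[name] / group_mean[name] for name in group_mean}

# Categories where El Paso is above the group average
higher = sorted(name for name, r in ratios.items() if r > 1)

for name, r in ratios.items():
    print(f'{name}: {r:.2f}')
print('Above average:', higher)
```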
