Question: How does the presence of murders in various Texas cities, in particular El Paso, relate to other crime statistics?

# Max Smith, March 20, 2016
# mcsmith12@earlham.edu
import numpy as np
import pandas as pd
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
%matplotlib inline

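# read_csv replaces DataFrame.from_csv, which recent pandas no longer provides.
# index_col=0 keeps the first column (the city names) as the index.
dta = pd.read_csv('elpaso.csv', index_col=0)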

# Now get rid of the bad rows
dta = dta.drop(['TEXAS', 'Offenses Known to Law Enforcement', 'by City, 2013', 'City'])

# Do the same for the bad columns
del dta['Unnamed: 4']
del dta['Unnamed: 5']

# Rename the columns to what they actually are.
# Keep in mind that these names will be
# referenced later using dmatrices.
dta = dta.rename(columns = {'Unnamed: 1': 'Population',
                        'Unnamed: 2': 'Violent_Crime',
                        'Unnamed: 3': 'Murder',
                        'Unnamed: 6': 'Robbery',
                        'Unnamed: 7': 'Agg_Assault',
                        'Unnamed: 8': 'Property_Crime',
                        'Unnamed: 9': 'Burglary',
                        'Unnamed: 10': 'Larceny_Theft',
                        'Unnamed: 11': 'Motor_Veh_Theft',
                        'Unnamed: 12': 'Arson'
                        }
            )

# Turn all the objects in the table into floats.
# pd.to_numeric replaces the deprecated convert_objects;
# errors='coerce' turns entries that cannot be parsed into NaN.
dta = dta.apply(pd.to_numeric, errors='coerce').astype(float)
dta.dtypes

Population         float64
Violent_Crime      float64
Murder             float64
Robbery            float64
Agg_Assault        float64
Property_Crime     float64
Burglary           float64
Larceny_Theft      float64
Motor_Veh_Theft    float64
Arson              float64
dtype: object

Note that I did this because the data came from an Excel spreadsheet that was saved as a CSV file. The CSV entries were typed as objects before I converted them to float64.
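To make the conversion step concrete, here is a minimal sketch of how string cells become floats with pd.to_numeric. The three rows are hypothetical stand-ins for the CSV contents, not values from elpaso.csv, and the "n/a" entry is only there to show what coercion does to a cell that cannot be parsed.

```python
import pandas as pd

# Hypothetical rows standing in for the CSV contents; every cell starts as a string.
raw = pd.DataFrame(
    {"Population": ["2831", "120099", "n/a"],
     "Murder": ["0", "4", "1"]},
    index=["City A", "City B", "City C"],
)

# errors="coerce" replaces anything that cannot be parsed with NaN;
# astype(float) makes every column float64 even when no NaN is present.
clean = raw.apply(pd.to_numeric, errors="coerce").astype(float)

print(clean.dtypes)
print(clean)
```

Any NaN produced this way is worth counting before modelling, for example with clean.isna().sum(), since coerced cells silently disappear from later calculations.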

# Check the data
# dta

I will add another column named Murder_bool, which will be 1 if the Murder column is greater than 0 and 0 otherwise.

# I want to 'booleanize' the murder values, just
# to check for the presence of murders, not the sheer number.
dta['Murder_bool'] = (dta.Murder > 0).astype(int)

# Check for the new column to make sure it 
# is working correctly.
dta.Murder_bool

El Paso Stats
Abernathy           0
Abilene             1
Addison             1
Alamo               0
Alamo Heights       0
Alice               0
Allen               0
Alton               1
Alvarado            0
Alvin               1
Amarillo            1
Andrews             0
Angleton            1
Anna                0
Anson               0
Anthony             0
Aransas Pass        0
Arcola              0
Argyle              0
Arlington           1
Arp                 0
Atlanta             0
Aubrey              0
Austin              1
Azle                0
Baird               0
Balch Springs       1
Balcones Heights    0
Ballinger           0
Bangs               0
                   ..
Weslaco             1
West                0
West Columbia       0
West Orange         0
Westover Hills      0
West Tawakoni       0
Westworth           0
Wharton             0
Whitehouse          0
White Oak           0
Whitesboro          0
White Settlement    0
Whitewright         0
Whitney             0
Wichita Falls       1
Willow Park         0
Wills Point         0
Wilmer              0
Windcrest           0
Wink                0
Winnsboro           0
Winters             0
Wolfforth           0
Woodbranch          0
Woodville           0
Woodway             0
Wortham             0
Wylie               1
Yoakum              0
Yorktown            0
Name: Murder_bool, dtype: int64

# Check out some useful stats on the cities with
# murders (Murder_bool = 1) and those without (Murder_bool = 0).
dta.groupby('Murder_bool').mean()

# How does this relate to El Paso in particular?
dta.loc['El Paso']

Population         679700.0
Violent_Crime        2522.0
Murder                 10.0
Robbery               457.0
Agg_Assault          1879.0
Property_Crime      15558.0
Burglary             1771.0
Larceny_Theft       12993.0
Motor_Veh_Theft       794.0
Arson                  73.0
Murder_bool             1.0
Name: El Paso, dtype: float64

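At first glance, compared to the average Texas city with at least one murder, El Paso has more of everything and does not look particularly safer than any other Texas city. These are raw counts, though, and El Paso's population (679,700) is about seven times the average for that group (about 97,375).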

# Here I use dmatrices to set up logistic regression.
# All the entries to the right of the tilde are inputs
# while Murder_bool is what I am attempting to model.
y, X = dmatrices('Murder_bool ~ Population + Violent_Crime + Robbery + Agg_Assault + Property_Crime \
                 + Burglary + Larceny_Theft + Motor_Veh_Theft + Arson',
                  dta, return_type="dataframe")

# The following command turns a 1-D df into an array.
y = np.ravel(y)

model = LogisticRegression()
model = model.fit(X, y)
# Check the accuracy
print(model.score(X, y), y.mean())

0.84219269103 0.25415282392

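Notice that the model's accuracy of 84% is higher than the null accuracy of about 75%, which is the share of cities with no murders (1 - 0.254) and therefore what always predicting "no murder" would score. The inputs carry some information about the presence of murders, although this accuracy is measured on the same data the model was fit to.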
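The comparison between the model's score and the null rate can be worked out directly from the two printed numbers. The sketch below only reuses those values; the variable names are my own.

```python
# Values copied from the printed output above.
accuracy = 0.84219269103           # model.score(X, y)
share_with_murder = 0.25415282392  # y.mean()

# Always predicting the more common class gives the null accuracy.
null_accuracy = max(share_with_murder, 1 - share_with_murder)

# How far the fitted model gets beyond that baseline.
improvement = accuracy - null_accuracy

print(round(null_accuracy, 4), round(improvement, 4))
```

The baseline comes out near 0.746, so the model gains a little under ten percentage points over a constant guess.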

# Now let's look at the coefficients.
pd.DataFrame(list(zip(X.columns, np.transpose(model.coef_))))
# Note that in Python 3, zip returns an iterator,
# which pd.DataFrame could not use directly.
# To get around this, I turned the zip into a list.

Intermediate Conclusions:

The population coefficient is close to 0 (about -1.1e-06), but it is measured per resident, so this alone does not show that the presence of murder is independent of population size. Cities with murders average about 97,000 residents, against about 8,000 for cities without.

Murder is positively associated with violent crime; we expect that more violent crime will correspond to more murder.

The coefficient for robbery is potentially surprising. It is negative: with the other inputs held fixed, more robbery corresponds to a lower predicted chance of murder. Robbery is also counted within the violent crime total, so the two inputs overlap and the sign may not reflect a real effect.

The two largest coefficients are for Violent Crime and Agg Assault. The more violent crime, the greater the chance of murder. The more aggravated assault, however, the lower the predicted chance of murder.

We see that compared to the average Texas city with murders, El Paso has more of everything in raw counts and does not look particularly safer than any other Texas city.
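Logistic regression coefficients are changes in log-odds per unit of each input, so exponentiating them gives odds ratios that are easier to compare. The sketch below does this for three coefficients copied from the first model's output; the helper name odds_ratio and the 100,000-resident step are my own choices for illustration.

```python
import math

# Coefficients copied from the first model's output.
coefs = {
    "Population": -1.12389682932e-06,
    "Violent_Crime": 0.112688390049,
    "Agg_Assault": -0.121192357767,
}

def odds_ratio(coef, units=1.0):
    """Multiplicative change in the odds when the input rises by `units`."""
    return math.exp(coef * units)

# One extra reported offense (or one extra resident) at a time.
for name, coef in coefs.items():
    print(name, round(odds_ratio(coef), 4))

# Population is counted in single residents, so a larger step is more telling.
per_100k = odds_ratio(coefs["Population"], 100000)
print("Population, per 100,000 residents:", round(per_100k, 4))
```

Per 100,000 residents the population odds ratio is about 0.89, similar in size to the effect of a single extra aggravated assault, which suggests the tiny raw coefficient mostly reflects the unit of measurement.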

Thoughts:

Since the population coefficient was close to 0, I will take it out of the equation. Additionally, it may be important to scale everything down by the population. I'll try this below.

dta['Violent_Crime_per_Capita'] = dta.Violent_Crime/dta.Population
dta['Murder_per_capita'] = dta.Murder/dta.Population
dta['Robbery_per_Capita'] = dta.Robbery/dta.Population
dta['Agg_Assualt_per_Capita'] = dta.Agg_Assault/dta.Population
dta['Property_Crime_per_Capita'] = dta.Property_Crime/dta.Population
dta['Burglary_per_Capita'] = dta.Burglary/dta.Population
dta['Larceny_Theft_per_Capita'] = dta.Larceny_Theft/dta.Population
dta['Motor_Veh_Theft_per_Capita'] = dta.Motor_Veh_Theft/dta.Population
dta['Arson_per_Capita'] = dta.Arson/dta.Population

# Drop the raw-count columns now that the per-capita versions exist.
dta = dta.drop(['Population', 'Violent_Crime', 'Murder', 'Robbery',
                'Agg_Assault', 'Property_Crime', 'Burglary',
                'Larceny_Theft', 'Motor_Veh_Theft', 'Arson'], axis=1)
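The nine per-capita divisions above can also be written as one vectorized step. This is a sketch rather than a drop-in replacement: it uses only El Paso's raw counts from the earlier output, and add_suffix produces uniformly named columns (for example Murder_per_Capita), which differ slightly from the hand-typed names used in the formula below.

```python
import pandas as pd

# El Paso's raw counts from the output above; a single row keeps the sketch small.
counts = pd.DataFrame(
    {"Population": [679700.0], "Violent_Crime": [2522.0], "Murder": [10.0]},
    index=["El Paso"],
)

# Every column except Population is a crime count.
crime_cols = [c for c in counts.columns if c != "Population"]

# Divide each crime column by Population row by row, then rename the results.
rates = counts[crime_cols].div(counts["Population"], axis=0).add_suffix("_per_Capita")

print(rates.round(6))
```

Building the rates in a separate frame also leaves the raw counts intact, so nothing has to be deleted afterwards.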

# Prep for logistic regression again.
y, X = dmatrices('Murder_bool ~ Violent_Crime_per_Capita + Robbery_per_Capita + Agg_Assualt_per_Capita\
                + Property_Crime_per_Capita + Burglary_per_Capita + Larceny_Theft_per_Capita\
                + Motor_Veh_Theft_per_Capita + Arson_per_Capita',
                  dta, return_type="dataframe")

# The following command turns a 1-D df into an array.
y = np.ravel(y)

model = LogisticRegression()
model = model.fit(X, y)
# Check the accuracy
print(model.score(X, y), y.mean())

0.74584717608 0.25415282392

pd.DataFrame(list(zip(X.columns, np.transpose(model.coef_))))

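Now we see positive coefficients for everything except Arson, which is much closer to what we expected. Note, however, that the accuracy of 74.6% is the same as the share of cities with no murders (1 - 0.254), so this model scores no better than always predicting "no murder".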
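The second model's score of 0.74584717608 equals 1 - y.mean(), which is what a constant "no murder" prediction would achieve. The sketch below reproduces that number with a constant predictor. The class counts (153 of 602 cities) are an assumption inferred from y.mean() and are not stated anywhere in the notebook; whether the fitted model really predicts a single class would need to be checked with model.predict(X).

```python
import numpy as np

# Assumed class counts: 153 of 602 cities with murders, inferred from y.mean().
n_cities, n_with_murder = 602, 153
y_true = np.array([1] * n_with_murder + [0] * (n_cities - n_with_murder))

# A stand-in "model" that predicts no murder everywhere.
y_pred = np.zeros_like(y_true)

# Share of all cities classified correctly.
accuracy = (y_pred == y_true).mean()

# Share of the cities with murders that the predictor actually finds.
recall = (y_pred[y_true == 1] == 1).mean()

print(round(accuracy, 6), recall)
```

Accuracy alone hides this failure mode, so a confusion matrix or recall on the Murder_bool = 1 class is a useful companion to model.score.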

dta.groupby('Murder_bool').mean()

dta.loc['El Paso']

Murder_bool                   1.000000
Violent_Crime_per_Capita      0.003710
Murder_per_capita             0.000015
Robbery_per_Capita            0.000672
Agg_Assualt_per_Capita        0.002764
Property_Crime_per_Capita     0.022890
Burglary_per_Capita           0.002606
Larceny_Theft_per_Capita      0.019116
Motor_Veh_Theft_per_Capita    0.001168
Arson_per_Capita              0.000107
Name: El Paso, dtype: float64

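Here we see that El Paso's murder rate per capita (0.000015) is much lower than the average for cities that had murders (0.00011). Robbery, property crime, burglary, larceny, motor vehicle theft, and arson are also lower. Aggravated assault (0.002764 against 0.002450) and overall violent crime (0.003710 against 0.003619) are slightly higher.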
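Each of these comparisons can be checked mechanically by dividing El Paso's rate by the Murder_bool = 1 group mean. The values below are copied from the two per-capita outputs in this notebook; the dictionary keys are shortened names of my own.

```python
# Per-capita values copied from the El Paso row and the Murder_bool = 1 group means.
el_paso = {
    "Violent_Crime": 0.003710, "Murder": 0.000015, "Robbery": 0.000672,
    "Agg_Assault": 0.002764, "Property_Crime": 0.022890, "Burglary": 0.002606,
    "Larceny_Theft": 0.019116, "Motor_Veh_Theft": 0.001168, "Arson": 0.000107,
}
group_mean = {
    "Violent_Crime": 0.003619, "Murder": 0.00011, "Robbery": 0.000720,
    "Agg_Assault": 0.002450, "Property_Crime": 0.033007, "Burglary": 0.006874,
    "Larceny_Theft": 0.024320, "Motor_Veh_Theft": 0.001813, "Arson": 0.000113,
}

# A ratio above 1 means El Paso is above the group mean for that category.
ratios = {name: el_paso[name] / group_mean[name] for name in el_paso}
higher = [name for name, ratio in ratios.items() if ratio > 1]

for name, ratio in ratios.items():
    print(name, round(ratio, 2))
print("Above the group mean:", higher)
```

With these values only violent crime and aggravated assault come out above the group mean, and the murder ratio is roughly 0.14.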

Output of dta.groupby('Murder_bool').mean() on the raw counts:

               Population  Violent_Crime    Murder     Robbery  Agg_Assault  Property_Crime    Burglary  Larceny_Theft  Motor_Veh_Theft      Arson
Murder_bool
0             8066.574279      20.485588  0.000000    3.095344    15.221729      213.164080   43.702882     159.046563        10.414634   0.824053
1            97374.692810     501.163399  5.509804  168.503268   291.712418     3943.281046  809.888889    2816.261438       317.130719  18.117647

Coefficients of the first model (raw counts):

                 0                     1
0        Intercept      [-1.05550340161]
1       Population  [-1.12389682932e-06]
2    Violent_Crime      [0.112688390049]
3          Robbery    [-0.0817692329595]
4      Agg_Assault     [-0.121192357767]
5   Property_Crime   [0.000610325531079]
6         Burglary    [0.00530687779442]
7    Larceny_Theft  [-0.000557323729591]
8  Motor_Veh_Theft   [-0.00413922853344]
9            Arson    [-0.0324502574337]

Coefficients of the second model (per-capita rates):

                            0                     1
0                   Intercept     [-0.552787110652]
1    Violent_Crime_per_Capita      [0.101829443719]
2          Robbery_per_Capita     [0.0413803070789]
3      Agg_Assualt_per_Capita      [0.041410847484]
4   Property_Crime_per_Capita        [0.7469159364]
5         Burglary_per_Capita      [0.077856763464]
6    Larceny_Theft_per_Capita      [0.605172501344]
7  Motor_Veh_Theft_per_Capita     [0.0638866715927]
8            Arson_per_Capita  [-0.000219760519256]

Output of dta.groupby('Murder_bool').mean() on the per-capita rates:

             Violent_Crime_per_Capita  Murder_per_capita  Robbery_per_Capita  Agg_Assualt_per_Capita  Property_Crime_per_Capita  Burglary_per_Capita  Larceny_Theft_per_Capita  Motor_Veh_Theft_per_Capita  Arson_per_Capita
Murder_bool
0                            0.002663            0.00000            0.000339                0.002049                   0.025765             0.006098                  0.018447                    0.001221          0.000114
1                            0.003619            0.00011            0.000720                0.002450                   0.033007             0.006874                  0.024320                    0.001813          0.000113