Tucson Crime Analysis


Authors: Nathan Tebbs, Andrew Hicks, Cole Hageman
[python, data-science, ml, regression, geospatial, tucson]


Abstract

This post covers, in some detail, a project taken on by two peers and me. For a more detailed and academic presentation of our findings, please see:

Overview

This project analyzes how reported crime patterns in Tucson relate to (1) neighborhood income and (2) the presence of streetlights. We focused on two main hypotheses:

  • Crime vs. Wealth: do thefts and violent crimes occur more often in richer or poorer areas?
  • Crime vs. Streetlights: does streetlight presence influence crime rates, especially at night?

The pipeline combines data cleaning + feature engineering, exploratory visualization, and several models (Ridge Regression, Random Forest, Logistic Regression, and OLS).

Data sources (inputs)

  • Tucson Police Reported Crimes (CSV)
  • Tucson Police Arrests (CSV)
  • City of Tucson Streetlight Locations (CSV)
  • Neighborhood Income (CSV)

Loading + preprocessing

A key step was normalizing time fields (extracting an hour from TimeOccur) and building a time period label to support night-crime analysis.
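
The raw load step isn't shown in the notebook excerpt, so here is a minimal sketch of it, assuming the four CSVs sit beside the notebook under hypothetical filenames:

import pandas as pd

# Hypothetical filenames; substitute the actual exports from Tucson's open-data portal
crime_df = pd.read_csv("tucson_reported_crimes.csv")
arrest_df = pd.read_csv("tucson_arrests.csv")
streetlight_df = pd.read_csv("tucson_streetlight_locations.csv")
income_df = pd.read_csv("tucson_neighborhood_income.csv")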

import numpy as np

# Convert DateOccurred to datetime
crime_df["DateOccurred"] = pd.to_datetime(crime_df["DateOccurred"], errors="coerce")

# Parse HHMM-style strings (e.g., "930", "1745") into an integer hour
def extract_hour(time_str):
    try:
        time_str = str(time_str).strip()
        if time_str.isdigit() and 3 <= len(time_str) <= 4:
            time_str = time_str.zfill(4)
            hour = int(time_str[:2])
            if 0 <= hour <= 23:
                return hour
        return np.nan
    except (ValueError, TypeError):
        return np.nan

crime_df["Hour"] = crime_df["TimeOccur"].apply(extract_hour)

# Bucket hours into coarse time-of-day labels; 22:00-04:59 counts as Night
def categorize_time(hour):
    if pd.isna(hour): return "Unknown"
    elif 5 <= hour < 12: return "Morning"
    elif 12 <= hour < 17: return "Afternoon"
    elif 17 <= hour < 22: return "Evening"
    else: return "Night"

crime_df["Time_Period"] = crime_df["Hour"].apply(categorize_time)

# Filter years and remove rows missing critical fields
crime_df = crime_df[crime_df["Year"].isin([2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025])]
crime_df = crime_df.dropna(subset=["Ward", "UCRDescription", "DateOccurred", "Hour"])
crime_df.loc[:, "Ward"] = crime_df["Ward"].astype(int)

Building ward-level features + integrating datasets

To compare areas consistently, we aggregated everything by ward:

  • Crime_Count: total reported crimes per ward
  • Arrest_Count: total arrests per ward
  • Night_Crime_Prop: proportion of crimes occurring at night
  • Streetlight_Count: streetlights assigned to wards via a nearest spatial join in GeoPandas
# Aggregate crime and arrest counts by ward
crime_by_ward = crime_df.groupby("Ward", observed=False).size().reset_index(name="Crime_Count")
arrest_by_ward = arrest_df.groupby("WARD", observed=False).size().reset_index(name="Arrest_Count")

# Proportion of nighttime crimes per ward
night_crimes = (
    crime_df[crime_df["Time_Period"] == "Night"]
    .groupby("Ward", observed=False).size()
    .reset_index(name="Night_Crime_Count")
)
total_crimes = crime_df.groupby("Ward", observed=False).size().reset_index(name="Total_Crime_Count")
night_crime_prop = night_crimes.merge(total_crimes, on="Ward")
night_crime_prop["Night_Crime_Prop"] = night_crime_prop["Night_Crime_Count"] / night_crime_prop["Total_Crime_Count"]
night_crime_prop = night_crime_prop[["Ward", "Night_Crime_Prop"]]

# Merge with income data (ward key); drop the redundant right-hand "Ward" key after each merge
merged_df = income_df.merge(crime_by_ward, left_on="WARD", right_on="Ward", how="left").drop(columns=["Ward"])
merged_df = merged_df.merge(arrest_by_ward, on="WARD", how="left")
merged_df = merged_df.merge(night_crime_prop, left_on="WARD", right_on="Ward", how="left").drop(columns=["Ward"])

# Spatial join: assign streetlights to wards using the nearest arrest geometry
import geopandas as gpd
from shapely.geometry import Point

arrest_gdf = gpd.GeoDataFrame(
    arrest_df,
    geometry=[Point(xy) for xy in zip(arrest_df["X"], arrest_df["Y"])],
    crs="EPSG:2868",
)
streetlight_gdf = gpd.GeoDataFrame(
    streetlight_df,
    geometry=[Point(xy) for xy in zip(streetlight_df["X"], streetlight_df["Y"])],
    crs="EPSG:2868",
)

streetlight_by_ward = gpd.sjoin_nearest(
    streetlight_gdf,
    arrest_gdf[["WARD", "geometry"]],
    how="left",
    max_distance=1000,  # EPSG:2868 is in feet, so matches are capped at 1,000 ft
)
streetlight_count = streetlight_by_ward.groupby("WARD").size().reset_index(name="Streetlight_Count")

merged_df = merged_df.merge(streetlight_count, on="WARD", how="left")
merged_df["Streetlight_Count"] = merged_df["Streetlight_Count"].fillna(0)

Exploratory Analysis

We used a correlation heatmap and a scatter plot to sanity-check relationships among:

  • income (MEDHINC_CY, AVGHINC_CY)
  • ward crime and arrests
  • streetlight counts
  • night-crime proportion
import matplotlib.pyplot as plt
import seaborn as sns

correlation_matrix = merged_df[
    ["MEDHINC_CY", "AVGHINC_CY", "Streetlight_Count", "Crime_Count", "Arrest_Count", "Night_Crime_Prop"]
].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", center=0)
plt.title("Correlation Matrix of Income, Streetlights, Night Crimes, and Crime/Arrests")
plt.show()

plt.figure(figsize=(12, 8))
sns.scatterplot(
    data=merged_df,
    x="MEDHINC_CY",
    y="Crime_Count",
    size="Streetlight_Count",
    hue="Night_Crime_Prop",
)
plt.title("Crime Count vs Median Household Income (size=streetlights, hue=night crime)")
plt.xlabel("Median Household Income ($)")
plt.ylabel("Crime Count")
plt.show()

Modeling: high-crime wards (Random Forest + Logistic Regression)

To turn this into a prediction problem, we defined a high-crime ward as one above the 75th percentile of Crime_Count. We then trained two classifiers using:

  • MEDHINC_CY (median income)
  • AVGHINC_CY (average income)
  • Streetlight_Count (streetlights per ward)
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Define high-crime wards as the top quartile (above the 75th percentile) of Crime_Count
merged_df["High_Crime"] = (merged_df["Crime_Count"] > merged_df["Crime_Count"].quantile(0.75)).astype(int)

X = merged_df[["MEDHINC_CY", "AVGHINC_CY", "Streetlight_Count"]]
y = merged_df["High_Crime"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # seed for reproducibility

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

rf_model = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)
rf_model.fit(X_train_scaled, y_train)
rf_pred = rf_model.predict(X_test_scaled)

lr_model = LogisticRegression(class_weight="balanced")
lr_model.fit(X_train_scaled, y_train)
lr_pred = lr_model.predict(X_test_scaled)

print("Random Forest Performance:")
print(f"Accuracy: {accuracy_score(y_test, rf_pred):.2f}")
print(f"F1-Score: {f1_score(y_test, rf_pred):.2f}")

print("\nLogistic Regression Performance:")
print(f"Accuracy: {accuracy_score(y_test, lr_pred):.2f}")
print(f"F1-Score: {f1_score(y_test, lr_pred):.2f}")

We also used a feature-importance plot to interpret what the Random Forest relied on most.

feature_importance = pd.DataFrame({
    "Feature": X.columns,
    "Importance": rf_model.feature_importances_,
}).sort_values("Importance", ascending=False)

plt.figure(figsize=(8, 6))
sns.barplot(x="Importance", y="Feature", data=feature_importance)
plt.title("Random Forest Feature Importance")
plt.show()

Regression: estimating relationships (OLS)

Finally, we ran OLS models to quantify the relationships between income, streetlight count, and crime count:

  • Model 1: Crime_Count ~ MEDHINC_CY
  • Model 2: Crime_Count ~ MEDHINC_CY + Streetlight_Count
import statsmodels.api as sm

# Model 1: Crime_Count ~ MEDHINC_CY
X_vt = sm.add_constant(merged_df["MEDHINC_CY"])
y_vt = merged_df["Crime_Count"]
model = sm.OLS(y_vt, X_vt).fit()

print("Model 1: Crime_Count ~ MEDHINC_CY")
print(model.summary())

# Model 2: Crime_Count ~ MEDHINC_CY + Streetlight_Count
X_light = sm.add_constant(merged_df[["Streetlight_Count", "MEDHINC_CY"]])
y_light = merged_df["Crime_Count"]
light_model = sm.OLS(y_light, X_light).fit()

print("\nModel 2: Crime_Count ~ MEDHINC_CY + Streetlight_Count")
print(light_model.summary())
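
To compare the two fits at a glance, a short follow-up cell can pull the R-squared values and coefficients out of the fitted results, which statsmodels exposes as .rsquared and .params:

# Summarize explanatory power and coefficients for both models
for name, fitted in [("Income only", model), ("Income + streetlights", light_model)]:
    print(f"{name}: R^2 = {fitted.rsquared:.3f}")
    print(fitted.params)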

Takeaways

  • The analysis supports an inverse relationship between income and crime count at the ward level.
  • Including streetlight count improves explanatory power and appears to matter alongside income.
  • Classification models (RF/LR) provide a practical way to label wards as "high crime" based on a small set of features.

Next steps

  • Make the pipeline fully reproducible outside Colab (local paths + env file); a rough sketch of the env-file idea follows this list.
  • Add a dedicated evaluation for night crime prediction using Night_Crime_Prop.
  • Consider stronger spatial methods (spatial lag/error models) for geographic dependence.
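
As a rough illustration of the env-file idea in the first bullet, here is a minimal sketch assuming the python-dotenv package and a hypothetical DATA_DIR entry in a local .env file:

import os

import pandas as pd
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # read variables from a local .env file
DATA_DIR = os.getenv("DATA_DIR", "data")  # hypothetical variable; falls back to ./data

# Hypothetical filename; substitute the real export
crime_df = pd.read_csv(os.path.join(DATA_DIR, "tucson_reported_crimes.csv"))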