Analysis of Terry Stops in Seattle

A data science project gone awry

Melody Peterson
8 min read · Jan 8, 2021
Photo by Kevork Kurdoghlian on Unsplash

A Terry stop in the United States allows the police to briefly detain a person based on reasonable suspicion of involvement in criminal activity. Reasonable suspicion is a lower standard than probable cause which is needed for arrest. When police stop and search a pedestrian, this is commonly known as a stop and frisk. When police stop an automobile, this is known as a traffic stop. If the police stop a motor vehicle on minor infringements in order to investigate other suspected criminal activity, this is known as a pretextual stop. — Wikipedia

In this data science boot camp project, I attempted to analyze data provided by the city of Seattle to build a classification model that would accurately portray the relationship between several key demographic variables recorded by the officer performing the stop and whether or not an arrest was made, as recorded in the Arrest Flag target variable.

Due to the sensitive nature of the data provided, including the race and gender of both the officer and the suspect, I was not concerned with the model’s ability to predict future arrests based on this information. That would mean building any existing racial or gender bias into the model and thus perpetuating it. So instead of providing predictive value, I was merely looking for a model that accurately classifies the data I had been given.

This seems like a good place to look at what a false positive and a false negative would be. I have coded an arrest as the positive case, so a false positive is predicting an arrest when no arrest was made, and a false negative is predicting NO arrest when an arrest was made. This model won’t be used to predict future Terry Stops; it is just to analyze past decisions. Therefore, I think the most important metric will be accuracy: how accurately did my model classify the actual data that I was given? This will give us the most useful information about how the decisions appear to have been influenced by the variables examined. Accuracy answers the question, ‘How many classifications did my model get right?’

F1 is possibly also the right metric for my model, as it gives a more generalized view of the model without as much potential for overfitting. If I just focus on accuracy and fit exactly to the data, I cannot make any generalizations about observations outside our data.
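
As a rough sketch of what that comparison looks like in code (the classifier clf and the test split here are purely illustrative placeholders, not my actual pipeline):

# Minimal sketch of the metric comparison, assuming scikit-learn and
# an already-fitted classifier `clf` (names are illustrative only)
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

y_pred = clf.predict(X_test)

# Rows are actual, columns are predicted; with arrest as the positive
# class, the off-diagonal cells are the false positives (predicted an
# arrest that wasn't made) and the false negatives (missed arrests)
print(confusion_matrix(y_test, y_pred))

print("Accuracy:", accuracy_score(y_test, y_pred))  # share of correct calls
print("F1:", f1_score(y_test, y_pred))              # balances precision and recall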

Data Cleaning and EDA

Gloved hands cleaning a surface
Photo by Anton on Unsplash

This being my first solo end-to-end Data Science project, I was overdue to learn the lesson of just how important it is to understand, visualize, and examine your data before you try to begin modeling. At first glance I had noticed that the data appeared to be ‘mostly’ sorted by date. It wasn’t perfectly sorted, but if I needed it to be, I could fix that.

Then, starting my analysis of the first variable, Subject Age Group, I noted that 1449 of the observations had a dash in this field. That was easy enough to replace with ‘Unknown’ for clarity, but I happened to take a look at these records. They were 1449 of the first 1459 records in my dataframe, which was obviously not a coincidence. These early records must have come from a time before they started recording the Subject Age Group. I had previously worked in software implementations and was well aware that when switching to a new method of reporting, some data fields just get filled in with a placeholder. If the unknown ages appear relevant in my models, it may have more to do with the date than with the subject’s actual age.
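
The cleanup itself was basically a one-liner in pandas, something along these lines (the file name and exact column label are my assumptions, not necessarily the dataset’s real labels):

import pandas as pd

# Hypothetical sketch: swap the '-' placeholder for 'Unknown'
# (file and column names assumed)
df = pd.read_csv('terry_stops.csv')
df['Subject Age Group'] = df['Subject Age Group'].replace('-', 'Unknown')

# Peek at where the placeholders fall in the (mostly date-sorted) frame
print(df[df['Subject Age Group'] == 'Unknown'].index[:10])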

Next I looked at Subject ID. There were over 8000 unique values, some used more than once, but the overwhelming majority were blank, or rather -1. I chose to change the -1’s to ‘Unidentified’, as the data description indicated that subjects are not required to provide identification in a stop-and-frisk situation. Then I binned the IDs that occurred only once into “First” and the repeated IDs into “Repeat”. I hoped this would give me some insight into whether a subject is more likely to be arrested if they have been repeatedly stopped, or perhaps if they refused to provide identification.
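
A sketch of that binning, under the assumption that -1 (or the string '-1') is the missing-ID marker:

# Sketch of the Subject ID binning; -1 assumed to mark a missing ID
df['Subject ID'] = df['Subject ID'].replace([-1, '-1'], 'Unidentified')

counts = df['Subject ID'].value_counts()
repeat_ids = set(counts[counts > 1].index) - {'Unidentified'}

df['Subject ID Bin'] = df['Subject ID'].apply(
    lambda sid: 'Unidentified' if sid == 'Unidentified'
    else ('Repeat' if sid in repeat_ids else 'First'))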

None of the Unidentified had Arrest Flag = Y. I guess they didn’t arrest anyone without identification. There is the missing piece: once they arrest you, they take your identification. So if you were arrested, you are necessarily no longer Unidentified. This seems like it is going to be a problem, but let’s keep looking at each variable.

Several more variables went by, like Weapon Type and Terry Stop ID. Generally I binned the values together when it made sense and deleted the column entirely if it overly complicated the process or I didn’t think it would have an effect on the modeling.

Next I came to Stop Resolution. One of the possible values in this column is Arrest, which seems like it should be an indicator for the Arrest Flag. It turned out to be only a one-way interaction, if I can describe it that way. Nearly all of the observations with Arrest Flag = Y also have a Stop Resolution of Arrest, but over 8000 Stop Resolutions of Arrest have an Arrest Flag = N. What does this mean?

In my naivety, I chose to leave the variable as it was; if the initial models indicated that it was a confounding variable, then I would have to address it. Then I came to Reported Date. I had heard that you shouldn’t have a date as a variable in your model. I wasn’t really sure why, except that if that variable is significant it means a particular date is important, and I guess that would be strange. So I separated it out into month, year, and day of the week, but I left the date in there too (converted to an ordinal rather than a date format). I was planning on using SMOTENC to deal with the class imbalance since almost all my data was categorical, but that requires at least one continuous variable, so Date would be it.
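
The date handling looked roughly like this (column names are assumed, and I am glossing over any parsing quirks):

# Sketch of the date features, assuming 'Reported Date' parses cleanly
df['Reported Date'] = pd.to_datetime(df['Reported Date'])

df['Year'] = df['Reported Date'].dt.year
df['Month'] = df['Reported Date'].dt.month
df['Day of Week'] = df['Reported Date'].dt.day_name()

# Keep an ordinal copy of the date as the one continuous feature
df['Date Ordinal'] = df['Reported Date'].map(lambda d: d.toordinal())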

Dates ranged from March 2015 through the present, with about 7000–8000 stops per year, which I learned from a histogram plot. So I moved on with data cleaning, a train-test split, SMOTE, and scaling my one continuous variable, Date. My first model, a logistic regression, came out perfect on all scores, so obviously something was wrong. I suspected Stop Resolution. I ran an ANOVA test and found three variables with p-values literally equal to zero: Stop Resolution, Subject ID Bin, and I believe Call Type. Then I ran an initial decision tree model to see which variables were at the top of the tree for information gain.
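
That feature-importance check was along these lines (X_train and y_train stand in for whatever encoded training frame I had at that point; the names are illustrative):

# Sketch of the quick decision-tree check on feature importance
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X_train, y_train)

importances = pd.Series(tree.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))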

Subject ID Bin was at the top. Having figured out that anyone who was arrested would necessarily be identified, I needed to remove this variable and rerun the models. Suddenly Date was my top variable. I needed to understand why Date would be so important, so I ran a plot of Date versus Arrest Flag.

Author’s personal image

All of the arrests happened after a certain date (I couldn’t tell when right away, since the date was ordinal). It didn’t make sense that no one had been arrested before that, but I dealt with it by subsetting the data to just the date range that would include the arrests. This actually removed most of the observations where Subject ID Bin was Unidentified.
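
The subsetting amounted to something like this, using the earliest arrest as the cutoff (a guess at the exact logic, since I only knew the range from the ordinal values in the plot):

# Sketch of the date subsetting: keep only the range that contains arrests
cutoff = df.loc[df['Arrest Flag'] == 'Y', 'Date Ordinal'].min()
df_recent = df[df['Date Ordinal'] >= cutoff].copy()

# Most of the 'Unidentified' observations drop out with the early records
print(df_recent['Subject ID Bin'].value_counts())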

I was starting to see the pattern in the bigger picture here, and I could have saved myself a lot of time if I had started by looking at the data at a higher level. By subsetting the data on Date, I had also gotten rid of all of the Stop Resolution = ‘Arrest’ records. Arrest Flag was not a variable that had always been captured in the reporting, just like Subject Age Group and probably Subject ID. The data had gotten convoluted over time due to changes in how it was being reported. I had also noticed that Weapon Type had many different possible entries with the same meaning, so there had really been no consistency in how it was reported; apparently officers were not required to choose from a fixed list of possibilities.

More Data Preparation

I needed to continue with the modeling since I was running out of time on the project. I planned to run a logistic regression and some KNN models, as well as some boosting and decision trees. With all the categorical variables, I would need to create dummy variables (or one-hot encode them), and then I would also be scaling and running SMOTE to address the class imbalance.

First I ran the one-hot encoding and discovered that I now had over 4000 independent variables. I had left the date in, and each unique date had been given its own column.
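
The fix was just to keep the date out of the dummy columns, roughly like this (column names assumed):

# Sketch of the corrected encoding: dummy only the categoricals and
# carry the ordinal date through as a single numeric column
dummies = pd.get_dummies(
    df_recent.drop(columns=['Arrest Flag', 'Reported Date', 'Date Ordinal']))
X_encoded = pd.concat([dummies, df_recent['Date Ordinal']], axis=1)
print(X_encoded.shape)  # far fewer columns once each date isn't its own dummy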

After fixing the OHE, I ran SMOTE to generate some synthetic data, but then noticed that all of my categoricals, which used to be nice binary 0 or 1 values, suddenly had values between 0 and 1. That is how SMOTE works, I guess, by generating data in between the actual data points, but I didn’t think this was going to work for any modeling.

So I reversed the order and did the SMOTE first and then the OHE. The next step was to scale the continuous variables, just Date in my case, for the models that required it. I did originally run the scaler on all of the data, because the lesson I had on scaling said it was okay to do that as long as the categorical variables still came out binary, even though they were no longer 0 and 1. But after dealing with the SMOTE, I really just wanted to keep the 0’s and 1’s so I could see if anything was going wrong, so I had to rerun the scaler.

Now I felt my data was ready for some modeling, until I saw in one of my lessons that you are supposed to fit the models on the SMOTE data but not actually predict on that data, since it isn’t real. At this point everything I had was SMOTEd, then one-hot encoded, and then scaled. So to get data to predict on, I had to go back to the data I had before the SMOTE and still run all of it through the OHE and the scaler. Keeping track of all of the different dataframe names was getting to be a nightmare.
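
Putting that all together, the order I was aiming for looks roughly like this. All names are illustrative and the details are a sketch rather than my exact notebook:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTENC

# Split first, and resample only the training portion
features = df_recent.drop(columns=['Arrest Flag', 'Reported Date'])
target = (df_recent['Arrest Flag'] == 'Y').astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    features, target, stratify=target, random_state=42)

# SMOTENC wants the categorical columns by position; here everything
# except the ordinal date is treated as categorical
cat_idx = [i for i, c in enumerate(features.columns) if c != 'Date Ordinal']
X_res, y_res = SMOTENC(categorical_features=cat_idx,
                       random_state=42).fit_resample(X_train, y_train)

# One-hot encode after resampling so the dummies stay 0/1, and align
# the test columns to the training columns
X_res = pd.get_dummies(X_res)
X_test = pd.get_dummies(X_test).reindex(columns=X_res.columns, fill_value=0)

# Scale just the continuous date column, fitting the scaler on train only
scaler = StandardScaler().fit(X_res[['Date Ordinal']])
X_res['Date Ordinal'] = scaler.transform(X_res[['Date Ordinal']]).ravel()
X_test['Date Ordinal'] = scaler.transform(X_test[['Date Ordinal']]).ravel()

# Fit on the synthetic data, but score against the real, untouched test set
model = LogisticRegression(max_iter=1000).fit(X_res, y_res)
print(model.score(X_test, y_test))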

So the process wasn’t smooth; there was a lot of going backwards to fix things. And I didn’t really have confidence in the data I was working with, because of all of the categories with ‘Unknown’ values. But I was out of time, so I ran the models. Some came out better than others, XGBoost in particular. I ran it through a grid search a few different times, trying out different parameters to find the best ones. But the results came out very poor, misclassifying more arrests than it actually got right.
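
For reference, the grid search had roughly this shape (the grid shown is an illustrative example, not the exact parameters I tried):

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Illustrative grid, not the parameters from the actual project
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.05, 0.1, 0.3],
    'n_estimators': [100, 300],
}

grid = GridSearchCV(XGBClassifier(eval_metric='logloss', random_state=42),
                    param_grid, scoring='f1', cv=3)
grid.fit(X_res, y_res)                 # fit on the resampled training data
print(grid.best_params_)
print(grid.best_estimator_.score(X_test, y_test))  # scored on real data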

When I selected this project I had hoped to get some great and interesting results that I could relate to the current climate of social justice. Although that didn’t really happen, I think I can safely say that I learned a lot about Data Science this week. Not the stuff they teach you in the classes, but the useful real world lessons that I can carry with me and a lot of mistakes that I won’t make again.


Melody Peterson

Data Science Student, Stay-at-Home Mom, Former Management Consultant