Data Science and PAW Patrol

Step 0. The disclaimers

  1. I don't call myself a data scientist, but other people do, and I wanted to share my approach to data science. This approach has worked well for me professionally. There are many good approaches, and I don't want to gatekeep any of those.

  2. PAW Patrol is a registered trademark of someone else. I don't own any rights to PAW Patrol. If I did, I would be doing something better with my free time.

Step 1. Start with the science

Thinking like a scientist means having a question, and then searching out an answer. Science isn't data-driven, in that the data can provide an answer, but the available data shouldn't suggest the question. At least in most of the cases I can think of.

This is a huge challenge to productionalize data science in a business environment. With a recent (unnamed) employer, we had a big data-engineering effort to rebuild an analysis pipeline. As an analyst, it became slowly clear that with the new pipeline, we could only recreate 5 of the 6 key metrics. I left the company before the proposed workaround was implemented.

For my example of methodology, I'll use PAW Patrol. It's a kids' animated TV series. To describe cast of characters, and the plot, here are the lyrics to the theme song:

PAW Patrol, PAW Patrol

We'll be there on the double

Whenever there's a problem

'Round Adventure Bay

Ryder and his team of pups

Will come and save the day

Marshall, Rubble, Chase

Rocky, Zuma, Skye

Yeah! They're on the way!

Yeah. It goes on. You get the point. But the show can be endearing, and I think the deeper reason it resonates with kids is because the team is useful and appreciated by the adults they help. But that's a topic for another article. [1]

If you take a vote in my household, a consensus of 50% of voters will declare PAW Patrol to be the single best achievement of humanity. And it's been decided, by compromise, that if we ever get a pet dog, the name shall be either Zuma-Marshall, or Marshall-Zuma.

paw patrol team

The question

I was curious - with PAW Patrol hype slowly sweeping the planet since 2013, do we see an increase in dogs named after the crew?

Step 2. Curate the data

I need open data with pet names. Dog names. There are several, and I'll outline why they make a good or bad candidate for the analysis.

Hundenamen aus dem Hundebestand der Stadt Zürich

This one is from the city of Zürich, Switzerland, where I live. I've seen a recent Twitter post about this dataset, so that may have planted the idea that dog names can be open data.

Data goes back to 2015, and each year is one CSV file. To get an idea of the dataset size, I choose the complete year of 2019. 7647 records. It may be hard to find trends in so few dog registrations. Additionally, the Paw Patrol trend is slowly making it here to Switzerland. Since it started in North America, I'll go to look there.

Anchorage Dog Names over Time

Only 16k total names between 2017 and 2019. That's not enough dogs when there are so many possible names. And starting in 2017, I may not get a good before snapshot.

Seattle Pet Licenses

A list of active/current Seattle pet licenses, including animal type (species), pet's name, breed and the owner's ZIP code.

This might be a good dataset because records go back to 2000 and are updated through 2019. I can get snapshots before and during the PAW Patrol era. But I counted dogs registered in 2019 and it was 11k. In 2018, 7k. Still not enough.

NYC Dog Licensing Dataset

This could be it. Recently updated. 24.1 MB CSV file. 345k total rows going back more than 10 years. 79k dog registrations in 2019. Explore the data here.

The fine print:

Each record stands as a unique license period for the dog over the course of the yearlong time frame.

What does this mean for my data? It means that dog names are assigned at least once per year. If I count unique dog names over multiple years, I'll be over counting.

and

Each record represents a unique dog license that was active during the year, but not necessarily a unique record per dog, since a license that is renewed during the year results in a separate record of an active license period.

This means that dog-names within a given year may actually be duplicate as well. If this was a real project, in order to fully trust my data, I would first count how many names are repeated. To do this, because there is no column dog ID which would uniquely identify a dog, I would have to create a surrogate key based on the columns such as AnimalBirthMonth, AnimalGender and BreedName, and perhaps also the geographical data Borough and ZipCode.

If I cared more about PAW Patrol pet-naming theory, I could build a unified dog-name data-model and combine datasets, and keep looking for new sources to add. Maybe even make some FOIA requests. But I don't care that much. On to the answer. Let's hope NYC is a trend-setter -- when it comes to naming pets after kids' cartoon superheros.

Step 3. Explore the data

Lots of tools but because I know that I won't productionalize this pipeline, I'm going to dump the data into the commercial BI tool Tableau Desktop. Licenses are expensive, and there are many solid open source tools that will do the job. But the software allows me to do fast visual anaylsis. If you sign up for the free audit of this Coursera, you'll see that after clicking through the 3rd week or so, you are provided a 6-month software license to complete the rest of the course. YMMV.

Visual Analysis

In this step, I'm trying to get a feel for the data. Is it clean, is it reliable? Will it answer my question?

What are the 5 most common names for dogs in NYC?

Name Frequency
UNKNOWN 5379
BELLA 3824
NAME NOT PROVIDED 3763
MAX 3582
CHARLIE 2852
COCO 2636

Woof. That's a lot of NULL. Let's see how it's evenly distributed throughout the years

AnimalName 2014 2015 2016 2017 2018
UNKNOWN 3 1179 1903 1294 1000
BELLA 19 427 1332 1233 813
NAME NOT PROVIDED 11 963 1374 847 568
MAX 34 444 1214 1166 724
CHARLIE 26 293 1027 935 571
COCO 22 296 941 817 560

OK, better. It seems pretty flat, except for 2014. Let's do a quick check of total names per year, to see that indeed 2014 has many fewer.

Year of LicenseIssuedDate Frequency
2014 2650
2015 42439
2016 119080
2017 110995
2018 70563

Bummer that 2019 registrations aren't there yet. I hope NYC is trend-setting enough to show a signal in 2018.

Another learning is that we should exclude 2014 from the calculation -- there are just not enough records, especially in comparison with other years.

Initial Calculation

I want to start to measure trends, and I'll do so visually, since I need to only compare a handful of names.

Marshall, Rubble, Chase, Rocky, Zuma and Skye. Let's add Ryder, even though he is a human in the cartoon. And the pups added in later seasons: Everest and Tracker.

AnimalName 2014 2015 2016 2017 2018
CHASE 7 49 115 170 86
EVEREST 1 9 5 3
MARSHALL 7 26 19 11
ROCKY 20 300 823 785 486
RUBBLE 6 2
RYDER 13 28 26 18
SKYE 10 61 54 36
TRACKER 1
ZUMA 2 2 2 2

A couple things to notice. There aren't enough Trackers, Rubbles or Zumas to use in any analysis. And Everest probably doesn't have enough data to compare, either, especially since she was a late addition to the pack.

And because the yearly totals of registrations varies so greatly, I need to normalize my data. This means that I take the count of each unique name and divide by the total dogs registered in that year. Noticing that Rocky is the most popular from my list, and that in 2016 there were 823 Rockies registered, from a total of 119,080 registrations, I want to normalize for the prevalence of each name to a fixed integer, so that I can compare years. In effect, how popular a certain name was in that year. The metric is then be frequency of dog-name per 10000 dogs.

Closing the calculation

The math doesn't get too complicated, because the data doesn't require anything else. I'm going to use my normalized frequency metric to compare trends for Chase, Marshall, Rocky, Ryder, Rocky and Skye.

I've removed 2016 and 2017 now too, to try to more clearly show before and after. It's only Skye that shows any trend -- effectively doubling in popularity.

paw patrol results

Step 4. Shutting it down

I can't believe I've spent this much time thinking about PAW Patrol. What I want to stress is that when you are working with data, for me it has been essential to

  1. Understand the subject matter. In this case, that means passively absorbing all things PAW Patrol.

  2. Get dirty with the data. If you try to plug an algorithm on top of data, it's unlikely to give any meaningful results. Can you imagine creating a fancy data pipeline that spits out the 3 most popular dog names? UNKNOWN, BELLA and NAME NOT PROVIDED.

In case you haven't heard it already, here's their theme song. I've been singing it to myself as I write this, and I hope it gets stuck in your head, too.

🐶🦴

And remember--Whenever there’s trouble, just yelp for help!


Got a comment? Share your thoughts on Twitter or HackerNews.


Footnote

[1] For a less fun look at PAW Patrol's effect on society, take a look at this recent academic paper:

“Whenever there’s trouble, just yelp for help”: Crime, conservation, and corporatization in Paw Patrol

I argue that the series suggests to audiences that we can and should rely on corporations and technological advancements to combat crime and conserve, with responsibilized individuals assisting in this endeavor.

Ultimately, PAW Patrol echoes core tenets of neoliberalism and encourages complicity in a global capitalist system that (re)produces inequalities and causes environmental harms.