Recently I’ve heard a few claims about when batters should swing. Many hitters take the first pitch of an at-bat to get a feel for the pitcher, and the conventional wisdom is that hitters should take on 3-0.
I want to look at the rate at which batters swing based on the count.
library(magrittr)
library(ggplot2)
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Load the data as d.
d <- readRDS("MLB2018.rds")
We can use the type column to see whether each pitch was a ball (B), a strike (S), or put into play (X).
d$type %>% table
## .
## B S X
## 261526 333381 126283
To see which descriptions fall under each type:
d$description[d$type=="X"] %>% table
## .
## hit_into_play hit_into_play_no_out hit_into_play_score
## 81802 28514 15967
d$description[d$type=="S"] %>% table
## .
## called_strike foul foul_bunt
## 121518 125753 1935
## foul_pitchout foul_tip hit_into_play_no_out
## 1 6160 12
## missed_bunt swinging_pitchout swinging_strike
## 402 1 71839
## swinging_strike_blocked
## 5760
d$description[d$type=="B"] %>% table
## .
## ball blocked_ball hit_by_pitch pitchout
## 242587 16931 1922 86
Now we will add a column for whether the batter swung or not.
They did not swing if the pitch was a ball (B) or if it was an S with description called_strike; in every other case they swung.
d %<>% mutate(swung = !(type=="B" | description=="called_strike"))
table(d$swung, d$description)
##
##           ball blocked_ball called_strike   foul foul_bunt foul_pitchout
##   FALSE 242587        16931        121518      0         0             0
##   TRUE       0            0             0 125753      1935             1
##
##         foul_tip hit_by_pitch hit_into_play hit_into_play_no_out
##   FALSE        0         1922             0                    0
##   TRUE      6160            0         81802                28526
##
##         hit_into_play_score missed_bunt pitchout swinging_pitchout
##   FALSE                   0           0       86                 0
##   TRUE                15967         402        0                 1
##
##         swinging_strike swinging_strike_blocked
##   FALSE               0                       0
##   TRUE            71839                    5760
I’ll remove hit by pitch, but everything else looks okay.
d %<>% filter(description!= "hit_by_pitch")
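As a quick sanity check on the swing rule, here is a minimal sketch on a tiny made-up data frame. The rows in `toy` are invented for illustration and are not taken from MLB2018.rds; only the `mutate` expression mirrors the one used above. The first two rows (a ball and a called strike) should come out as non-swings and the remaining three as swings.

```r
library(dplyr)

# Made-up pitches for illustration only; not taken from MLB2018.rds
toy <- data.frame(
  type = c("B", "S", "S", "S", "X"),
  description = c("ball", "called_strike", "foul",
                  "swinging_strike", "hit_into_play"),
  stringsAsFactors = FALSE
)

# Same rule as above: no swing on a ball or a called strike
toy <- toy %>%
  mutate(swung = !(type == "B" | description == "called_strike"))

toy$swung
```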
Now we will add a variable for the count.
d %<>% mutate(count=paste(balls, strikes, sep='-'))
table(d$count)
##
## 0-0 0-1 0-2 1-0 1-1 1-2 2-0 2-1 2-2 3-0
## 184418 92088 46899 71985 73837 68847 24517 38495 59116 7552
## 3-1 3-2 4-2
## 16042 35464 8
I’ll remove the cases with 4 balls; I’m not sure where they come from.
d %<>% filter(balls != 4)
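The two steps above (building the count label and dropping the 4-ball rows) can be sketched on made-up values. The `toy` data frame here is hypothetical and only stands in for the `balls` and `strikes` columns of the real data.

```r
library(dplyr)

# Made-up ball/strike values for illustration only; not from MLB2018.rds
toy <- data.frame(balls = c(0, 3, 1, 4), strikes = c(0, 2, 2, 2))

toy <- toy %>%
  mutate(count = paste(balls, strikes, sep = "-")) %>%  # label like "3-2"
  filter(balls != 4)                                     # drop the 4-ball rows

toy$count
```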
Now we can group by count and look at swing percentage.
d %>% group_by(count) %>% summarize(swingpct = sum(swung) / n())
## # A tibble: 12 x 2
## count swingpct
## <chr> <dbl>
## 1 0-0 0.290
## 2 0-1 0.475
## 3 0-2 0.514
## 4 1-0 0.422
## 5 1-1 0.536
## 6 1-2 0.578
## 7 2-0 0.432
## 8 2-1 0.585
## 9 2-2 0.651
## 10 3-0 0.108
## 11 3-1 0.559
## 12 3-2 0.724
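Since ggplot2 is loaded above but never used, here is one possible way to plot these rates. This is only a sketch: the `swing` data frame is typed in by hand from the table above rather than computed from `d`, and the bar chart is my choice of display, not something the analysis depends on.

```r
library(ggplot2)

# Swing rates copied by hand from the summary table above
swing <- data.frame(
  count = c("0-0", "0-1", "0-2", "1-0", "1-1", "1-2",
            "2-0", "2-1", "2-2", "3-0", "3-1", "3-2"),
  swingpct = c(0.290, 0.475, 0.514, 0.422, 0.536, 0.578,
               0.432, 0.585, 0.651, 0.108, 0.559, 0.724)
)

# One bar per count, with bar height equal to the swing rate
p <- ggplot(swing, aes(x = count, y = swingpct)) +
  geom_col() +
  labs(x = "Count (balls-strikes)", y = "Swing rate")

print(p)
```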
As expected, the swing rate is lowest on 3-0, followed by 0-0. It is highest on 3-2 by a wide margin, but I don’t have any good intuition for why this should be higher than on 2-2, 1-2, or 0-2.
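To put a rough number on the gap between 3-2 and 2-2 relative to sampling noise, here is a back-of-the-envelope sketch using the swing rates and pitch counts from the tables above. It treats every pitch as an independent binomial trial, which is an assumption that does not really hold (the same batters and pitchers appear many times), so read the result as a rough guide only.

```r
# Swing rates and pitch counts copied from the tables above
p_32 <- 0.724; n_32 <- 35464
p_22 <- 0.651; n_22 <- 59116

# Binomial standard error of each rate, assuming independent pitches
se_32 <- sqrt(p_32 * (1 - p_32) / n_32)
se_22 <- sqrt(p_22 * (1 - p_22) / n_22)

# Standard error of the difference between the two rates
se_diff <- sqrt(se_32^2 + se_22^2)

# Gap between 3-2 and 2-2, measured in standard errors
z <- (p_32 - p_22) / se_diff
z
```

Under that independence assumption the gap is more than twenty standard errors wide, so it is unlikely to be sampling noise alone.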