Project Details

Hello there! Below are the projects I am working on or have completed to gain some experience with SQL and Python. These projects mainly focus on Data Analytics, Data Cleaning, Visualizations and


Umpire Accuracy Exploration

As an avid baseball fan (Go Dodgers!), I wanted to explore some statistical data for umpires behind the plate. I used UmpScoreCard data from Kaggle and dug into it to see whether any recent MLB rule changes have affected the overall performance of umpire calls. It was a great way to improve my querying skills while reviewing some baseball data.

High and Low Accuracy Games
  • 1 perfect game with 100% accuracy
  • 32 total games at or above 99% accuracy (19 in 2022 and 14 for all years before)
  • 183 total games at or below 85% accuracy (the large majority are before 2020)

Yearly Metrics
  • Starting in 2019, overall accuracy begins to increase and incorrect calls begin to drop.
  • The 2022 playoff statistics were the best of the data points captured.
  • The 2017 season had a sudden increase in accuracy and drop in incorrect calls compared to the 2 years before and after.

Does Home Field also Impact Umpires (Theory)
  • Across the full data set, home teams won 9753 games and away teams won 8459.
  • In general, the away team averages a higher number of both incorrect calls and correct calls. Accuracy remains the same.
  • In games with a 1-run difference, the away team received on average 12 incorrect calls to 11 for the home team. With the average incorrect calls in mind, in 2022 the away team won 71 games and the home team won 95.
  • In games with a 1-run difference, the away team received on average 146.16 correct calls to 141.65 for the home team. With the average correct calls in mind, in 2022 the away team won 137 games and the home team won 190.
  • Games in which both teams scored more than 10 runs told the same story, with the away team receiving on average 14.27 incorrect calls to 13.56 for the home team.
  • For high-scoring games in 2022, there was no difference when using the average incorrect calls. Using the average correct calls, the home team won 4 games to 0.
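
The 1-run-game comparison above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the queries used in the project: the table name `games`, every column name, and the sample rows are assumptions made up for the example, and the real UmpScoreCard columns may differ.

```python
import sqlite3

# Hypothetical schema and made-up rows; the real UmpScoreCard columns may differ.
SAMPLE_GAMES = [
    # (game_id, home_runs, away_runs, home_incorrect, away_incorrect)
    (1, 3, 2, 10, 12),
    (2, 4, 5, 12, 14),
    (3, 7, 1, 9, 9),
    (4, 2, 1, 11, 10),
]


def one_run_game_summary(rows):
    """Average incorrect calls per side and win counts for 1-run games."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE games (game_id INTEGER, home_runs INTEGER, away_runs INTEGER,"
        " home_incorrect INTEGER, away_incorrect INTEGER)"
    )
    conn.executemany("INSERT INTO games VALUES (?, ?, ?, ?, ?)", rows)
    query = """
        SELECT AVG(home_incorrect),
               AVG(away_incorrect),
               SUM(CASE WHEN home_runs > away_runs THEN 1 ELSE 0 END),
               SUM(CASE WHEN away_runs > home_runs THEN 1 ELSE 0 END)
        FROM games
        WHERE ABS(home_runs - away_runs) = 1
    """
    result = conn.execute(query).fetchone()
    conn.close()
    return result


summary = one_run_game_summary(SAMPLE_GAMES)
print(summary)
```

Keeping the filter in the `WHERE` clause makes it easy to swap in another condition, such as both teams scoring more than 10 runs, without touching the aggregates.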

Conclusion

  • The recent crackdown on illegal substances that pitchers use in games also caused their overall spin rates to drop. Most major league pitchers use a mix of fastballs, breaking balls, and changeups that all look different to batters. The rotation of the baseball also seems to affect how well umpires can track the ball into the strike zone.

  • As MLB looks to incorporate AI systems to improve the accuracy of ball and strike calls, we could also see the catcher position change. Catchers play an important role in "framing runs," in which their glove placement during the catch earns them a call on a pitch outside of the strike zone. A system that calls balls and strikes automatically might also move catchers back to a more traditional on-base position.

  • It is hard to say whether the average number of incorrect and correct calls correlates with teams winning, as most teams experience an advantage at their home stadium. We do see a slight difference in how umpires call incorrect and correct balls during a game. Incorrect and correct calls do affect how teams use their bullpen: most teams set a limit on how many pitches their starting pitcher throws and how many pitches each bullpen player throws. The slight difference could give teams an advantage by letting them pull their starter to prevent injuries. Teams can then take advantage of the new pitcher coming in to save runs or get crucial outs.

Coffee Review Project

    As an avid coffee drinker, I was excited to see a data set on Kaggle with reviews associated with coffee types and brands. I noticed the data set was scraped from the Coffee Review site and started reviewing the details. I learned about specific coffee characteristics and what determines the roast level of a coffee. I also read up on how each coffee is rated and the scales being used. I thought I knew my fair share, but after this I realized I knew very little about coffee. I wanted to look into a topic I already enjoyed and see what I could add to the data set.

    The majority of the time went to cleaning the data and updating details, such as nulls, formatting, and missing data, to make the data more manageable. After completing the cleanup, I created a view to start looking into my next batch of coffee to try! I think I might go with Bird Rock Coffee Roasters.

    Changes that were made in the data set

    • Converted review_date to a date to remove the 00:00:00 time component, leaving the standard mm/dd/yyyy format.
    • Standardized NA values. Found 'NA', 'Na/', '/', and 'NA/NA', and used NA to declare that no value was given.
    • Added a Coffee Type based on the url and description, as the data included variations such as espresso, bottled, whole bean, and cold brew selections.
    • Updated null values in the coffee characteristics. The scale used in the reviews runs from 1 to 10, so 0 indicates that no value was given.
    • Moved the Slug column to the end of the sheet and renamed it review_url so the data is easier to read.
    • Created a view to use in an Excel sheet.
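
Several of the cleanup steps listed above can be sketched in plain Python. This is only an illustration under assumptions: the function name `clean_review`, the score field names, and the sample row are hypothetical, and the raw date is assumed to look like `11/01/2022 00:00:00`. The Coffee Type step is left out because its rules are not spelled out here.

```python
from datetime import datetime

# NA spellings listed above; all are collapsed to a single "NA" marker.
NA_VARIANTS = {"NA", "Na/", "/", "NA/NA"}
# Hypothetical names for the 1-to-10 characteristic columns.
SCORE_FIELDS = ("aroma", "acidity", "body")


def clean_review(row):
    """Return a cleaned copy of one review row (a dict of raw values)."""
    cleaned = dict(row)
    # Drop the 00:00:00 time part, keeping mm/dd/yyyy (assumed input format).
    parsed = datetime.strptime(row["review_date"], "%m/%d/%Y %H:%M:%S")
    cleaned["review_date"] = parsed.strftime("%m/%d/%Y")
    # Collapse every NA spelling into the same marker.
    for key, value in row.items():
        if isinstance(value, str) and value.strip() in NA_VARIANTS:
            cleaned[key] = "NA"
    # Scores run 1 to 10, so 0 marks "no value given".
    for field in SCORE_FIELDS:
        if cleaned.get(field) is None:
            cleaned[field] = 0
    # Move slug to the end of the row under the name review_url.
    cleaned["review_url"] = cleaned.pop("slug")
    return cleaned


# Made-up sample row for illustration.
sample = {
    "slug": "/review/example-coffee",
    "review_date": "11/01/2022 00:00:00",
    "origin": "Na/",
    "aroma": 9,
    "acidity": None,
    "body": 8,
}
result = clean_review(sample)
print(result)
```

Returning a copy rather than editing the row in place keeps the raw values available for comparison while the cleanup rules are still being adjusted.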


    California National Park Forecast

    The initial question that started this project was, "Can I attempt to forecast how busy a national park might be next year?" I wanted to see if we could forecast visitation to national parks in California to get somewhat of an idea of how busy each month might be.

    I purchased my first America the Beautiful National Park card this year and enjoyed every moment I had with it. I live in San Diego, and we have a national monument close by; we also visited Joshua Tree and Yosemite twice this year. So the question began: can I look at past yearly visits, forecast what we might expect next year, and put this in a dashboard?

    I started on the NPS Stats site, which had a lot of information available to download, and I used a large amount of it to start the project. I compiled all the data into a more usable Excel sheet and started working on a visualization. The data sets I used and compiled can be found here: Annual Park Rankings & Current Year Monthly and Annual Summary Report.

    The dashboard and findings can be found here: California Dashboard. My future plan for this project is to use the data collected to also forecast all national parks in the US. I will also be updating the project as more data becomes available and checking whether the predictions were close to the actual numbers.
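
One simple way to approach the forecasting question, and to check predictions against actual numbers later, is a seasonal average: forecast each month as the mean of that same month in prior years. The sketch below shows that baseline and is not necessarily the method behind the dashboard; the function names and the visit counts are made up for illustration.

```python
from statistics import mean


def seasonal_average_forecast(history):
    """Forecast each month as the mean of that month across prior years.

    history maps year -> list of 12 monthly visit counts.
    """
    years = sorted(history)
    return [mean(history[year][month] for year in years) for month in range(12)]


def percent_error(forecast, actual):
    """Absolute percent error of one forecast value against the actual value."""
    return abs(forecast - actual) * 100 / actual


# Made-up visit counts for a hypothetical park (not NPS data).
history = {
    2021: [100, 120, 150, 200, 250, 300, 320, 310, 240, 180, 130, 110],
    2022: [120, 140, 170, 220, 270, 340, 360, 330, 260, 200, 150, 130],
}
forecast = seasonal_average_forecast(history)

# Compare the January forecast with a made-up "actual" January count.
print(forecast[0], percent_error(forecast[0], 125))
```

A baseline like this is useful mainly as a yardstick: any more elaborate forecast should beat its percent error before it earns a place in the dashboard.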


    Movie Correlation

    For this project I wanted to explore some data visualization with Python, using the Kaggle Movie Industry data set. I also wanted to start learning Python, so in this project we set up Jupyter Notebooks in VS Code, used a few libraries, and started working on the data. This was a great introduction to the language and a way to see how powerful the tool is for future projects. I hope to start incorporating Python to make some simple visuals and find correlations in the data.
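
The kind of correlation this project looks for can be illustrated with a Pearson coefficient computed by hand. The sketch below uses only the standard library rather than the notebook's libraries, and the `budget` and `gross` values are made-up numbers for the example, not rows from the Kaggle data set.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (covariance before dividing by n).
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Root of the summed squared deviations for each variable.
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up budget and gross figures (in millions) for illustration only.
budget = [10, 20, 30, 40, 50]
gross = [25, 45, 70, 85, 120]
r = pearson(budget, gross)
print(round(r, 3))
```

A value near 1 means the two columns rise together, a value near -1 means one falls as the other rises, and a value near 0 means no linear relationship.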


    Data Cleaning

    In this project we explore a housing data set with a focus on cleaning the data. The data has some formatting and data issues, so we attempted to clean it to make it more manageable to use. We make updates to the table, parse some details like the address, and replace null values with already existing data points within the set. The project followed a guide and was a learning opportunity to see what to expect when dealing with data cleanup.
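
Two of the cleanup steps described above, parsing an address and filling nulls from rows that already hold the value, can be sketched as follows. The field names (`parcel_id`, `address`) and the `street, city` address layout are assumptions for illustration, not details taken from the data set.

```python
def split_address(address):
    """Split 'street, city' into its two parts, trimming whitespace."""
    street, _, city = address.partition(",")
    return street.strip(), city.strip()


def fill_missing_addresses(rows):
    """Fill a null address from another row that shares the same parcel_id."""
    # First pass: remember the first known address for each parcel.
    known = {}
    for row in rows:
        if row["address"] is not None:
            known.setdefault(row["parcel_id"], row["address"])
    # Second pass: build new rows, filling gaps where a match exists.
    filled = []
    for row in rows:
        address = row["address"]
        if address is None:
            # Stays None when no other row has an address for this parcel.
            address = known.get(row["parcel_id"])
        filled.append({**row, "address": address})
    return filled


# Made-up rows; the second and third are missing their address.
rows = [
    {"parcel_id": "A1", "address": "123 Main St, Springfield"},
    {"parcel_id": "A1", "address": None},
    {"parcel_id": "B2", "address": None},
]
cleaned = fill_missing_addresses(rows)
print(cleaned[1]["address"], split_address(cleaned[0]["address"]))
```

Rows with no matching parcel keep their null, which makes the remaining gaps easy to count after the fill.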