class: center, middle, inverse, title-slide .title[ # Welcome to DSC365: Introduction to Data Science! ] .author[ ### Alison Kleffner ] .date[ ### August 19, 2025 ] --- ## Homework - Software downloaded by beginning of class on **Thursday August 21** + Should hopefully have time at end of class --- ## Alison Kleffner, PhD - BS in Applied Mathematics and Economics at Rockhurst University - Master and PhD in Statistics at University of Nebraska at Lincoln - Thesis: *Visualization and Modeling of Multivariate Data in Environmental Applications* -- <br> <br> **Now let introduce yourselves! Get into groups of 2-3** - Name - Something your group has in common (besides things like major ...) --- class:inverse <br> <br> <br> <br> <br> <br> <br> <br> .center[ ## What is Data Science? ] --- ### Data Science Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge. - Data science is a vast field, and there’s no way you can master it all by taking a single course. But we are going to lay a solid foundation. .center[ <img src="../images/data-science.png" width="60%" /> ] --- ### Data Science .center[ <img src="../images/data-flowchart.png" width="1477" /> ] --- class:inverse <br> <br> <br> <br> <br> <br> <br> <br> .center[ ## Syllabus and Course Specifics ] --- ## What will be in this course? - This is a programming course using *R*, the most commonly used *statistical* language - We will briefly cover each part of the Data Science Cycle - In this course you will complete in-class labs, projects, and presentations. - Since this course involves using *R*, please bring your laptops to class with you --- ## Syllabus --- class:inverse <br> <br> <br> <br> <br> <br> <br> <br> .center[ ## Software Set-Up ] --- ### How to Learn a Programming Language If you have never used *R* (or even heard of it) before, it is totally OK! We will be starting with the basics. .pull-left[ - Breathe - Mistakes are ok! - Ask for help, Google is your friend! - Translate code into plain English - Don't reinvent the wheel - Walk away ].pull-right[ <img src="../images/begin-R.png" width="1565" /> .center[<font size = "0.75">Artwork by Allison Horst</font>] ] --- ### What is R? *R* is a free software environment and programming language for statistical computing and graphics - Open-source - **Interpreted language**: we interact with R through a “command line” interpreter, which translates our “code” to machine code - CRAN (comprehensive R archive network) is R's central software repository. Contains contributed packages - New major release once a year <img src="../images/base-r.png" width="60%" style="display: block; margin: auto;" /> --- ### What is R Studio? Every R installation comes with the *R* Console, so we don’t actually need an additional program to interface with R. BUT most of people who uses *R* ALSO uses RStudio to interact with it. - “Integrated Development Environment” - Clean user interface <img src="../images/rstudio.png" width="90%" style="display: block; margin: auto;" /> --- ### Implication of Open-Source Software Because *R* is open source, users can contribute *R* packages to add additional functions and capabilities (more than 19,000 as of March 2023). - A complete list can be found [here](https://cran.r-project.org/web/packages/available_packages_by_name.html) - Pro: New statistical/data science techniques are added to CRAN, Bioconductor (another package repository), GitHub, etc. daily - Con: No standard syntax! Not all are well documented --- ### Reproducible Research Excerpt from the Simply Statistics blog: “The Real Reason Reproducible Research is Important” [(Source)](https://simplystatistics.org/posts/2014-06-06-the-real-reason-reproducible-research-is-important/) - **Reproducible**: the original data (and original computer code) can be analyzed (by an independent investigator) to obtain the same results of the original study. - Reproducibility is important because it is the only thing that an investigator can guarantee about a study. - It does not necessarily ensure the results are correct, but does ensure transparency. More reasons to do your programming in a reproducible way: 1. **Time saved**: especially important in the professional world 2. **Time elapsed**: even the best programmers forget how their code runs eventually --- ### Reproducible Research with Quarto Quarto provides “an authoring framework for data science”. In an Quarto document, you can: - Save and execute code - Generate written reports that can be shared with someone else - Supported file formats: Word, PDF, HTML, slide shows, handouts, dashboards, Powerpoint <br> <br> **More on this next week** --- ### Let's Install it Together 1. Download and run the R installer for your operating system from CRAN - Windows: https://cran.rstudio.com/bin/windows/base/ - Mac: https://cran.rstudio.com/bin/macosx/ - Linux: https://cran.rstudio.com/bin/linux/ 2. Now download RStudio from the RStudio website – IDE of R - https://posit.co/download/rstudio-desktop/ 3. If you are using a Mac, you may need to download XQuartz. https://www.xquartz.org/ - If you run into issues, try downloading this to see if it helps. Go with all default options <br> <br> **Other Option**: If you would prefer to not download it to your computer: https://posit.cloud