UtilitiesIntroduction

When it comes to data science, tools matter. Some workflows facilitate efficiency and insight, while others can leave you spending most of your time putting out fires. To borrow a familiar example from document editing, it might take 15 minutes to go through a report and capitalize every instance of a particular word, but your editor's find-and-replace feature can do the job with no errors and in less than seconds. Merely being aware of the find-and-replace concept leads to significant time savings, because you can look up how to do it if you don't remember.

In a similar way, taking advantage of the collective wisdom of the statistical and software development communities is a major productivity multiplier. Learning a reasonably complete set of tools and techniques up front spares you the inefficiency of trying out lots of possibilities and inevitably developing some habits along the way.

The set of programs and formats we will cover in the course aspires to be as close as possible to a canonical open source data science toolkit. In particular, all of the tools are widely used in industry or academia and have large user bases.

On the other hand, some toolkit roles are filled by more than one popular program, so assembling a complete software suite does require making choices. You should feel free to substitute other tools when they meet the same needs and have comparable benefits to the ones we will discuss in this course. On the other hand, don't be too reluctant to appreciate the benefits of switching to something new. You can be surprisingly productive surprisingly quickly with a well-designed interface.

Goals

Learning from the principles of best practice offers several advantages to the data science practitioner:

Efficiency. It's preferable avoid taking far longer than necessary to perform common, often mundane tasks.

Correctness. Getting incorrect results is harmful and potentially quite dangerous. Building good habits for organizing your work and avoiding common pitfalls can help you consistently achieve correct results.

Reproducibility. A key component of transparency and confidence in your results is the ability for you and others to verify the analysis by re-running it. A workflow with even one non-reproducible step is not compatible with this goal, so it's important to prioritize reproducibility throughout the learning process.

Clarity. Workflows that incorporate opaque, ad-hoc elements or obscure the reasoning involved in each step make it more difficult to re-use your work, reproduce it, and place confidence in it. Best practices can help you highlight your reasoning and make your steps easily navigable.

Open Source Data Science

All of the software introduced in this course is free and open-source. This means that source code is available for anyone to inspect, alter, and extend. Using open-source software has many advantages for companies and individuals, even if they have access to commercial software.

Agility. If you need to change tools or try something out, you can just do it. There's no need to make a hasty decision just because a license renewal is coming up, or to negotiate with a representative from the software provider about something novel you want to do.

Community. Open source development has become popular enough that the scrutiny on a given piece of code is often larger for an open-source project than for a closed-source one. This has implications for code quality, and it makes it easier to search the internet for solutions and ideas. Similarly, the number of third-party packages available for open source software is typically orders of magnitude larger than for proprietary software. This can make it easier to customize a solution for a particular set of needs.

Integration. Because open-source projects are a joint effort of the global scientific and development communities, significant effort has gone into making them work with one another. This often allows the user to use to choose the best tool for each aspect of the job at hand, transitioning between tools as necessary.

Accessibility. If you want to make your work available to others, you can take advantage of services like CoCalc or Binder or ask that people download the necessary software to their machines. If your work requires an expensive license to reproduce, your target audience is less likely to engage.

Because of their advantages as open-source programming languages with large and committed user bases, Python and R dominate data science in industry (although several proprietary systems also enjoy widespread usage). Many of the other tools we will discuss in this course are the de facto standard tool for their use case and have no real competition from commercial offerings.

Exercise
Select the true statements.

Python and R require expensive licenses to use.

Reproducibility refers to the ability to reliably get the same results for a given data analysis.

Learning appropriate software for solving challenges faced by data scientists can help save time in the long run.

All of the programs we will discuss in this course are used by all data scientists.

Изменить язык

Войдите в Mathigon

Поделиться

Сбросить прогресс

Глоссарий

UtilitiesIntroduction

Goals

Open Source Data Science