Data Science Specialisation

Yesterday the Johns Hopkins School of Public Health published a post about their Data Science Specialisation on the MOOC platform Coursera.

The post mentions the first batch of 266 students finishing the specialisation (among them, me :-) ). In total, more than 800,000 people have registered for one of the courses, of whom 14,000 finished at least one.

The Specialisation

The Data Science Specialisation consists of nine courses and a capstone project (which has been announced, but is yet to open for registration). The courses are:

  • The Data Scientist’s Toolbox
  • R Programming
  • Getting and Cleaning Data
  • Exploratory Data Analysis
  • Reproducible Research
  • Statistical Inference
  • Regression Models
  • Practical Machine Learning
  • Developing Data Products

The entire track seems to be built on the replicability approach from the biostatistics field (which is also very dominant at rOpenSci). Much of this involves the RStudio IDE. As I previously mentioned, most of this methodology can and should be applied in other sciences (where possible), such as economics.

Impressions

The first course will simply help you set up R, RStudio, Git, and GitHub. If, like me, you were already using these tools in your workflow, then I would recommend that you also register for R Programming.

R Programming is quite a bit more challenging, but also very rewarding, both in learning the core principles of R and in taking steps you might otherwise be inclined to avoid (e.g. writing your own functions).

Getting and Cleaning Data is also quite tough. I did not find this course that rewarding, since much of it focusses on Twitter and other APIs or SQL interfaces, which I have never encountered in my research so far.

Exploratory Data Analysis is a very useful course. It explains all the basic plotting systems: base, lattice, and ggplot2 (no ggvis yet, hopefully they will add it soon).

Reproducible Research I also found quite interesting. Much of it was familiar, but it was good to have a structured way to think about these topics.

Statistical Inference was a challenging course. I did my Masters in Econometrics, so I was not expecting it to be so difficult. In fact, it was the only course in which I had difficulty getting a distinction. The material covered is much more theoretical than that of the other courses, and the lectures are not always very relevant to the quizzes.

Regression Models I did not find very interesting, but for those who have not done this before I am sure it is a relevant course.

Practical Machine Learning was also quite a difficult course. The quizzes were not easy and took a long time to complete, and the assignment was also quite tough. Altogether I am happy I took it, because I had never really worked with such an approach, but it does seem relevant.

Developing Data Products was, in my opinion, the most fun course. Building interactive graphics using Shiny is something very new; it gives you a lot of freedom and the results are cool. This course needs to be updated to include recent RStudio packages such as rmarkdown v.2 (which should be in Reproducible Research also) and ggvis.

Resources

If you are interested in taking courses for the specialisation, then you should have a look at these resources, which might be helpful.

Certificate

Coursera offers the option of getting a verified certificate, which costs $49 per course. I have always paid this, which incentivised me to complete all courses with a distinction. In general, I think a small investment helps you be more committed.

First and foremost, the certificate is a good way of showing employers, etc. how you performed. This might not be accepted everywhere yet, but I think it will quickly become standard.