R and Python are the two mainstream languages for data analysis, and there are endless debates about which one is better. But for those who have tried both, the conclusion is actually very simple: both have their strengths, and both have their weaknesses. Let me share my views from several angles.
According to my recent observations, Python is more widely promoted and accepted domestically, but this does not mean that R is bad. I personally think one of the most fundamental reasons is that Python is very easy to pick up. Anyone with a little programming background can be writing code on their own within a week. So for many data analysis novices, learning Python is undoubtedly the fastest way to get started. This reminds me of Python's original intent: “elegant”, “clear”, “simple” and so on (the Python design philosophy). Python was born to improve code readability and efficiency.
So what about R? R is different from Python. It was born with the mission of statistical analysis, graphics, and data mining, so it is no surprise that R absorbed a lot of statistical DNA as it was developed. To give an important example: in read.csv("example.csv", stringsAsFactors = FALSE), the stringsAsFactors parameter used to default to TRUE (the default changed to FALSE in R 4.0.0), which means that every string column in the file was converted to a factor. This design tripped me up several times when reading data. I could not understand such a setting and was determined to take it up with the R maintainers. But after looking into it repeatedly, I finally understood the original intention. R was widely used by statisticians from the beginning, and in many statistical formulas strings are treated directly as factors. Many modelling functions, such as lm() and glm(), work with factors, so it is easier to read strings in as factors from the start (detailed explanation of stringsAsFactors). It is precisely because of design choices like this that beginners without a statistical background find it hard to form a good impression of R, and conclude that R is far inferior to Python.
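For readers coming from Python, here is a minimal sketch of what "converting strings to a factor" means conceptually. It is not how R implements factors internally; the helper name to_factor and the sample data are hypothetical, and the sketch only assumes that levels are the sorted distinct values and that codes start at 1, as they do in R.

```python
def to_factor(values):
    """Encode a list of strings as integer codes plus a list of levels.

    Assumption: levels are the sorted distinct values and codes are
    1-based, which mirrors how a factor is usually described in R.
    """
    levels = sorted(set(values))
    index = {level: i + 1 for i, level in enumerate(levels)}
    codes = [index[v] for v in values]
    return codes, levels


# Hypothetical string column, as it might be read from a CSV file.
colors = ["red", "blue", "red", "green"]
codes, levels = to_factor(colors)

print(levels)  # ['blue', 'green', 'red']
print(codes)   # [3, 1, 3, 2]
```

A modelling function can then work with the small set of levels instead of arbitrary text, which is roughly why the old default suited statisticians.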
It is precisely these different development goals that led to the different fan bases of Python and R.
For IT developers with programming experience, there are many occasions when they just need to do some simple data processing. At such times Python, being the fastest and easiest option, becomes the first choice. After doing a few Python projects, my friend and I came to appreciate this too. Starting from scratch, you can basically complete a small data-cleaning project in one afternoon. A PhD friend of mine who had never touched Python and had always used C also completed, in three days, a small Hadoop project in Python assigned by his instructor.
R's users match the purpose it was developed for. They are mainly statisticians and academic researchers, who study data at a fundamental level and have a very deep understanding of statistics. R also makes it more convenient for them to use various models. While learning R, I was very impressed by one sentence: “The closer you are to statistics, research and data science, the more you might prefer R.”
My bachelor's degree is in software engineering. I studied C and Java superficially and have some knowledge of programming theory. When I started studying data analysis at the graduate level, I chose Python to make use of my programming ability. It is no exaggeration to say that Python helped me solve most of my data homework, and I never thought about learning R.
I only started learning R because the instructor of Geo Spatial Analytics for Business Intelligence required it. This is a data analysis course based on geographic information. Throughout the course, I was amazed by how concise the code was and how rich the packages were when modelling various kinds of geographic shape data. Over the past few months I have gone back to learn R from the basics, and I am more and more impressed by how much R matters for statistics. Whatever model you need, you can almost always find a corresponding package in R. Learning R is also a process of continuously consolidating statistical knowledge, and statistics is the basis of the various models used in data analysis.
After wandering between Python and R, I finally figured out a methodology for using them.
Python and R should appear as left and right hands in data analysis, not opponents.
The complete process of many data analysis projects includes:
Requirements definition → data acquisition → data governance → data analysis → data visualization
Python has obvious advantages in data acquisition and data governance. For web crawling, for example, there is Scrapy, a widely used Python framework. Data governance involves many fiddly data-cleaning tasks, and using the more flexible Python here greatly improves efficiency. The end result is a “clean” data set that is ready to be analyzed.
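As a rough illustration of the kind of small cleaning task meant here, the following sketch uses only Python's standard library. The data, the column names, and the clean_rows helper are all hypothetical; a real project would more likely load files and might use a library such as pandas.

```python
import csv
import io

# Hypothetical raw data: stray whitespace, inconsistent case,
# a missing age, and a duplicated row.
RAW = """name,city,age
 Alice ,london,34
Bob,PARIS,
 Alice ,london,34
Carol,Berlin,29
"""


def clean_rows(text):
    """Trim whitespace, normalise city case, drop incomplete and duplicate rows."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        name = row["name"].strip()
        city = row["city"].strip().title()
        age = row["age"].strip()
        if not age:  # skip rows with a missing age
            continue
        record = (name, city, int(age))
        if record in seen:  # skip exact duplicates
            continue
        seen.add(record)
        cleaned.append(record)
    return cleaned


rows = clean_rows(RAW)
print(rows)  # [('Alice', 'London', 34), ('Carol', 'Berlin', 29)]
```

Each rule is one or two lines, which is the flexibility that makes Python a comfortable fit for this stage.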
The data analysis step actually begins as a process of exploratory analysis. R is much more convenient than Python in this respect: you can quickly understand the overall characteristics of a data set through functions such as dim() and summary(). Afterwards, you can go back to the data governance step and complete the model's feature extraction with Python. Furthermore, R is rich in analysis models. Once the raw data has been processed, it takes only a few lines of code to get the desired data mining results.
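For readers who know Python better than R, here is a rough sketch of what dim() and summary() report, written with the standard library only. The functions below are hypothetical stand-ins, not R's implementations; R's summary() also prints quartiles, which are left out here, and the toy data set is made up.

```python
import statistics

# Hypothetical toy data set: one list of values per column.
data = {
    "height": [150.0, 160.0, 170.0, 180.0],
    "weight": [50.0, 60.0, 70.0, 80.0],
}


def dim(table):
    """Return (rows, columns), similar in spirit to R's dim()."""
    n_cols = len(table)
    n_rows = len(next(iter(table.values()))) if table else 0
    return n_rows, n_cols


def summary(column):
    """Return a few of the statistics that R's summary() prints for a numeric column."""
    return {
        "min": min(column),
        "median": statistics.median(column),
        "mean": statistics.mean(column),
        "max": max(column),
    }


shape = dim(data)
height_summary = summary(data["height"])

print(shape)           # (4, 2)
print(height_summary)  # {'min': 150.0, 'median': 165.0, 'mean': 165.0, 'max': 180.0}
```

In R each of these is a single built-in call on the data frame, which is the convenience this paragraph is describing.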
The last step is data visualization. When data analysts present their results to others, intuitive graphics and images often help people understand better. I do feel that data visualization is the world of JS, d3.js, django, leaflet and so on. So should we learn JS as well? (Just thinking about it makes me tired.) Here again I marvel at the richness of R packages. Most of the JS visualization libraries currently used in business analysis have already been wrapped by R experts, and the results can be packaged and published with Shiny, R's web framework (I used Shiny to present the results of my graduate thesis), to show customers various business scenarios directly, which is very efficient.