Speaking as a Pythonista (but one who is in love with ggplot2 and dplyr), the wonderful thing about IPython Notebook is that it's possible to inline R code with no more fuss than adding "%%R" in a cell.
This is my biggest beef with R. It is constantly changing the dimensions and types of your data without telling you. Want to grab some subset of the rows of a matrix? Better add some extra post-processing in case there's only one row that satisfies your query, or else R will change its type!
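For comparison, NumPy draws the same keep-vs-drop line, but it does so syntactically rather than by inspecting the data: an integer index drops a dimension, while a slice or boolean mask keeps it — so a filter that happens to match one row stays a matrix. A minimal sketch (the arrays here are made up):

```python
import numpy as np

A = np.arange(6).reshape(3, 2)

# Integer indexing drops a dimension (like R's default drop=TRUE):
row = A[0]            # shape (2,) -- a 1-D vector, no longer a matrix
# Slice indexing keeps the shape (like R's drop=FALSE):
row_kept = A[0:1]     # shape (1, 2) -- still 2-D

# Boolean row selection always stays 2-D, even when only one row matches:
sub = A[A[:, 0] > 2]  # shape (1, 2)
print(row.shape, row_kept.shape, sub.shape)
```

In R, the equivalent `m[m[,1] > 2, ]` would silently collapse to a plain vector when exactly one row matches, unless you remember to write `m[m[,1] > 2, , drop=FALSE]`.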
The solution is not to make the programmer memorize obscure edge cases.
Although I think the newer defaults are more reasonable (I've got multiple commits at work with messages bemoaning drop=FALSE), they can ironically also mess you up if you got used to the old ones :)
There's no question that many defaults seem wonky to many users; however, you have to take into account that the use cases when the language was created (particularly going back to S) aren't the same as they are now. tl;dr: classic statistics isn't the same as contemporary data science.
I am torn, because Hadley Wickham's tools are truly wonderful, but the underlying R language is such a mess. For example, R has lazy evaluation despite being an imperative stateful language.
I wish Hadley had developed these tools for some other language, such as Python, or in a language-agnostic way. Hopefully that is the direction things will go in the future.
It's Python that's really a mess for data science; you can't avoid it being a programming language first and a tool for data science a distant tenth. Syntax that only a programmer would like is necessary, and quite a bit of it at that. R is a much better fit for people who want to do statistics first, and as little programming as possible.
Things like function parameters being promises make it far easier to deal with functions like optimizers, where there really are 10+ tuning parameters or things you may want to tweak. Iterative languages are far easier to understand for people who don't want to be programmers.
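To make the promises point concrete: R evaluates default arguments lazily, so one default can be written in terms of another argument (e.g. max_iter = 10 * length(x0)). Python's closest idiom is a None sentinel resolved inside the body. A hypothetical sketch (the function and parameter names are made up, not any real optimizer's API):

```python
def minimize(f, x0, tol=1e-8, max_iter=None, step=None):
    # In R, a default like max_iter = 10 * length(x0) is a promise,
    # evaluated lazily when first used; Python needs None sentinels.
    if max_iter is None:
        max_iter = 10 * len(x0)
    if step is None:
        step = tol ** 0.5
    # (a real optimizer would iterate here; this sketch just reports
    # the resolved tuning parameters)
    return {"max_iter": max_iter, "step": step}

print(minimize(lambda x: sum(v * v for v in x), [1.0, 2.0, 3.0]))
```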
You cannot develop plyr or ggplot in a language-agnostic way, because they need the purpose-built syntax R has. Contrast this with, e.g., the fight in Python to get an infix matrix multiplication operator.
I don't see how not having innate language support for an infix matrix multiplication operator matters. In R, all "infix" operators are really functions anyway[1], so you could write your library that way. Alternatively, you could use Python's operator overloading for infix operators. (Also, since when was Python considered not iterative? And for that matter, doesn't R's widespread use of *apply make it more functional anyway?)
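As a sketch of the overloading route: a Python class can define specially named methods like __matmul__, so a library can supply infix behavior without a language change. The toy 2x2 matrix type below is made up purely for illustration:

```python
class Mat:
    # Toy 2x2 matrix: shows that Python infix operators are just
    # specially named methods, much as R's operators are functions.
    def __init__(self, a, b, c, d):
        self.v = (a, b, c, d)

    def __matmul__(self, other):          # implements the @ operator
        a, b, c, d = self.v
        e, f, g, h = other.v
        return Mat(a*e + b*g, a*f + b*h,
                   c*e + d*g, c*f + d*h)

I = Mat(1, 0, 0, 1)
M = Mat(1, 2, 3, 4)
print((I @ M).v)   # identity times M gives M back
```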
[1]: Here is a simple example of that for +: calling "+"(1, 2) returns 3. Note the very strange overloading of quotes.
I well understand R's operators, but why on earth is that relevant?
math:
S = ( H β − r )^T * ( H V H^T )^{-1} * ( H β − r )
Python, ugly mess:
S = (H.dot(beta) - r).T.dot(inv(H.dot(V).dot(H.T))).dot(H.dot(beta) - r)
Python, better (although @ is an ugly matrix operator):
S = (H @ beta - r).T @ inv(H @ V @ H.T) @ (H @ beta - r)
The latter is an order of magnitude easier to understand, and looks just like the math. Having one layer of indirection (math to code) is far better than two (math to code to obfuscated code) just because the language won't let you create infix operators.
Edit: examples stolen from the matrix multiplication operator PEP (PEP 465)
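For what it's worth, the two spellings really do compute the same scalar; a quick sanity check with random made-up data (the shapes are arbitrary):

```python
import numpy as np
from numpy.linalg import inv

rng = np.random.default_rng(0)
H = rng.standard_normal((3, 4))
V = np.eye(4)
beta = rng.standard_normal(4)
r = rng.standard_normal(3)

# method-call spelling vs @ spelling of S = (Hb - r)^T (H V H^T)^-1 (Hb - r)
S_dot = (H.dot(beta) - r).T.dot(inv(H.dot(V).dot(H.T))).dot(H.dot(beta) - r)
S_at = (H @ beta - r).T @ inv(H @ V @ H.T) @ (H @ beta - r)

assert np.isclose(S_dot, S_at)
print(float(S_at))
```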
The problem is that you want to do everything in one line. The proper way to do this would be to save things like H.dot(beta) - r in a separate variable and compute it just once.
Moreover, if it's hidden in a library then why does the user care if it's ugly? It's the library designer's job to test it and make sure it's right.
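The refactoring suggested above might look like this (toy numbers, just so the snippet runs; resid and cov are names invented here for the shared subexpressions):

```python
import numpy as np
from numpy.linalg import inv

# hypothetical inputs, chosen only to make the example runnable
H = np.array([[1.0, 0.0], [0.0, 2.0]])
V = np.eye(2)
beta = np.array([1.0, 1.0])
r = np.array([0.0, 0.0])

resid = H.dot(beta) - r    # the residual (Hb - r), computed once
cov = H.dot(V).dot(H.T)    # H V H^T, also computed once
S = resid.T.dot(inv(cov)).dot(resid)
print(S)
```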
No, I simply want my math to look like math, instead of like a bunch of code that, after careful reading and some notes, implements some math. It's not a library function; it's code I write. R, MATLAB, and Julia all allow users to write something that looks very close to the actual math. Python doesn't see that as a priority.
I disagree. I think that with dplyr + ggplot2 you can go very far without dealing with R's warts, and I hope that the future of R lies in this direction. Honestly, R's syntax is not the problem; the "standard" library, with its inconsistencies, is the biggest problem.
That said, the other big area of complaint in R is the type system. We too often have to coerce types, but I'm not exactly sure of the solution for that.
Wickham et al.'s R packages are great, especially dplyr, and I think they should be taught to new R users pretty much right off the bat. I find R's syntax to be a big hangup for new learners, especially indexing and apply-to-each (sapply, mapply, just plain apply...), but dplyr really makes life much easier.
The %>% operator alone (which to be fair was originally from magrittr) is a great help. Not sure if this is my personal biases, but I always find it easier to read calls chained postfix-style.
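For readers coming from pandas, the closest analogue of a %>% pipeline is method chaining (with .pipe() for your own functions). A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["a", "a", "b", "b"],
    "mass":    [1.0, 3.0, 2.0, 6.0],
})

# dplyr: df %>% filter(mass > 1) %>% group_by(species) %>%
#            summarise(mean_mass = mean(mass))
out = (df[df["mass"] > 1]
         .groupby("species", as_index=False)["mass"]
         .mean()
         .rename(columns={"mass": "mean_mass"}))
print(out)
```

Like a %>% chain, each step reads top to bottom in the order it runs, rather than inside out.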
I'm actually reviewing a book due out this summer called "Data Computing" that introduces the "Hadley stack" as the way of getting started in data analysis and statistics. It's by a professor here in Minneapolis at Macalester College.
I agree with you about dplyr + ggplot and was pretty much gobsmacked at the obviousness of "this is the way it should be taught" and am glad I'm in the position to help review such a text!
I wonder if eventually this is the future of standard R.
It's often a tradeoff between conciseness in one domain and generality in others. A similar story: MATLAB is great for doing math and plotting, but I hated my life when I was developing a GUI in it. I later ported that project to Python, which was great for the GUI (relatively speaking) and a little less concise for the math. I find that tradeoff to be okay.
http://nbviewer.ipython.org/github/davidrpugh/cookbook-code/...
BTW: For pandas-dplyr dictionary: http://nbviewer.ipython.org/gist/TomAugspurger/6e052140eaa5f...