with Ed Borasky (@znmeb on Twitter)
We can let the machine do its own work. It's a hands-off approach to managing.
Natural Language Processing, Data Hacks, Genetic Algorithms, K-Nearest-Neighbor Algorithms, Clustering, Support Vector Machines, Scalability (a huge problem, since some algorithms run in O(n^3) time or worse)
Categorizing articles in RSS feeds to make a daily paper from the blogs you read. Finding new, eye-opening news sources and having them brought to us. Sentiment analysis: commercial applications of determining whether someone has a positive or negative opinion of the product they are talking about. This is a difficult problem, complicated by sarcasm and other factors in how language is used. How can a machine bridge the gap between correlation and causality?
Latent Semantic Analysis uses singular value decomposition to reduce a person's last 200 tweets to simple commonalities (shared subjects). We treat each person's tweets as a single document and build a matrix of the terms they used. This can be very slow in R.
Bayesian classifiers to filter out annoying tweets. Every tweet is run through a constant-time calculation to determine its class.
RSS vs. Twitter as a data source. Blogs are more focused on specific topics.
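The Latent Semantic Analysis step described above can be sketched roughly as follows. This is only a minimal illustration in Python with NumPy, not the code discussed in the session. The toy `docs` data, the function names, and the choice of `k=2` latent dimensions are all assumptions made for the example.

```python
import numpy as np

# Hypothetical toy data: each person's recent tweets joined into one document.
docs = {
    "alice": "r statistics regression r plots statistics",
    "bob": "regression statistics r models",
    "carol": "coffee bikes portland coffee",
    "dave": "portland bikes beer coffee",
}

def term_document_matrix(docs):
    """Return (vocabulary, users, term-by-user matrix of raw word counts)."""
    users = list(docs)
    vocab = sorted({word for text in docs.values() for word in text.split()})
    index = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(users)))
    for j, user in enumerate(users):
        for word in docs[user].split():
            counts[index[word], j] += 1
    return vocab, users, counts

def lsa_user_vectors(counts, k=2):
    """Project each user (a column of counts) into k latent dimensions."""
    u, s, vt = np.linalg.svd(counts, full_matrices=False)
    # Keep only the k largest singular values; result has one row per user.
    return (np.diag(s[:k]) @ vt[:k]).T

def cosine(a, b):
    """Cosine similarity between two latent-space vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vocab, users, counts = term_document_matrix(docs)
vectors = lsa_user_vectors(counts, k=2)
sim_alice_bob = cosine(vectors[0], vectors[1])    # users with shared subjects
sim_alice_carol = cosine(vectors[0], vectors[2])  # users with no shared terms
```

In this toy data, alice and bob end up close together in the latent space while alice and carol do not. A real run would likely weight the counts (for example with TF-IDF) and drop stop words first.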
The benefit to Netflix is in the hundreds of millions of dollars.
Come see the "Write Your Own Bayesian Classifier" talk by John Melesky at Open Source Bridge.
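As a rough illustration of the Bayesian tweet filter mentioned in these notes, here is a minimal naive Bayes sketch in Python. It is not the implementation from the talk. The class name `NaiveBayesFilter`, the labels, and the training tweets are hypothetical, and add-one (Laplace) smoothing is an assumption chosen to keep the example short. Running totals are kept during training so that classifying a tweet costs time proportional to its length, which is bounded for tweets, rather than to the size of the training set.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesFilter:
    """Word-count naive Bayes classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.total_words = Counter()             # label -> words seen
        self.doc_counts = Counter()              # label -> tweets seen
        self.vocab = set()                       # every word seen in training

    def train(self, text, label):
        words = text.lower().split()
        self.doc_counts[label] += 1
        self.word_counts[label].update(words)
        self.total_words[label] += len(words)
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            # Log prior plus the sum of smoothed log likelihoods per word.
            score = math.log(n_docs / total_docs)
            denom = self.total_words[label] + len(self.vocab)
            for word in words:
                score += math.log((self.word_counts[label][word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)

# Hypothetical training data.
tweet_filter = NaiveBayesFilter()
tweet_filter.train("win a free ipad click here now", "annoying")
tweet_filter.train("free followers click this link", "annoying")
tweet_filter.train("slides from my r talk are online", "ok")
tweet_filter.train("reading about singular value decomposition in r", "ok")

label_spammy = tweet_filter.classify("click here for free followers")
label_normal = tweet_filter.classify("my talk about r")
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together.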
Possibility of an R language school that would meet twice. First day: how to install and set up R. Second day: some modeling and data analysis.
Toby Segaran's book, "Programming Collective Intelligence" (O'Reilly, 2007).
R has a comprehensive NLP library that allows clustering and other techniques.
Python helper libraries for faster numeric computing: SciPy and NumPy.