This episode is about AMPLab at UC Berkeley, a five-year collaborative effort among students, researchers, and faculty focused on addressing the "Big Data" analytics problem. That problem? While massive amounts of computing power and huge amounts of data (from a broad variety of sources) are available, the software and algorithms necessary to take advantage of these opportunities are not as good as they need to be. "Algorithms" and "Machines" are the "AM" of "AMP," and the last piece of the puzzle, "People," refers to the fact that humans are often necessary to solve the problems that machine-learning algorithms cannot.
AMPLab has birthed a number of major leaps forward over the last few years, including Apache Spark, Apache Mesos, and Tachyon Nexus (which is discussed in Part II of this episode). For those interested in how the various AMPLab projects fit together, this chart is instructive.
In Part I, below, a16z's Michael Copeland speaks with Michael Franklin, co-founder and director of the AMPLab, and a16z's Peter Levine about the AMPLab model and a16z's relationship with the lab.
The story of AMPLab
Franklin and Ion Stoica (co-director of AMPLab) took two years off to start a company, and when they got back, academia seemed comparatively slow and quiet. A project at Berkeley called RADLab, which had organized a number of systems/machine learning experts to work on autonomic computing, was getting ready to wind down, and Franklin thought these people could be oriented towards "the big data revolution"—every company was getting more and more data, and would somehow need to manage it.
The ingredients to AMPLab's success
Levine argues the key is a macro trend in systems software. While, traditionally, systems software experts left Cisco/Oracle to create and join startups, their progress was more incremental. The new generation is working on the most interesting CS problems applied to systems software—big data, databases, OS software—and brings new thinking and new ways of doing computing.
Franklin points out that open source has made a lot of AMPLab's success possible. When he was a researcher, you would present new software ideas to Oracle/Microsoft, and they'd hire you, steal it, or ignore you. Now you can make a piece of software, blog about it, put it on GitHub, and if it's useful, people start trying it. The friction is much lower.
Corporate involvement in AMPLab
About half of the funding comes from NSF, DARPA, and the like, and half from companies (30 sponsors). The corporate support is valuable because the Lab can share plans and receive instant feedback, in addition to information on problems the companies are having. The engagement also makes the sponsors more inclined to use AMPLab's output. That said, the companies don't get any IP rights; everything is open source.
While open source may appear to be less profitable for Berkeley than patents and licensing, Franklin notes that the Lab has brought in eight figures of industrial donations, and that at most one university patent, ever, can match that (he speculates they might have an early web browser patent).
a16z's relationship with AMPLab (they've invested in three companies)
Levine is particularly attracted to the centralized mechanism for new project generation at AMPLab. In terms of how to monetize open source, for a16z, IP ownership is less important than who wrote the code. The ideal project-company has the inventors as founders (forks tend to be inferior), and AMPLab companies usually have this.
How does AMPLab know what's a good idea?
A lot of it is someone in the lab being passionate (clichéd, but true); projects posted on GitHub often get traction. Commercial potential is not high on the list of criteria. Franklin argues that those who believe that there's a dichotomy between good research and useful projects are wrong, pointing to AMPLab's success both in research prizes (such as the ACM dissertation awards) and commercially.
The most important near-term advances in Computer Science
According to Franklin, machine learning and deep learning are the next big thing, as we can now collect data and do real-time big-data work (à la Apache Spark). We can be much more predictive about the data we have. (New databases are also an interesting area.)
Franklin would also like to reach out from databases to affect the world—cloud robotics, drones, and Internet of Things. The old version of IoT was putting sensors out into the world, while the new version involves interacting with the world (including machines co-existing with people).
Levine notes that both universities and companies are pursuing the idea of moving "compute" (the process of computation) out to the endpoints (smartphones, PCs). Currently the world is centralized: most heavy computation happens in the cloud. Now the supercomputers in our hands are actually being used as computers to do real-time analytics, and not merely acting as displays.
People as the "P" in AMPLab
The role of people in the lab has evolved. At inception, the idea was that Algorithms, Machines (cloud computing, clusters), and People are the three types of resources available to make sense of data.
People's role, specifically, is in human computation and crowdsourcing. For example, Tim Kraska, in the early days of AMPLab, worked on a project called CrowdDB. If you asked the database a question without a network connection, it would ask the user, "I don't know, what do you think?" With a network connection, it would use Mechanical Turk and let the crowd answer. It was a "dumb" database that leveraged people to answer questions machines could not.
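The fallback behavior described above can be sketched as a toy in Python. This is only an illustration of the idea, not CrowdDB's actual design or interface: the class name `CrowdBackedStore`, the `ask_user` and `ask_crowd` callables, and the choice to remember a human answer for next time are all assumptions made for the example. In a real system, `ask_crowd` would post a task to a service such as Mechanical Turk; here it is just a function you pass in.

```python
from typing import Callable, Dict, Optional

# A hypothetical "answerer": takes a question, returns an answer from a human.
Answerer = Callable[[str], str]


class CrowdBackedStore:
    """Toy store that asks people when it has no stored answer."""

    def __init__(
        self,
        ask_user: Answerer,
        ask_crowd: Answerer,
        known: Optional[Dict[str, str]] = None,
    ) -> None:
        self.known: Dict[str, str] = dict(known or {})
        self.ask_user = ask_user    # used when there is no network connection
        self.ask_crowd = ask_crowd  # used when a network connection is available

    def query(self, question: str, online: bool) -> str:
        # Answer from stored data whenever possible.
        if question in self.known:
            return self.known[question]
        # Otherwise fall back to people: the crowd if online, the local user if not.
        source = self.ask_crowd if online else self.ask_user
        answer = source(question)
        # Assumption for this sketch: keep the human answer for future queries.
        self.known[question] = answer
        return answer


store = CrowdBackedStore(
    ask_user=lambda q: "user answer",
    ask_crowd=lambda q: "crowd answer",
    known={"capital of France": "Paris"},
)
print(store.query("capital of France", online=False))  # stored answer, nobody is asked
print(store.query("best pizza nearby", online=True))   # falls back to the crowd
```

The point of the sketch is the routing decision in `query`: the store itself stays "dumb," and the only logic is deciding which group of people to ask when the data runs out.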
These days, AMPLab is still doing a lot around getting people (individual experts/analysts or crowds) to do data cleaning, and to solve machine learning problems that the machines aren't up to snuff on. The Lab is also concerned with the fact that the ultimate results of most analyses will end up in front of a person, and with how best to present that output.