A PhD takes 5 to 6 years on average. For those who aren’t familiar with the academic world, this length often causes some confusion: what can you possibly keep learning for 5 years? To help answer this question, I’ve collected the progress reports I wrote during various stages of my PhD. These reports were a condition of my funding from the Hertz Foundation, which kindly supported me through 5 years of the degree. The Foundation requested reports on my research progress twice a year to demonstrate that I was making effective use of my scholarship. Each report below was written at the date indicated and has only been lightly edited, so it provides a glimpse into my thoughts as I progressed through the PhD. The reports start in the Spring of 2014, my second year in the program, so I’ve written a brief introduction covering the 2012-2013 year to provide context. The reports were written for my scholarship committee and don’t discuss all my emotional highs and lows, but they do provide an accurate snapshot of my progress and hopes through the degree.
Prelude: Fall 2012 to Fall 2013
My first year in the program was mostly taken up by coursework. I took multiple classes on the theory and practice of machine learning, along with various classes on algorithm design. None of these classes particularly set my mind on fire, but machine learning was in the air, so I figured it was worth learning. I also “rotated” through various research groups. Stanford’s computer science department asked students to rotate between research groups, spending 3 months in each, to assess mutual fit between students and faculty. For those not familiar, the PhD is basically an extended apprenticeship between the student and their faculty advisor, so personal fit is absolutely critical. I rotated with two machine learning groups and a systems software group, and none of them particularly felt like a fit for me. The one bright spot in the year was landing the Hertz Fellowship, which guaranteed five years of research support. With the fellowship in hand, I resolved to try researching things outside my core competency of mathematical computer science. I took the summer off and visited family. I read “The Emperor of All Maladies,” which got me very interested in ideas around disease, so I resolved to talk to the computational biology faculty despite the fact that my last class in the life sciences had been in 9th grade. I talked to a dozen or so faculty, but felt drawn to the Pande group, which had worked for years on simulating proteins. Vijay (Pande) was interested in applying machine learning more systematically to the group’s work and encouraged me to hang around and spend time with his students. I ended up sticking around, and in the fall of 2013 wrote a short paper on using hidden Markov models to analyze protein simulations, which we submitted to ICML (the International Conference on Machine Learning).
Hertz Report Spring 2014:
This quarter, I submitted a paper to NIPS (Neural Information Processing Systems) on the use of switching linear dynamical systems to analyze protein simulation datasets. Our method works by learning locally quadratic approximations to the protein’s energy landscape, which can then be used to reconstruct the protein’s dynamics. I used the algorithm to analyze datasets for the opioid growth factor met-enkephalin and the proto-oncogene Src kinase. Our results show significant improvements in model fidelity over previous analysis methods.
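The report is terse about what “locally quadratic” means in practice, so here is a minimal sketch of the core idea, under assumptions not spelled out above: within a quadratic energy well, the dynamics are approximately linear, so one can fit a separate linear model x_{t+1} ≈ A_k x_t + b_k for each metastable state. The helper `fit_local_linear_dynamics` and the hand-assigned `labels` are hypothetical; the actual method infers the state assignments jointly with the dynamics rather than taking them as given.

```python
import numpy as np

def fit_local_linear_dynamics(X, labels):
    """Fit x_{t+1} ~ A_k x_t + b_k separately for each metastable
    state k, given per-frame state labels. A locally quadratic
    energy well corresponds to locally linear dynamics."""
    models = {}
    for k in np.unique(labels[:-1]):
        # Frames assigned to state k (whose successor frame exists).
        idx = np.where(labels[:-1] == k)[0]
        X_t, X_next = X[idx], X[idx + 1]
        # Least squares for [A_k | b_k] via an augmented design matrix.
        design = np.hstack([X_t, np.ones((len(idx), 1))])
        W, *_ = np.linalg.lstsq(design, X_next, rcond=None)
        models[k] = (W[:-1].T, W[-1])  # (A_k, b_k)
    return models

# Toy usage: a 2D trajectory with two hand-labeled regimes.
rng = np.random.default_rng(0)
X = np.cumsum(rng.normal(size=(500, 2)), axis=0)
labels = np.repeat([0, 1], 250)
models = fit_local_linear_dynamics(X, labels)
A0, b0 = models[0]
print(A0.shape, b0.shape)  # (2, 2) (2,)
```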
The implementation required for this paper took up the majority of my time over the last few months. The method uses semidefinite programming to compute approximate dynamics for the protein. However, standard semidefinite solvers cannot efficiently solve the large problems needed for protein analysis. As a result, I implemented a new semidefinite solver based upon the Frank-Wolfe algorithm. Debugging and stabilizing this implementation was a major effort. In future work, I plan to extend this library to enable learning for even larger biological and quantum-mechanical datasets. [Note: this never ended up happening…]
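The report doesn’t describe the solver’s internals, but the appeal of Frank-Wolfe for semidefinite programs is easy to show in miniature: over the spectrahedron (PSD matrices with unit trace), each linear subproblem is solved in closed form by a rank-one matrix built from a single extreme eigenvector, so no general-purpose SDP machinery is needed and iterates stay low-rank. The sketch below is a generic illustration of that idea, not my actual solver; `frank_wolfe_sdp` and the toy projection objective are made up for the example.

```python
import numpy as np

def frank_wolfe_sdp(grad_f, n, num_iters=200):
    """Minimal Frank-Wolfe sketch for min f(X) over the spectrahedron
    {X PSD, trace(X) = 1}. The step's linear subproblem min_S <grad, S>
    is solved by the rank-one matrix v v^T, where v is the eigenvector
    of grad_f(X) with the smallest eigenvalue."""
    X = np.eye(n) / n  # feasible starting point
    for t in range(num_iters):
        G = grad_f(X)
        # Extreme point of the feasible set minimizing <G, S>.
        eigvals, eigvecs = np.linalg.eigh(G)
        v = eigvecs[:, 0]  # eigenvector for the smallest eigenvalue
        S = np.outer(v, v)
        gamma = 2.0 / (t + 2.0)  # standard diminishing step size
        X = (1 - gamma) * X + gamma * S
    return X

# Toy usage: f(X) = 0.5 * ||X - M||_F^2 projects M onto the spectrahedron.
rng = np.random.default_rng(0)
M = rng.normal(size=(10, 10)); M = (M + M.T) / 2
X = frank_wolfe_sdp(lambda X: X - M, n=10)
print(np.trace(X), np.linalg.eigvalsh(X).min() >= -1e-9)
```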
I took and passed a class on optimization theory this spring quarter (for Credit/No Credit). I also spent some time revising the paper we submitted to ICML earlier this year; we had to make a few changes to satisfy the reviewers. The revised paper was accepted, and I will be attending ICML in Beijing this summer.
The remainder of my time was spent doing preparatory work for a new collaboration with Google. Another graduate student and I submitted a proposal this spring to the Google Assisted Science (GAS) team on using Google’s scalable deep learning systems to improve rational drug design. Standard approaches to the virtual screening of candidate drugs depend upon simple similarity measures to known active compounds to identify possible new drugs; deep learning provides a richer toolbox, which may enable the identification of novel families of drugs for infectious diseases. After a few meetings with the GAS team, our proposal was recently accepted, and our lab has started collaborating with Google. Over the last few weeks, we have performed some preliminary exploration and testing of deep learning methods on a Stanford cluster. I will intern at Google this summer to do the necessary implementation and run the required experiments on Google’s systems.
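For readers unfamiliar with virtual screening, here is a toy illustration of the “simple similarity measure” baseline mentioned above: candidate compounds are encoded as binary fingerprints and ranked by their Tanimoto similarity to known active compounds. The random bit vectors stand in for real molecular fingerprints (e.g., circular/ECFP-style bit vectors), and `tanimoto` is just an illustrative helper.

```python
import numpy as np

def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between binary fingerprints:
    the workhorse of similarity-based virtual screening."""
    both = np.sum(fp_a & fp_b)
    either = np.sum(fp_a | fp_b)
    return both / either if either else 0.0

# Toy usage with random binary "fingerprints" standing in for
# real molecular fingerprints of known actives and a candidate.
rng = np.random.default_rng(0)
known_actives = rng.integers(0, 2, size=(100, 1024), dtype=np.uint8)
candidate = rng.integers(0, 2, size=1024, dtype=np.uint8)
# Score the candidate by its best similarity to any known active.
score = max(tanimoto(candidate, fp) for fp in known_actives)
print(round(score, 3))
```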
Hertz Report Fall 2014:
I’ve spent the last six months working on a deep learning system to improve virtual screening for drug discovery. The goal of the system is to computationally evaluate whether drug-like compounds are suitable for modulating given biological targets. Our results show robust improvements over baseline algorithms for this task, and we plan to submit them to ICML in February.
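As a rough counterpart to the similarity baseline sketched earlier, the learned approach trains a model from fingerprints to activity labels. The snippet below is a deliberately tiny stand-in (a one-hidden-layer network trained on synthetic data), not the actual system described in this report, which ran on Google’s infrastructure with far richer architectures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in data: 1024-bit "fingerprints" with synthetic labels
# (1 = active against the target), in place of real assay data.
X = rng.integers(0, 2, size=(512, 1024)).astype(np.float64)
w_true = rng.normal(size=1024)
y = (X @ w_true > np.median(X @ w_true)).astype(np.float64)

# One hidden ReLU layer trained by full-batch gradient descent:
# fingerprint -> hidden features -> predicted probability of activity.
W1 = rng.normal(scale=0.01, size=(1024, 64)); b1 = np.zeros(64)
W2 = rng.normal(scale=0.01, size=64); b2 = 0.0
lr = 0.1
for step in range(500):
    h = np.maximum(X @ W1 + b1, 0.0)          # hidden ReLU activations
    z = np.clip(h @ W2 + b2, -30.0, 30.0)     # clip logits for stability
    p = 1.0 / (1.0 + np.exp(-z))              # predicted P(active)
    d_logit = (p - y) / len(y)                # gradient of cross-entropy
    dW2 = h.T @ d_logit; db2 = d_logit.sum()
    d_h = np.outer(d_logit, W2) * (h > 0)     # backprop through ReLU
    dW1 = X.T @ d_h; db1 = d_h.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("train accuracy:", np.mean((p > 0.5) == y))
```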