Software Engineer with a strong background in large-scale data analysis, information retrieval, data mining, and machine learning, in domains ranging from security, web-content classification, and online advertising to, more recently, mobile social gaming.
Currently I'm working on building Big Data pipelines and processing platforms using the Hadoop Ecosystem. I'm also working on making Hive queries 100X faster using the in-memory processing platform Spark/Shark from the AMP lab at UC Berkeley - https://amplab.cs.berkeley.edu/
Specialties: Shipping cool products, Getting stuff done.
Software Engineer - Analytics @
Built a distributed data collection farm across multiple data centers using Apache Flume NG and the Hadoop Distributed File System. Currently processing about 250 million game events daily.
Used Hadoop MapReduce to process game events and compute KPIs such as DAU, ARPDAU, and retention rates. The KPIs are stored in HBase and exposed via a REST interface.
Built a Click Server as a generic click-tracking mechanism. Currently used for analyzing the performance of marketing campaigns and click-through rates on emails and viral channels such as Facebook wall posts and Tweets.
Analyzing engagement and monetization behavior of users by demographics using Hive.
From August 2011 to Present (4 years 5 months)
Software Engineer @
A/B testing and large-scale data analysis using MapReduce on Amazon EC2 for optimizing various parameters for displaying the most relevant ad.
Deal performance prediction engine using least squares regression on publicly available deal data.
From January 2011 to July 2011 (7 months)
Research Analyst @
Statistical data analysis and research prototype development for anti-spam, anti-phishing and web content classification.
Web Categorization / Dynamic Content Analysis
Developed a tool to categorize web pages on the fly using a vector space model and cosine similarity to model documents per category. Evaluated efficacy using Confusion Matrices.
Developed a method to automatically extract relevant keywords per category using TF-IDF and labeled crawl data.
Developed a tool to identify the language of a document using stop/noise words as fingerprints.
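For illustration, the vector space categorization described above can be sketched as follows. This is a minimal sketch and not the original tool: the function names (`tf_vector`, `build_centroids`, `categorize`), the raw term-count weighting, whitespace tokenization, and the toy training data are all assumptions.

```python
import math
from collections import Counter


def tf_vector(text):
    """Map lowercased whitespace tokens to raw term counts (assumed weighting)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_centroids(labeled_docs):
    """Sum the term counts of all documents per category into one vector each."""
    centroids = {}
    for category, text in labeled_docs:
        centroids.setdefault(category, Counter()).update(tf_vector(text))
    return centroids


def categorize(text, centroids):
    """Return the category whose vector is most similar to the document."""
    vec = tf_vector(text)
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))


# Hypothetical toy data, for demonstration only.
training = [
    ("sports", "football match goal team"),
    ("finance", "stock market shares bank"),
]
centroids = build_centroids(training)
print(categorize("the team scored a goal", centroids))  # sports
```

A confusion matrix for such a categorizer would be built by counting (true category, predicted category) pairs over a held-out labeled set.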
Anti-Phishing
Built a linear classifier to distinguish phishing URLs and emails from legitimate ones. Work involved identifying various phishing heuristics, training a perceptron, and evaluating the efficacy of the classifier using ROC curves.
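For illustration, a perceptron over heuristic URL features can be sketched as follows. This is a minimal sketch and not the original classifier: the three heuristics (raw IP host, `@` in the URL, many dots in the host), the function names, and the toy URLs are assumptions chosen for the example.

```python
import re
from urllib.parse import urlparse


def url_features(url):
    """Turn a URL into binary heuristic features (hypothetical heuristics)."""
    host = urlparse(url).hostname or ""
    return [
        1.0 if re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host) else 0.0,  # raw IP as host
        1.0 if "@" in url else 0.0,                                    # '@' in the URL
        1.0 if host.count(".") > 3 else 0.0,                           # deeply nested host
    ]


def train_perceptron(samples, epochs=20, lr=1.0):
    """Classic perceptron rule; labels are +1 (phishing) or -1 (legitimate)."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in samples:
            activation = sum(w * x for w, x in zip(weights, features)) + bias
            if label * activation <= 0:  # misclassified or on the boundary: update
                weights = [w + lr * label * x for w, x in zip(weights, features)]
                bias += lr * label
    return weights, bias


def predict(features, weights, bias):
    """Return +1 for phishing, -1 for legitimate."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else -1


# Hypothetical toy training set, for demonstration only.
labeled_urls = [
    ("http://192.168.10.5/login", 1),
    ("http://bank.example@evil.example/verify", 1),
    ("http://secure.login.bank.example.evil.test/", 1),
    ("https://www.example.com/", -1),
    ("https://docs.example.org/help", -1),
]
samples = [(url_features(u), y) for u, y in labeled_urls]
weights, bias = train_perceptron(samples)
```

An ROC curve for such a classifier would be traced by sweeping a threshold over the raw score instead of fixing it at zero.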
Anti-Spam
Developed a system to gain insights into bad neighborhoods on the Internet using volume-weighted spam probabilities aggregated at the Autonomous System level - work presented at the MIT Spam Conference.
From June 2008 to December 2010 (2 years 7 months)
Teaching Assistant @
Graded exams and homework for the Introduction to Information Security class [Fall and Spring semester].
From August 2007 to July 2008 (1 year)
Global Markets Technology Intern @
Built software tools for structuring and pricing complex Mortgage Backed Securities.
Monte Carlo simulations for various interest rate scenarios.
From June 2007 to August 2007 (3 months)
Research Assistant @
Research work on anti-spam and anti-phishing with Prof. Nick Feamster - work presented at FloCon 2009, Scottsdale, Arizona.
From August 2006 to July 2007 (1 year)
Member of Technical Staff @
Software development for multiple client projects.
From June 2004 to July 2006 (2 years 2 months)
Research Intern @
Worked on improving garbage collection in Java through Heap Reference Analysis.
From July 2003 to June 2004 (1 year)
M.S, Computer Science @ Georgia Institute of Technology
From 2006 to 2008
Ph.D. [on leave...], Computer Science @ Georgia Institute of Technology
From 2006 to 2008
B.E, Computer Science and Engineering @ College of Engineering Pune
From 2000 to 2004
Sagar Mehta is skilled in: Information Retrieval, Data Analysis, Machine Learning, Text Classification, Hadoop, Hive, Flume, Python, Java, C++, Anti-Spam, Anti-Phishing, Oozie, Linux, Git
Websites:
http://projects.csail.mit.edu/spamconf/SC2010/agenda.html
http://www.cert.org/flocon/2009/proceedings.html
http://download.cnet.com/Search/3000-2379_4-10490738.html?tag=rb_content;main