Monday, October 26, 2015

A collection of Data Science Interview Questions Solved in Python and Spark: Hands-on Big Data and Machine Learning

Table of Contents

1. What are the most important machine learning techniques?
Solution
2. Why is it important to have a robust set of metrics for machine learning?
Solution
Code
3. Why are feature extraction and engineering so important in machine learning?
Solution
4. Can you provide an example of feature extraction?
Solution
Code
5. What is a training set, a validation set, a test set and a gold set in supervised and unsupervised learning?
Solution
6. What is the bias-variance tradeoff?
Solution
7. What is cross-validation and what is overfitting?
Solution
Code
8. Why are vectors and norms used in machine learning?
Solution
Code
9. What are the essential data types of NumPy, SciPy and Spark?
Solution
Code
10. Can you provide an example for Map and Reduce in Spark? (Let’s compute the Mean Squared Error)
Solution
Code
11. Can you provide examples for other computations in Spark?
Solution
Code
12. How does Python interact with Spark?
Solution
13. What support does Spark provide for machine learning?
Solution
14. How does Spark work in a parallel environment?
Solution
Code
15. What is the mean, the variance, and the covariance?
Solution
Code
16. What are percentiles and quartiles?
Solution
Code
17. Can you transform an XML file into Python Pandas?
Solution
Code
18. Can you read HTML into Python Pandas?
Solution
Code
19. Can you read JSON into Python Pandas?
Solution
Code
20. Can you draw a function from Python?
Solution
Code
21. Can you represent a graph in Python?
Solution
Code
22. What is an IPython notebook?
Solution
Code
23. What is a convenient tool for performing data statistics?
Solution
Code
24. What is a convenient way to visualize data statistics?
Solution
Code
25. How do you compute covariance and correlation matrices with pandas?
Solution
Code
26. Can you provide an example of connection to the Twitter API?
Solution
Code
27. Can you provide an example of connection to the LinkedIn API?
Solution
Code
28. Can you provide an example of connection to the Facebook API?
Solution
Code
29. What is TFxIDF?
Solution
Code
30. What is “feature hashing”? And why is it useful for Big Data?
Solution
31. What is “continuous feature binning”?
Solution
32. What is Lp normalization?
Solution
Code
33. What is chi-square selection?
Solution
34. What is mutual information and how can it be used for feature selection?
Solution
35. What is a loss function, what are linear models, and what do we mean by regularization parameters in machine learning?
Solution
36. What is an odds ratio?
37. What is a sigmoid function and what is a logistic function?
Code
38. What is gradient descent?
Solution
39. What is stochastic gradient descent?
Solution
Code
40. What is linear least squares regression?
Solution
Code
41. What are Lasso, Ridge, and ElasticNet regularizations?
Solution
42. What is logistic regression?
Solution
Code
43. What is stepwise regression?
Solution
44. How do you include nonlinear information in linear models?
Solution
45. What is a Naïve Bayes classifier?
Solution
46. What are Bernoulli and Multivariate Naïve Bayes?
Solution
Code
47. What is a Gaussian?
Solution
Code
48. What is standard scaling?
Solution
Code
49. Why are statistical distributions important?
Solution
Code
50. Can you compare your data with some distribution? What is a Q-Q plot?
Solution
Code
51. What is a Gaussian Naïve Bayes?
Solution
52. What is another way to use Naïve Bayes with continuous data?
Solution
53. What is nearest neighbor classification?
Solution
Code
54. What are Support Vector Machines (SVM)?
Solution
Code
55. What are SVM kernel tricks?
Solution
56. What is K-Means Clustering?
Solution
Code
57. Can you provide an example for Text Classification with Spark?
Solution
Code
58. Where to go from here
Appendix A
59. Ultra-quick introduction to Python
60. Ultra-quick introduction to Probabilities
61. Ultra-quick introduction to Matrices and Vectors
