TOC

1. What are the most important machine learning techniques? 10

Solution 10

2. Why is it important to have a robust set of metrics for machine learning? 11

Solution 11

Code 12

3. Why are feature extraction and engineering so important in machine learning? 12

Solution 12

4. Can you provide an example of feature extraction? 14

Solution 14

Code 14

5. What is a training set, a validation set, a test set and a gold set in supervised and unsupervised learning? 15

Solution 15

6. What is the bias-variance tradeoff? 16

Solution 16

7. What is cross-validation and what is overfitting? 17

Solution 17

Code 18

8. Why are vectors and norms used in machine learning? 18

Solution 18

Code 19

9. What are the essential NumPy, SciPy, and Spark datatypes? 19

Solution 19

Code 20

10. Can you provide an example of Map and Reduce in Spark? (Let’s compute the Mean Squared Error) 20

Solution 20

Code 21

11. Can you provide examples of other computations in Spark? 22

Solution 22

Code 25

12. How does Python interact with Spark? 26

Solution 26

13. What support does Spark provide for machine learning? 26

Solution 26

14. How does Spark work in a parallel environment? 27

Solution 27

Code 27

15. What is the mean, the variance, and the covariance? 27

Solution 27

Code 28

16. What are percentiles and quartiles? 28

Solution 28

Code 28

17. Can you load an XML file into Python pandas? 29

Solution 29

Code 29

18. Can you read HTML into Python pandas? 30

Solution 30

Code 30

19. Can you read JSON into Python pandas? 31

Solution 31

Code 31

20. Can you plot a function in Python? 31

Solution 31

Code 31

21. Can you represent a graph in Python? 32

Solution 32

Code 32

22. What is an IPython notebook? 33

Solution 33

Code 33

23. What is a convenient tool for performing data statistics? 34

Solution 34

Code 34

24. What is a convenient way to visualize data statistics? 35

Solution 35

Code 35

25. How do you compute covariance and correlation matrices with pandas? 36

Solution 36

Code 36

26. Can you provide an example of connecting to the Twitter API? 37

Solution 37

Code 37

27. Can you provide an example of connecting to the LinkedIn API? 39

Solution 39

Code 39

28. Can you provide an example of connecting to the Facebook API? 39

Solution 39

Code 40

29. What is TFxIDF? 40

Solution 40

Code 40

30. What is “feature hashing”, and why is it useful for big data? 41

Solution 41

31. What is “continuous feature binning”? 42

Solution 42

32. What is Lp normalization? 42

Solution 42

Code 42

33. What is chi-square selection? 42

Solution 42

34. What is mutual information and how can it be used for feature selection? 43

Solution 43

35. What is a loss function, what are linear models, and what do we mean by regularization parameters in machine learning? 43

Solution 43

36. What is an odds ratio? 46

37. What is a sigmoid function and what is a logistic function? 46

Code 47

38. What is gradient descent? 47

Solution 47

39. What is stochastic gradient descent? 49

Solution 49

Code 49

40. What is linear least squares regression? 50

Solution 50

Code 51

41. What are Lasso, Ridge, and ElasticNet regularizations? 52

Solution 52

42. What is logistic regression? 52

Solution 52

Code 53

43. What is stepwise regression? 54

Solution 54

44. How do you include nonlinear information in linear models? 54

Solution 54

45. What is a Naïve Bayes classifier? 55

Solution 55

46. What are Bernoulli and Multivariate Naïve Bayes? 57

Solution 57

Code 58

47. What is a Gaussian? 59

Solution 59

Code 59

48. What is standard scaling? 60

Solution 60

Code 60

49. Why are statistical distributions important? 61

Solution 61

Code 63

50. Can you compare your data with some distribution? What is a qq-plot? 63

Solution 63

Code 63

51. What is a Gaussian Naïve Bayes? 64

Solution 64

52. What is another way to use Naïve Bayes with continuous data? 64

Solution 64

53. What is nearest neighbor classification? 65

Solution 65

Code 66

54. What are Support Vector Machines (SVM)? 66

Solution 66

Code 68

55. What are SVM kernel tricks? 68

Solution 68

56. What is K-Means Clustering? 70

Solution 70

Code 71

57. Can you provide an example of text classification with Spark? 71

Solution 71

Code 71

58. Where to go from here 72

Appendix A 75

59. Ultra-Quick introduction to Python 75

60. Ultra-Quick introduction to Probabilities 76

61. Ultra-Quick introduction to Matrices and Vectors 76
