Top 3 Books a New Data Scientist Should Read

Photo by Paweł Czerwiński on Unsplash

Starting fresh in data science can be challenging sometimes. There are plenty of courses out there that promise to teach you everything from how to code to what is Principal Component Analysis (PCA) or train a Deep Neural Network. Some of them might do. But in my opinion, there are some books that can greatly benefit any data scientist. Shortlisting all the books I think are likely to be useful will rather confuse the reader as to where to start from. Besides I am sure there are tons of resources to find lots of books in this field. In this post, I will focus only on the top 3 books that in my opinion are sufficient to not just learn the basics but go far beyond. These books are rather classic (but still very relevant). If you are already a practicing data scientist who wants to learn about the latest developments in deep learning or so then this might not help you as much but still might provide you with some resources to refresh the fundamentals.

1. Machine Learning by Tom Mitchell

This is one of the first books I read to understand and learn about machine learning in general. In particular, I enjoyed the pages where it describes how a decision tree works; you know the building block of perhaps the more popular random forest approach. So, it teaches you to calculate entropy (the amount of information) recursively for each attribute/feature in your data. I encourage you to read that part as it is as straightforward as teaching it can get — check out page 52 onwards. The language is simple and there are plenty of details and examples taking you by the hand and showing you how actually the algorithm can learn patterns from the data.

2. Elements of Statistical Learning by Trevor Hastie

This book approaches machine learning from a statistical perspective, which I believe is essential to understand why machine learning algorithms actually work, what could maybe go wrong and understand the importance of having good data. The book spends a lot of pages on linear models but it is worth it as you will get slowly introduced to the concepts. But don’t think that this is just that. It covers everything including neural networks even if it doesn’t get into Deep Learning, which is a “recent” trend anyway! The examples presented along with the associated graphs help the reader grasp the concepts and gain an understanding of the described method. Just as a note, the book includes some maths and the corresponding math notation but don’t get put off by that. Just embrace how elegant formulas can capture the essence of the section. Besides, the description should be sufficient to understand the approach. Reading it is definitely worthwhile. It can also serve as a reference for data science practitioners to remind you of the model assumptions, differences between models and as a general data science refresher.

3. Pattern Recognition and Machine Learning by Christopher Bishop

The book needs no introduction. One of the best machine learning books ever, and recommended textbooks for the machine learning courses at Imperial College London, UK, a few years back. Starting with concepts that are crucial in machine learning like the curse of dimensionality, probabilities and distributions, decision theory and information theory to more advanced mathematical concepts required for machine learning. In order to follow the more advanced concepts, knowledge of mathematics and statistics would be useful. Nevertheless, a must-have book in your collection.


Classification Models Pros and Cons


Classification Model Pros and Cons (Generalized)

  • Logistic Regression
    • Pros
      • low variance
      • provides probabilities for outcomes
      • works well with diagonal (feature) decision boundaries
      • NOTE: logistic regression can also be used with kernel methods
    • Cons
      • high bias
  • Decision Trees
    • Regular (not bagged or boosted)
      • Pros
        • easy to interpret visually when the trees only contain several levels
        • Can easily handle qualitative (categorical) features
        • Works well with decision boundaries parellel to the feature axis
      • Cons
        • prone to overfitting
        • possible issues with diagonal decision boundaries
    • Bagged Trees : train multiple trees using bootstrapped data to reduce variance and prevent overfitting
      • Pros
        • reduces variance in comparison to regular decision trees
        • Can provide variable importance measures
          • classification: Gini index
          • regression: RSS
        • Can easily handle qualitative (categorical) features
        • Out of bag (OOB) estimates can be used for model validation
      • Cons
        • Not as easy to visually interpret
        • Does not reduce variance if the features are correlated
    • Boosted Trees : Similar to bagging, but learns sequentially and builds off previous trees
      • Pros
        • Somewhat more interpretable than bagged trees/random forest as the user can define the size of each tree resulting in a collection of stumps (1 level) which can be viewed as an additive model
        • Can easily handle qualitative (categorical) features
      • Cons
        • Unlike bagging and random forests, can overfit if number of trees is too large
  • Random Forest
    • Pros
      • Decorrelates trees (relative to bagged trees)
        • important when dealing with mulitple features which may be correlated
      • reduced variance (relative to regular trees)
    • Cons
      • Not as easy to visually interpret
  • SVM
    • Pros
      • Performs similarly to logistic regression when linear separation
      • Performs well with non-linear boundary depending on the kernel used
      • Handle high dimensional data well
    • Cons
      • Susceptible to overfitting/training issues depending on kernel
  • Neural Network (This section needs further information based on different types of NN’s)
  • Naive Bayes
    • Pros
      • Computationally fast
      • Simple to implement
      • Works well with high dimensions
    • Cons
      • Relies on independence assumption and will perform badly if this assumption is not met

Detecting malicious behaviour in participatory sensing settings

Security is crucial in modern computer systems hosting private and sensitive information. Our systems are vulnerable to a number of malicious threats such as ransomware, malware and viruses.  Recently, a global cyberattack (ransomware) affected hundred of organisations, most notably the UK’s NHS.  This malicious software “locked” the content stored on organisations’ hard drives, requiring money (to be paid in bitcoins) to “unlock” it and make it available back to their owners. Crowdsourcing (the practice of obtaining information by allocating tasks to a large number of people e.g. Wikipedia) is not immune of malicious behaviour. On the contrary, the very openness of such systems make them ideal for malicious users to alter, corrupt or falsify information (data poisoning). In this post, we present an environmental monitoring example, where ordinary people take air quality readings (using mobile equipment) to monitor air pollution of their city or neighbourhood (see our previous post for more details on this example). Arguably, some people participating in such environmental campaigns can be malicious. Specifically, instead of taking readings to provide information about their environment,  they might deviate by following their own secret agenda. For instance, a factory owner might alter the readings showing that their factory pollutes the environment. The impact of such falsification is huge as it basically changes the overall picture of the environment, which in turn leads authorities to wrong actions regarding urban planning.

We argue that Artificial Intelligence (AI) techniques can be of great help in this domain. Given that measurements have a spatio-temporal correlation, a non-linear regression model can be overlaid over the environment (see previous post). The tricky part however is to differentiate between truthful and malicious readings. A plausible solution is to extend the non-linear regression model by assuming that each measurement has an individual and independent noise (variance) from each other (heteroskedasticity). For instance, a Gaussian Process (GP) model can be initially used and then extended to Heteroskedastic GP (HGP). The consequence of this action is that this individual noise can indicate the deviation of each measurement compared to the truthful measurements, which can either be attributed to sensor noise (which is always present in reality) or in malicious readings. An extended version of HGP, namely Trust-HGP (THGP), assigns a trust parameter to the model that captures the possibility of each measurement being malicious between the interval of (0,1).  The details of the THGP model as well as how it is utilised in this domain will be presented end of October at the fifth AAAI conference on human computation and crowdsourcing (HCOMP 2017). Stay tuned!

How AI and humans can optimise air pollution monitoring

Air pollution is responsible for 7 million deaths per year according to World Health Organization (WHO). Thus, it is crucial to dedicate resources to learn and monitor air quality in cities to assist authorities in urban planning as well as bring awareness to people about the impact of air pollution to their everyday life. In our research, we provide the framework and the algorithms, utilising the power of Machine Learning to effectively monitor an environment over time.
In particular, our proposal relies on the willingness of people to participate in environmental air quality campaigns. People can use  mobile air quality devices to take readings in their city or their neighbourhood. However, the major issue is when and where these readings should be taken to efficiently monitor the city. People cannot provide an unlimited number of measurements and thus readings should be taken in a way such that information about the environment is maximised. In other words, we need to solve an optimisation problem constrained on the number of readings people can provide over a period of time to facilitate an efficient environment exploration.
In order to solve the problem, we need to model the environment in a certain way as well as a way to measure the information entailed in each reading (since we are interested in gaining the most information by taking a limited number of readings). To do that, we overlay a spatio-temporal stochastic process over the area of interest (Gaussian Processes). Gaussian processes can be used to interpolate over the environment, i.e., predict the air quality value at unobserved locations as well as predict the state of the environment into the future. Importantly, Gaussian Processes can also be used to provide a measure of uncertainty/information about each location in space and time (by utilising predictive variance).
The problem is evolved into taking a set of measurements such that a utility function, created based on predictive variance provided by Gaussian Processes, is maximised. Going a step forward, to solve this problem, we use techniques and algorithms from the broad areas of Artificial Intelligence and Multi-agent systems.
In particular, an intelligent agent can decide when and where measurements should be taken to maximise information gained about the air quality, while at the same time minimise the number of readings needed. The agent can employ greedy search techniques combined with meta-heuristics such as stochastic local search, unsupervised learning (clustering) and random simulations.
The main idea is to simulate the environment over time, asking what if kind of questions. What if i take a measurement now, and one in the night. What if i take measurement downtown or near the home. These kind of questions are answered  by running simulations on a cluster computing facility.
Finally, our findings indicate a significant improvement over other approaches.

Data Science Interview Preparation

This is a post for all out there who are trying to get to the data science industry, and not only. Data science is a term used to describe a wide variety of jobs in industry, which causes a confusion. In this post, I will summarise the most important questions, and answers related to data science as I understand it. If you feel you can contribute in the list of Q/A, or think some are inaccurate or something is missing, please feel free to contact me.

Sources: KDnuggets, Quora, Stackoverflow, analyticsvidhya, Udacity, Stackexchange, Wikipedia, Pattern Recognition and Machine Learning (Christopher Bishop), Machine Learning: A probabilistic Perspective (Kevin Murphy), Data Science Central, clockbackward, simplystatistics

What do you think a data scientist is/does?



What do you think are the most important skills for a data scientist to have?

Python Coding, Statistics, Machine Learning, SQL Database, Intellectual curiosity, Business acumen, Communication skills

Which machine learning model (classification vs. regression, for example) to use given a particular problem and what are the tradeoffs between different types of classification models/between different types of regression models?

Decision Trees:

  • Simple to understand and interpret. People are able to understand decision tree models after a brief explanation. Trees can also be displayed graphically in a way that is easy for non-experts to interpret.
  • Able to handle both numerical and categorical data. Other techniques are usually specialised in analysing datasets that have only one type of variable. (For example, relation rules can be used only with nominal variables while neural networks can be used only with numerical variables.)
  • Requires little data preparation. Other techniques often require data normalization. Since trees can handle qualitative predictors, there is no need to create dummy variables.
  • Uses a white box model. If a given situation is observable in a model the explanation for the condition is easily explained by boolean logic. By contrast, in a black box model, the explanation for the results is typically difficult to understand, for example with an artificial neural network.
  • Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.
  • Robust. Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.
  • Performs well with large datasets. Large amounts of data can be analysed using standard computing resources in reasonable time.
  • Mirrors human decision making more closely than other approaches.This could be useful when modeling human decisions/behaviour


  • Trees do not tend to be as accurate as other approaches.
  • Trees can be very non-robust. A small change in the training data can result in a big change in the tree, and thus a big change in final predictions.
  • The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristics such as the greedy algorithm where locally-optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally-optimal decision tree. To reduce the greedy effect of local-optimality some methods such as the dual information distance (DID) tree were proposed.
  • Decision-tree learners can create over-complex trees that do not generalize well from the training data. (This is known as overfitting.) Mechanisms such as pruning are necessary to avoid this problem (with the exception of some algorithms such as the Conditional Inference approach, that does not require pruning).
  • There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems. In such cases, the decision tree becomes prohibitively large. Approaches to solve the problem involve either changing the representation of the problem domain (known as propositionalization) or using learning algorithms based on more expressive representations (such as statistical relational learning or inductive logic programming).
  • For data including categorical variables with different numbers of levels, information gain in decision trees is biased in favor of those attributes with more levels. However, the issue of biased predictor selection is avoided by the Conditional Inference approach.


  • Guaranteed Optimality: Due to the nature of Convex Optimization, the solution is guaranteed to be the global minimum not a local minimum.
  • Abundance of Implementations: from libsvm(Phyton) to >>fitcsvm (Matlab) it is conveniently accessed. These libraries may be utilized even though the user is indifferent to the KKT Conditions, Lagrange Multipliers etc.
  • It is useful for both Linearly Seperable(hard margin) and Non-linearly Seperable(soft margin) data. The only thing to do is to come up with the optimal penalty variable C (the one that multiplies slack variables)
  • Conformity with Semi-Supervised Learning: It may be used in a dataset where some of the data are labeled and some are not. You only add an additional condition to the minimization problem and it is called Transductive SVM.
  • Feature Mapping might have been a burden on the computational complexity of the overall training performance; however, thanks to the ‘Kernel Trick’ the feature mapping is implicitly carried out via simple dot products. See the video for the magic of Kernels:


  • In Natural Language Processing, structured representations of text yield better performances. Sadly, SVMs can not accomodate such structures(word embeddings) and are used through Bag-of-Words representation which loses sequantiality information and leads to worse performance.
  • SVM in its vanilla form cannot return a probabilistic confidence value like logistic regression does, in some sense it’s not ‘explanatory’ enough. Confidence of prediction can be quite important in many applications.



  • Simple to implement
  • Flexible to feature / distance choices
  • Naturally handles multi-class cases
  • Can do well in practice with enough representative data


  • Large search problem to find nearest neighbours
  • Storage of data
  • Must know we have a meaningful distance function

Neural Nets


  •  Accuracy in image recognition problems.


  • High computational cost.
  • If you don’t have a good GPU they are quite slow to train (for complex tasks).
  • They use to need a lot of training data.

What is the pros and cons of Convolutional neural networks?. Available from: [accessed May 1, 2017].

Linear Regression


  • It is easy to implement on a computer using commonly available algorithms from linear algebra.
  • Its implementation on modern computers is efficient, so it can be very quickly applied even to problems with hundreds of features and tens of thousands of data points.
  • It is easier to analyze mathematically than many other regression techniques.
  • It is not too difficult for non-mathematicians to understand at a basic level.
  • It produces solutions that are easily interpretable (i.e. we can interpret the constants that least squares regression solves for).


  • Least squares regression can perform very badly when some points in the training data have excessively large or small values for the dependent variable compared to the rest of the training data.
  • All linear regression methods (including, of course, least squares regression), suffer from the major drawback that in reality most systems are not linear.
  • Curse of Dimensionality. Least squares regression is particularly prone to this problem, for as soon as the number of features used exceeds the number of training data points, the least squares solution will not be unique, and hence the least squares algorithm will fail.
  • The least squares method can sometimes lead to poor predictions when a subset of the independent variables fed to it are significantly correlated to each other.
  • Sum of squared errors is the best measure of error for us to try and minimize. The simple conclusion is that the way that least squares regression measures error is often not justified.
  • The level of noise in our data may be dependent on what region of our feature space we are in. Some points in our training data are more likely to be effected by noise than some other such points, which means that some points in our training set are more reliable than others. We don’t want to ignore the less reliable points completely (since that would be wasting valuable information) but they should count less in our computation, which is why there has to be homoskedasticity.
  • Selecting the wrong independent variables (i.e. features) for a prediction problem is one that plagues all regression methods.
  • While least squares regression is designed to handle noise in the dependent variable, the same is not the case with noise (errors) in the independent variables.

Non-linear Regression


  • Model complex real-life phenomena.


  • Parametric vs Non-Parametric
  • A lot of models to choose from

Explain the differences between parametric and non-parametric models


Parametric models assume some finite set of parameters. The complexity of the model is bounded even if the amount of data is unbounded.


Assume that the data distribution cannot be defined in terms of such a finite set of parameters.

How to find good predictors? How to find relationships between your variables?

Standardised regression coefficients

Change in R-squared when the variable is added to the model last

Different ways of controlling for model complexity?

Regularisation term. You could do things like reducing the number of parameters, for instance by reducing the depth of a decision tree, or sharing parameters, etc. Also, it always helps to have more data.

How to model a quantity that you can’t directly observe (using Bayesian approaches, for example, and when doing so, how to choose prior distributions).

You can use Monte-Carlo simulations to guess an approximate distribution of possible answers, and run the simulation many times.

Each data point can be associated latent (or unobserved) variables. Bayesian reasoning.

Explain the various numerical optimization techniques

Gradient Descent:  is a first-order iterative optimisation algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient(or of the approximate gradient) of the function at the current point. If instead one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent. Gradient descent is also known as steepest descent. However, gradient descent should not be confused with the method of steepest descent for approximating integrals.


Explain maximum likelihood and maximum a posteriori

Maximum Likelihood Estimation (MLE):  MLE can be seen as a special case of the maximum a posteriori estimation (MAP) that assumes a uniform prior distribution of the parameters, or as a variant of the MAP that ignores the prior and which therefore is unregularised. MLE takes the mean and variance as parameters and finds particular parametric values that make the observed results the most probable given the model.  The method of maximum likelihood selects the set of values of the model parameters that maximizes the likelihood function. Intuitively, this maximizes the “agreement” of the selected model with the observed data, and for discrete random variables it indeed maximizes the probability of the observed data under the resulting distribution

Maximum a posteriori (MAP): It is an estimate of an unknown quantity, that equals the mode of the posterior distribution. The MAP can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. It is closely related to Fisher’s method of maximum likelihood (ML) estimation, but employs an augmented optimization objective which incorporates a prior distribution (that quantifies the additional information available through prior knowledge of a related event) over the quantity one wants to estimate. MAP estimation can therefore be seen as a regularisation of ML estimation.

Dealing with correlated features in your data set, how to reduce the dimensionality of data?

Dealing with multicollinearity is important. In this situation the coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data. Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data set; it only affects calculations regarding individual predictors. That is, a multiple regression model with correlated predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with respect to others. Under perfect multicollinearity the feature matrix cannot be inverted, thus OLS is not possible.

Drop features. An explanatory variable may be dropped to produce a model with significant coefficients. However, you lose information (because you’ve dropped a variable). Omission of a relevant variable results in biased coefficient estimates for the remaining explanatory variables that are correlated with the dropped variable.

Leave the model as is, despite multicollinearity. The presence of multicollinearity doesn’t affect the efficiency of extrapolating the fitted model to new data provided that the predictor variables follow the same pattern of multicollinearity in the new data as in the data on which the regression model is based.

Obtain more data, if possible.

In order to reduce the dimensionality of the data you could use a dimensionality reduction technique (PCA, PPCA etc.) or feature selection, depending on the nature of your data.

What is feature extraction? Feature engineering? Feature Selection? How do you decide what features do you need?

Start from an initial set of measured data and builds derived values (features) intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps, and in some cases leading to better human interpretations.

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Feature engineering is fundamental to the application of machine learning, and is both difficult and expensive. The need for manual feature engineering can be obviated by automated feature learning.

Feature selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. Feature selection techniques are used for four reasons:

  • simplification of models to make them easier to interpret by researchers/users,
  • shorter training times,
  • to avoid the curse of dimensionality ,
  • enhanced generalization by reducing overfitting (formally, reduction of variance)

How you tested and validated the results? What types of crossvalidation do you know?

Leave-p-out cross-validation: Leave-p-out cross-validation (LpO CV) involves using p observations as the validation set and the remaining observations as the training set. This is repeated on all ways to cut the original sample on a validation set of p observations and a training set.

Leave-one-out cross-validation: Leave-one-out cross-validation (LOOCV) is a particular case of leave-p-out cross-validation with p = 1.

k-fold cross-validation: In k-fold cross-validation, the original sample is randomly partitioned into k equal sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged to produce a single estimation. The advantage of this method over repeated random sub-sampling (see below) is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used, but in general k remains an unfixed parameter. In stratified k-fold cross-validation, the folds are selected so that the mean response value is approximately equal in all the folds. In the case of a dichotomous classification, this means that each fold contains roughly the same proportions of the two types of class labels.

Holdout method: In the holdout method, we randomly assign data points to two sets d0 and d1, usually called the training set and the test set, respectively. The size of each of the sets is arbitrary although typically the test set is smaller than the training set. We then train on d0 and test on d1.

Repeated random sub-sampling validation: It randomly splits the dataset into training and validation data. For each such split, the model is fit to the training data, and predictive accuracy is assessed using the validation data. The results are then averaged over the splits. The advantage of this method (over k-fold cross validation) is that the proportion of the training/validation split is not dependent on the number of iterations (folds). The disadvantage of this method is that some observations may never be selected in the validation subsample, whereas others may be selected more than once. In other words, validation subsets may overlap. This method also exhibits Monte Carlo variation, meaning that the results will vary if the analysis is repeated with different random splits.As the number of random splits approaches infinity, the result of repeated random sub-sampling validation tends towards that of leave-p-out cross-validation.

When you get a new data set, what do you do with it to see if it will suit your needs for a given project?

Everyone might have a slightly different approach, but here is an answer that includes all the basic but necessary steps.

Figure out what I’m trying to do with the data. What questions can actually be answered with the data and what can’t. “Play” with the data to figure out what the data set “looks” like by looking at the types of variables I have and/or looking on what the first few observations and last few observations look like. If the data set is really big, I usually take a carefully chosen random subsample to make it possible to do my exploration interactively. After doing that I look for weird quirks, like if there are missing values or outliers. Clean/organize I usually use the first exploration to figure out things that need to be fixed so that I can mess around with a tidy data set. This includes fixing up missing value encoding.  After getting a handle with mostly text based tables and output (things that don’t require a graphics device) and cleaning things up a bit I start with plotting everything. Get the maximum amount of information about the data set in the minimal amount of time. So I do not make the graphs pretty (I think there is a distinction between exploratory and expository graphics). I do histograms and jittered one d plots to look at variables one by one. To compare the distributions of variables I usually use overlayed density plots. I make tons of scatterplots to look at relationships between variables. After I have a feel for the data I usually try to come up with a quick and dirty answer to the question I care about. This might be a simple predictive model (I usually use 60% training, 40% test) or a really basic regression model when possible, just to see if the signal is huge, medium or subtle. I use this as a place to start when doing the rest of the analysis.

by Jeff Leek (Edited by Alexandros Zenonos)

How do you handle big data sets — how would you start work on a project with an associated data set that was many tens of GB or larger?

In order to deal with big data, you need to adopt a big data approach. Think of Big Data architectures and technologies.

For example, Hadoop is an open-source software framework used for distributed storage and processing of big data sets using the MapReduce programming model. It consists of computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part which is a MapReduce programming model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code into nodes to process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.

MapReduce is composed of several components, including:

  • JobTracker — the master node that manages all jobs and resources in a cluster
  • TaskTrackers — agents deployed to each machine in the cluster to run the map and reduce tasks
  • JobHistoryServer — a component that tracks completed jobs, and is typically deployed as a separate function or with JobTracker

For example, users can list and count the number of times every word appears in a novel as a single server application, but that is time consuming. By contrast, users can split the task among 26 people, so each takes a page, writes a word on a separate sheet of paper and takes a new page when they’re finished. This is the map aspect of MapReduce. And if a person leaves, another person takes his place. This exemplifies MapReduce’s fault-tolerant element.

Apache Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines, that is maintained in a fault-tolerant way. It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark’s RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.

A NoSQL database provides a mechanism for storage and retrieval of data which is modeled in means other than the tabular relations used in relational databases. Such databases have existed since the late 1960s, but did not obtain the “NoSQL” moniker until a surge of popularity in the early twenty-first century, triggered by the needs of Web 2.0 companies such as Facebook, Google, and Amazon. NoSQL databases are increasingly used in big data and real-time web applications.

Apache Cassandra is a free and open-source distributed NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients.

Cassandra also places a high value on performance. In 2012, University of Toronto researchers studying NoSQL systems concluded that “In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments” although “this comes at the price of high write and read latencies.”

What are the assumptions required for linear regression?

(i) linearity and additivity of the relationship between dependent and independent variables:

(a) The expected value of dependent variable is a straight-line function of each independent variable, holding the others fixed.

(b) The slope of that line does not depend on the values of the other variables.

(c)  The effects of different independent variables on the expected value of the dependent variable are additive.

(ii) statistical independence of the errors (in particular, no correlation between consecutive errors in the case of time series data)

(iii) homoscedasticity (constant variance) of the errors

(a) versus time (in the case of time series data)

(b) versus the predictions

(c) versus any independent variable

(iv) normality of the error distribution.

Explain Kernel Trick.

The main idea for using the kernel trick in SVMs is to bring data to a higher dimension where they are separable.

What is precision and what is recall? What is a confusion matrix?

Precision= True Positives/ (True Positives + False Positives)
Recall=True Positives/ (True Positives + False Negatives)

A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known.

How do you detect outliers? How would you screen for outliers and what should you do if you find one?

Grubbs test: is a statistical test used to detect outliers in a univariate data set assumed to come from a normally distributed population.  That is, one should first verify that the data can be reasonably approximated by a normal distribution before applying the Grubbs’ test. Grubbs’ test detects one outlier at a time. This outlier is expunged from the dataset and the test is iterated until no outliers are detected. However, multiple iterations change the probabilities of detection, and the test should not be used for sample sizes of six or fewer since it frequently tags most of the points as outliers.

Tietjen-Moore Test: This is a generalization of the Grubbs’ test to the case of more than one outlier. It has the limitation that the number of outliers must be specified exactly.

Unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set under the assumption that the majority of the instances in the data set are normal by looking for instances that seem to fit least to the remainder of the data set. Supervised anomaly detection techniques require a data set that has been labeled as “normal” and “abnormal” and involves training a classifier (the key difference to many other statistical classification problems is the inherent unbalanced nature of outlier detection). Semi-supervised anomaly detection techniques construct a model representing normal behavior from a given normal training data set, and then testing the likelihood of a test instance to be generated by the learnt model.
Some methods to screen outliers are z-scores, modified z-score, box plots, Grubb’s test, Tietjen-Moore test exponential smoothing, Kimber test for exponential distribution and moving window filter algorithm. However two of the robust methods in detail are: Inter Quartile Range ,
Tukey Method

Explain what regularization is and why it is useful?
Regularization is the process of adding a tuning parameter to a model to induce smoothness in order to prevent overfitting. L1 or L2 norm could be used.

How would you validate a model you created to generate a predictive model of a quantitative outcome variable using multiple regression.

If the values predicted by the model are far outside of the response variable range, this would immediately indicate poor estimation or model inaccuracy.
If the values seem to be reasonable, examine the parameters; any of the following would indicate poor estimation or multi-collinearity: opposite signs of expectations, unusually large or small values, or observed inconsistency when the model is fed new data.
Use the model for prediction by feeding it new data, and use the coefficient of determination (R squared) as a model validity measure.
Use data splitting to form a separate dataset for estimating model parameters, and another for validating predictions.
Use jackknife resampling if the dataset contains a small number of instances, and measure validity with R squared and mean squared error (MSE).

What is overfitting and how to avoid it?
Overfitting is when you build a predictive model that fits the data “too closely”, so that it captures the random noise in the data rather than true patterns. As a result, the model predictions will be wrong when applied to new data.

We frequently hear about studies that report unusual results (especially if you listen to Wait Wait Don’t Tell Me) , or see findings like “an orange used car is least likely to be a lemon”, or learn that studies overturn previous established findings (eggs are no longer bad for you).

Many such studies produce questionable results that cannot be repeated.

This is a big problem, especially in social sciences or medicine, when researchers frequently commit the cardinal sin of Data Science – Overfitting the data.

he researchers test too many hypotheses without proper statistical control, until they happen to find something interesting. Then they report it. Not surprisingly, next time the effect (which was partly due to chance) will be much smaller or absent.

These flaws of research practices were identified and reported by John P. A. Ioannidis in his landmark paper Why Most Published Research Findings Are False (PLoS Medicine, 2005). Ioannidis found that very often either the results were exaggerated or the findings could not be replicated. In his paper, he presented statistical evidence that indeed most claimed research findings are false!

Ioannidis noted that in order for a research finding to be reliable, it should have:

Large sample size and with large effects
Greater number of and lesser selection of tested relationship
Greater flexibility in designs, definitions, outcomes, and analytical modes
Minimal bias due to financial and other factors (including popularity of that scientific field)

How can you prove that one improvement you’ve brought to an algorithm is really an improvement over not doing anything?

Often it is observed that in the pursuit of rapid innovation (aka “quick fame”), the principles of scientific methodology are violated leading to misleading innovations, i.e. appealing insights that are confirmed without rigorous validation. One such scenario is the case that given the task of improving an algorithm to yield better results, you might come with several ideas with potential for improvement.

An obvious human urge is to announce these ideas ASAP and ask for their implementation. When asked for supporting data, often limited results are shared, which are very likely to be impacted by selection bias (known or unknown) or a misleading global minima (due to lack of appropriate variety in test data).

Data scientists do not let their human emotions overrun their logical reasoning. While the exact approach to prove that one improvement you’ve brought to an algorithm is really an improvement over not doing anything would depend on the actual case at hand, there are a few common guidelines:
Ensure that there is no selection bias in test data used for performance comparison
Ensure that the test data has sufficient variety in order to be symbolic of real-life data (helps avoid overfitting)
Ensure that “controlled experiment” principles are followed i.e. while comparing performance, the test environment (hardware, etc.) must be exactly the same while running original algorithm and new algorithm
Ensure that the results are repeatable with near similar results
Examine whether the results reflect local maxima/minima or global maxima/minima

What is root cause analysis?

Root cause analysis (RCA) is a method of problem solving used for identifying the root causes of faults or problems. A factor is considered a root cause if removal thereof from the problem-fault-sequence prevents the final undesirable event from recurring; whereas a causal factor is one that affects an event’s outcome, but is not a root cause.
Root cause analysis was initially developed to analyze industrial accidents, but is now widely used in other areas, such as healthcare, project management, or software testing.

Here is a useful Root Cause Analysis Toolkit from the state of Minnesota.

Essentially, you can find the root cause of a problem and show the relationship of causes by repeatedly asking the question, “Why?”, until you find the root of the problem. This technique is commonly called “5 Whys”, although is can be involve more or less than 5 questions.

What is statistical power?
Statistical power is the likelihood that a study will detect an effect when the effect is present. The higher the statistical power, the less likely you are to make a Type II error (concluding there is no effect when, in fact, there is).

Explain what resampling methods are and why they are useful. Also explain their limitations.
Classical statistical parametric tests compare observed statistics to theoretical sampling distributions. Resampling a data-driven, not theory-driven methodology which is based upon repeated sampling within the same sample.

Resampling refers to methods for doing one of these
Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (jackknifing) or drawing randomly with replacement from a set of data points (bootstrapping)
Exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests)
Validating models by using random subsets (bootstrapping, cross validation)

Is it better to have too many false positives, or too many false negatives? Explain.
It depends on the question as well as on the domain for which we are trying to solve the question.

In medical testing, false negatives may provide a falsely reassuring message to patients and physicians that disease is absent, when it is actually present. This sometimes leads to inappropriate or inadequate treatment of both the patient and their disease. So, it is desired to have too many false positive.

For spam filtering, a false positive occurs when spam filtering or spam blocking techniques wrongly classify a legitimate email message as spam and, as a result, interferes with its delivery. While most anti-spam tactics can block or filter a high percentage of unwanted emails, doing so without creating significant false-positive results is a much more demanding task. So, we prefer too many false negatives over many false positives.

What is selection bias, why is it important and how can you avoid it?
Selection bias, in general, is a problematic situation in which error is introduced due to a non-random population sample. For example, if a given sample of 100 test cases was made up of a 60/20/15/5 split of 4 classes which actually occurred in relatively equal numbers in the population, then a given model may make the false assumption that probability could be the determining predictive factor. Avoiding non-random samples is the best way to deal with bias; however, when this is impractical, techniques such as resampling, boosting, and weighting are strategies which can be introduced to help deal with the situation.

What is the curse of dimensionality?
Bottom line is, the data that can be represented using 10 space units of one true dimension, needs 1000 space units after adding 2 more dimensions just because we observed these dimensions during the experiment. The true dimension means the dimension which accurately generalize the data and observed dimensions means whatever other dimensions we consider in dataset which may or may not contribute to accurately generalize the data.

What is the difference between “long” (“tall”) and “wide” format data?
In most data mining / data science applications there are many more records (rows) than features (columns) – such data is sometimes called “tall” (or “long”) data.

In some applications like genomics or bioinformatics you may have only a small number of records (patients), eg 100, but perhaps 20,000 observations for each patient. The standard methods that work for “tall” data will lead to overfitting the data, so special approaches are needed.

More dimensions -> More training data required

How would you use either the extreme value theory, Monte Carlo simulations or mathematical statistics (or anything else) to correctly estimate the chance of a very rare event?

Extreme value theory (EVT) focuses on rare events and extremes, as opposed to classical approaches to statistics which concentrate on average behaviors. EVT states that there are 3 types of distributions needed to model the the extreme data points of a collection of random observations from some distribution: the Gumble, Frechet, and Weibull distributions, also known as the Extreme Value Distributions (EVD) 1, 2, and 3, respectively.

The EVT states that, if you were to generate N data sets from a given distribution, and then create a new dataset containing only the maximum values of these N data sets, this new dataset would only be accurately described by one of the EVD distributions: Gumbel, Frechet, or Weibull. The Generalized Extreme Value Distribution (GEV) is, then, a model combining the 3 EVT models as well as the EVD model.

Knowing the models to use for modeling our data, we can then use the models to fit our data, and then evaluate. Once the best fitting model is found, analysis can be performed, including calculating possibilities.

What is a recommendation engine? How does it work?

We are all familiar now with recommendations from Netflix – “Other Movies you might enjoy” or from Amazon – Customers who bought X also bought Y.

Collaborative filtering methods build a model based on users past behavior (items previously purchased, movies viewed and rated, etc) and use decisions made by current and other users. This model is then used to predict items (or ratings for items) that the user may be interested in.

Content-based filtering methods use features of an item to recommend additional items with similar properties. These approaches are often combined in Hybrid Recommender Systems.

How can you determine which features are the most important in your model?
In applied machine learning, success depends significantly on the quality of data representation (features). Highly correlated features can make learning/sorting steps in the classification module easy. Conversely if label classes are a very complex function of the features, it can be impossible to build a good model. Thus a so-called feature engineering, a process of transforming data into features that are most relevant to the problem, is often needed.
A feature selection scheme often involves techniques to automatically select salient features from a large exploratory feature pool. Redundant and irrelevant features are well known to cause poor accuracy so discarding these features should be the first task. The relevance is often scored using mutual information calculation. Furthermore, input features should thus offer a high level of discrimination between classes. The separability of features can be measured by distance or variance ratio between classes. One recent work [Pham 2016] proposed a systematic voting based feature selection that is a data-driven approach incorporating above criteria. This can be used as a common framework for a wide class of problems.A data-driven feature selection approach incorporating several saliency criteria [Pham 2016].
Another approach is penalizing on the features that are not very important (e.g., yield a high error metric) when using regularization methods like Lasso or Ridge.

What is the idea behind ensemble learning?
In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.
There are main 3 widely used ensembles techniques:


Bagging (stands for Bootstrap Aggregation) is the way decrease the variance of your prediction by generating additional data for training from your original dataset using combinations with repetitions to produce multisets of the same cardinality/size as your original data. By increasing the size of your training set you can’t improve the model predictive force, but just decrease the variance, narrowly tuning the prediction to expected outcome.


Boosting is a two-step approach, where one first uses subsets of the original data to produce a series of averagely performing models and then “boosts” their performance by combining them together using a particular cost function (=majority vote). Unlike bagging, in the classical boosting the subset creation is not random and depends upon the performance of the previous models: every new subsets contains the elements that were (likely to be) misclassified by previous models. It is an iterative technique which adjust the weight of an observation based on the last classification. If an observation was classified incorrectly, it tries to increase the weight of this observation and vice versa, but what is the way to increase the weights.


Stacking is a similar to boosting: you also apply several models to your original data. The difference here is, however, that you don’t have just an empirical formula for your weight function, rather you introduce a meta-level and use another model/approach to estimate the input together with outputs of every model to estimate the weights or, in other words, to determine what models perform well and what badly given these input is a way of combining models and it is used for a learner to combine output from different learners. This can lead to decrease in either bias or variance error depending on the combining learner we use.


In unsupervised learning, if a ground truth about a dataset is unknown, how can we determine the most useful number of clusters to be?

The Elbow Method – The elbow method is interested in explaining variance as a function of cluster numbers (the k in k-means). Percentage of variance explained is the ratio of the between-group variance to the total variance, also known as an F-test.17-qa-2-image04.png

The Silhouette Method – The silhouette method measures the similarity of an object to its own cluster — called cohesion — when compared to other clusters — called separation.


Linear Regression- Estimate parameters with least square or maximum likelihood?

Under normal error assumption, as is typically assumed in linear regression, the MLE and the LSE are the same!

They are different in a sense that OLS finds parameters that minimise the square distance between the predicted and actual data point whilst MLE finds parameters that make the model plausible.


You are given a train data set having 1000 columns and 1 million rows. The data set is based on a classification problem. Your manager has asked you to reduce the dimension of this data so that model computation time can be reduced. Your machine has memory constraints. What would you do? (You are free to make practical assumptions.)

Processing a high dimensional data on a limited memory machine is a strenuous task, your interviewer would be fully aware of that. Following are the methods you can use to tackle such situation:

Since we have lower RAM, we should close all other applications in our machine, including the web browser, so that most of the memory can be put to use.
We can randomly sample the data set. This means, we can create a smaller data set, let’s say, having 1000 variables and 300000 rows and do the computations.
To reduce dimensionality, we can separate the numerical and categorical variables and remove the correlated variables. For numerical variables, we’ll use correlation. For categorical variables, we’ll use chi-square test.
Also, we can use PCA and pick the components which can explain the maximum variance in the data set.
Using online learning algorithms like Vowpal Wabbit (available in Python) is a possible option.
Building a linear model using Stochastic Gradient Descent is also helpful.
We can also apply our business understanding to estimate which all predictors can impact the response variable. But, this is an intuitive approach, failing to identify useful predictors might result in significant loss of information.

Is rotation necessary in PCA? If yes, Why? What will happen if you don’t rotate the components?

Yes, rotation (orthogonal) is necessary because it maximizes the difference between variance captured by the component. This makes the components easier to interpret. Not to forget, that’s the motive of doing PCA where, we aim to select fewer components (than features) which can explain the maximum variance in the data set. By doing rotation, the relative location of the components doesn’t change, it only changes the actual coordinates of the points.

If we don’t rotate the components, the effect of PCA will diminish and we’ll have to select more number of components to explain variance in the data set.

You are given a data set on cancer detection. You’ve build a classification model and achieved an accuracy of 96%. Why shouldn’t you be happy with your model performance? What can you do about it?

If you have worked on enough data sets, you should deduce that cancer detection results in imbalanced data. In an imbalanced data set, accuracy should not be used as a measure of performance because 96% (as given) might only be predicting majority class correctly, but our class of interest is minority class (4%) which is the people who actually got diagnosed with cancer. Hence, in order to evaluate model performance, we should use Sensitivity (True Positive Rate), Specificity (True Negative Rate), F measure to determine class wise performance of the classifier.

Why is naive Bayes so ‘naive’ ?

Naive Bayes is so ‘naive’ because it assumes that all of the features in a data set are equally important and independent. As we know, these assumption are rarely true in real world scenario.

Explain prior probability, likelihood and marginal likelihood in context of naiveBayes algorithm?

Prior probability is nothing but, the proportion of dependent (binary) variable in the data set. It is the closest guess you can make about a class, without any further information. For example: In a data set, the dependent variable is binary (1 and 0). The proportion of 1 (spam) is 70% and 0 (not spam) is 30%. Hence, we can estimate that there are 70% chances that any new email would  be classified as spam.

Likelihood is the probability of classifying a given observation as 1 in presence of some other variable. For example: The probability that the word ‘FREE’ is used in previous spam message is likelihood. Marginal likelihood is, the probability that the word ‘FREE’ is used in any message.

You are assigned a new project which involves helping a food delivery company save more money. The problem is, company’s delivery team aren’t able to deliver food on time. As a result, their customers get unhappy. And, to keep them happy, they end up delivering food for free. Which machine learning algorithm can save them?

You might have started hopping through the list of ML algorithms in your mind. But, wait! Such questions are asked to test your machine learning fundamentals.

This is not a machine learning problem. This is a route optimization problem. A machine learning problem consist of three things:

  1. There exist a pattern.
  2. You cannot solve it mathematically (even by writing exponential equations).
  3. You have data on it.

Always look for these three factors to decide if machine learning is a tool to solve a particular problem.

You came to know that your model is suffering from low bias and high variance. Which algorithm should you use to tackle it? Why?


Low bias occurs when the model’s predicted values are near to actual values. In other words, the model becomes flexible enough to mimic the training data distribution. While it sounds like great achievement, but not to forget, a flexible model has no generalization capabilities. It means, when this model is tested on an unseen data, it gives disappointing results.

In such situations, we can use bagging algorithm (like random forest) to tackle high variance problem. Bagging algorithms divides a data set into subsets made with repeated randomized sampling. Then, these samples are used to generate  a set of models using a single learning algorithm. Later, the model predictions are combined using voting (classification) or averaging (regression).

Also, to combat high variance, we can:

  1. Use regularization technique, where higher model coefficients get penalized, hence lowering model complexity.
  2. Use top n features from variable importance chart. May be, with all the variable in the data set, the algorithm is having difficulty in finding the meaningful signal.

The bias is error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting- think of bias as being introduced at model selection).The variance is error from sensitivity to small fluctuations in the training set. High variance can cause overfitting: modeling the random noise in the training data, rather than the intended outputs

You are given a data set. The data set contains many variables, some of which are highly correlated and you know about it. Your manager has asked you to run PCA. Would you remove correlated variables first? Why?

Chances are, you might be tempted to say No, but that would be incorrect. Discarding correlated variables have a substantial effect on PCA because, in presence of correlated variables, the variance explained by a particular component gets inflated.

For example: You have 3 variables in a data set, of which 2 are correlated. If you run PCA on this data set, the first principal component would exhibit twice the variance than it would exhibit with uncorrelated variables. Also, adding correlated variables lets PCA put more importance on those variable, which is misleading.

After spending several hours, you are now anxious to build a high accuracy model. As a result, you build 5 GBM(gradient boosting model) models, thinking a boosting algorithm would do the magic. Unfortunately, neither of models could perform better than benchmark score. Finally, you decided to combine those models. Though, ensembled models are known to return high accuracy, but you are unfortunate. Where did you miss?

As we know, ensemble learners are based on the idea of combining weak learners to create strong learners. But, these learners provide superior result when the combined models are uncorrelated. Since, we have used 5 GBM models and got no accuracy improvement, suggests that the models are correlated. The problem with correlated models is, all the models provide same information.

For example: If model 1 has classified User1122 as 1, there are high chances model 2 and model 3 would have done the same, even if its actual value is 0. Therefore, ensemble learners are built on the premise of combining weak uncorrelated models to obtain better predictions

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.

How is kNN different from kmeans clustering?

Don’t get mislead by ‘k’ in their names. You should know that the fundamental difference between both these algorithms is, kmeans is unsupervised in nature and kNN is supervised in nature. kmeans is a clustering algorithm. kNN is a classification (or regression) algorithm.

kmeans algorithm partitions a data set into clusters such that a cluster formed is homogeneous and the points in each cluster are close to each other. The algorithm tries to maintain enough separability between these clusters. Due to unsupervised nature, the clusters have no labels.

kNN algorithm tries to classify an unlabeled observation based on its k (can be any number ) surrounding neighbors. It is also known as lazy learner because it involves minimal training of model. Hence, it doesn’t use training data to make generalization on unseen data set.

After analyzing the model, your manager has informed that your regression model is suffering from multicollinearity. How would you check if he’s true? Without losing any information, can you still build a better model?

To check multicollinearity, we can create a correlation matrix to identify & remove variables having correlation above 75% (deciding a threshold is subjective). In addition, we can use calculate VIF (variance inflation factor) to check the presence of multicollinearity. VIF value <= 4 suggests no multicollinearity whereas a value of >= 10 implies serious multicollinearity. Also, we can use tolerance as an indicator of multicollinearity.

But, removing correlated variables might lead to loss of information. In order to retain those variables, we can use penalized regression models like ridge or lasso regression. Also, we can add some random noise in correlated variable so that the variables become different from each other. But, adding noise might affect the prediction accuracy, hence this approach should be carefully used.

When is Ridge regression favorable over Lasso regression?

You can quote ISLR’s authors Hastie, Tibshirani who asserted that, in presence of few variables with medium / large sized effect, use lasso regression. In presence of many variables with small / medium sized effect, use ridge regression.

Conceptually, we can say, lasso regression (L1) does both variable selection and parameter shrinkage, whereas Ridge regression only does parameter shrinkage and end up including all the coefficients in the model. In presence of correlated variables, ridge regression might be the preferred choice. Also, ridge regression works best in situations where the least square estimates have higher variance. Therefore, it depends on our model objective.

Rise in global average temperature led to decrease in number of pirates around the world. Does that mean that decrease in number of pirates caused the climate change?

After reading this question, you should have understood that this is a classic case of “causation and correlation”. No, we can’t conclude that decrease in number of pirates caused the climate change because there might be other factors (lurking or confounding variables) influencing this phenomenon.

Therefore, there might be a correlation between global average temperature and number of pirates, but based on this information we can’t say that pirated died because of rise in global average temperature.

While working on a data set, how do you select important variables? Explain your methods.

Following are the methods of variable selection you can use:

  1. Remove the correlated variables prior to selecting important variables
  2. Use linear regression and select variables based on p values (???)
  3. Use Forward Selection, Backward Selection, Stepwise Selection
  4. Use Random Forest, Xgboost and plot variable importance chart
  5. Use Lasso Regression
  6. Measure information gain for the available set of features and select top n features accordingly.

What is the difference between covariance and correlation?

Correlation is the standardized form of covariance.

Covariances are difficult to compare. For example: if we calculate the covariances of salary ($) and age (years), we’ll get different covariances which can’t be compared because of having unequal scales. To combat such situation, we calculate correlation to get a value between -1 and 1, irrespective of their respective scale.

Is it possible capture the correlation between continuous and categorical variable? If yes, how?

Yes, we can use ANCOVA (analysis of covariance) technique to capture association between continuous and categorical variables.

Both being tree based algorithm, how is random forest different from Gradient boosting algorithm (GBM)?

The fundamental difference is, random forest uses bagging technique to make predictions. GBM uses boosting techniques to make predictions.

In bagging technique, a data set is divided into n samples using randomized sampling. Then, using a single learning algorithm a model is build on all samples. Later, the resultant predictions are combined using voting or averaging. Bagging is done is parallel. In boosting, after the first round of predictions, the algorithm weighs misclassified predictions higher, such that they can be corrected in the succeeding round. This sequential process of giving higher weights to misclassified predictions continue until a stopping criterion is reached.

Random forest improves model accuracy by reducing variance (mainly). The trees grown are uncorrelated to maximize the decrease in variance. On the other hand, GBM improves accuracy my reducing both bias and variance in a model.

We know that one hot encoding increasing the dimensionality of a data set. But, label encoding doesn’t. How ?

One hot encoding transforms categorical features to a format that works better with classification and regression algorithms.


Label encoding – normalize labels such that they contain only values between 0 and n_classes-1. It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels.

Don’t get baffled at this question. It’s a simple question asking the difference between the two.

Using one hot encoding, the dimensionality (a.k.a features) in a data set get increased because it creates a new variable for each level present in categorical variables. For example: let’s say we have a variable ‘color’. The variable has 3 levels namely Red, Blue and Green. One hot encoding ‘color’ variable will generate three new variables as Color.RedColor.Blue and Color.Green containing 0 and 1 value.

In label encoding, the levels of a categorical variables gets encoded as 0 and 1, so no new variable is created. Label encoding is majorly used for binary variables.

What cross validation technique would you use on time series data set? Is it k-fold or LOOCV?


In time series problem, k fold can be troublesome because there might be some pattern in year 4 or 5 which is not in year 3. Resampling the data set will separate these trends, and we might end up validation on past years, which is incorrect. Instead, we can use forward chaining strategy with 5 fold as shown below:

  • fold 1 : training [1], test [2]
  • fold 2 : training [1 2], test [3]
  • fold 3 : training [1 2 3], test [4]
  • fold 4 : training [1 2 3 4], test [5]
  • fold 5 : training [1 2 3 4 5], test [6]

where 1,2,3,4,5,6 represents “year”.

You are given a data set consisting of variables having more than 30% missing values? Let’s say, out of 50 variables, 8 variables have missing values higher than 30%. How will you deal with them?

We can deal with them in the following ways:

  1. Assign a unique category to missing values, who knows the missing values might decipher some trend
  2. We can remove them blatantly.
  3. Or, we can sensibly check their distribution with the target variable, and if found any pattern we’ll keep those missing values and assign them a new category while removing others.

‘People who bought this, also bought…’ recommendations seen on amazon is a result of which algorithm?

The basic idea for this kind of recommendation engine comes from collaborative filtering.

Collaborative Filtering algorithm considers “User Behavior” for recommending items. They exploit behavior of other users and items in terms of transaction history, ratings, selection and purchase information. Other users behaviour and preferences over the items are used to recommend items to the new users. In this case, features of the items are not known.


This image shows an example of predicting of the user’s rating using collaborative filtering. At first, people rate different items (like videos, images, games). After that, the system is making predictions about user’s rating for an item, which the user hasn’t rated yet. These predictions are built upon the existing ratings of other users, who have similar ratings with the active user. For instance, in our case the system has made a prediction, that the active user won’t like the video.

What do you understand by Type I vs Type II error ?

Type I error is committed when the null hypothesis is true and we reject it, also known as a ‘False Positive’. Type II error is committed when the null hypothesis is false and we accept it, also known as ‘False Negative’.

In the context of confusion matrix, we can say Type I error occurs when we classify a value as positive (1) when it is actually negative (0). Type II error occurs when we classify a value as negative (0) when it is actually positive(1).

You are working on a classification problem. For validation purposes, you’ve randomly sampled the training data set into train and validation. You are confident that your model will work incredibly well on unseen data since your validation accuracy is high. However, you get shocked after getting poor test accuracy. What went wrong?

In case of classification problem, we should always use stratified sampling instead of random sampling. A random sampling doesn’t takes into consideration the proportion of target classes. On the contrary, stratified sampling helps to maintain the distribution of target variable in the resultant distributed samples also. Stratification is the process of dividing members of the population into homogeneous subgroups before sampling. The strata should be mutually exclusive: every element in the population must be assigned to only one stratum. The strata should also be collectively exhaustive: no population element can be excluded. Then simple random sampling or systematic sampling is applied within each stratum. This often improves the representativeness of the sample by reducing sampling error. Stratified sampling is not useful when the population cannot be exhaustively partitioned into disjoint subgroups. It would be a misapplication of the technique to make subgroups’ sample sizes proportional to the amount of data available from the subgroups, rather than scaling sample sizes to subgroup sizes (or to their variances, if known to vary significantly e.g. by means of an F Test).


You have been asked to evaluate a regression model based on R², adjusted R² and tolerance. What will be your criteria?

Tolerance (1 / VIF) Variance Inflator Factor is used as an indicator of multicollinearity. It is an indicator of percent of variance in a predictor which cannot be accounted by other predictors. Large values of tolerance is desirable.  VIF provides an index that measures how much the variance (the square of the estimate’s standard deviation) of an estimated regression coefficient is increased because of collinearity.

We will consider adjusted R² as opposed to R² to evaluate model fit because R² increases irrespective of improvement in prediction accuracy as we add more variables. But, adjusted R² would only increase if an additional variable improves the accuracy of model, otherwise stays same. It is difficult to commit a general threshold value for adjusted R² because it varies between data sets. For example: a gene mutation data set might result in lower adjusted R² and still provide fairly good predictions, as compared to a stock market data where lower adjusted R² implies that model is not good.

“R squared”, is a number that indicates the proportion of the variance in the dependent variable that is predictable from the independent variable(s). The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model.

In k-means or kNN, we use euclidean distance to calculate the distance between nearest neighbors. Why not manhattan distance ?

We don’t use manhattan distance because it calculates distance horizontally or vertically only. It has dimension restrictions. On the other hand, euclidean metric can be used in any space to calculate distance. Since, the data points can be present in any dimension, euclidean distance is a more viable option.

Example: Think of a chess board, the movement made by a bishop or a rook is calculated by manhattan distance because of their respective vertical & horizontal movements.

Explain machine learning to me like a 5 year old.

It’s just like how babies learn to walk. Every time they fall down, they learn (unconsciously) & realize that their legs should be straight and not in a bend position. The next time they fall down, they feel pain. They cry. But, they learn ‘not to stand like that again’. In order to avoid that pain, they try harder. To succeed, they even seek support from the door or wall or anything near them, which helps them stand firm.

This is how a machine works & develops intuition from its environment.

I know that a linear regression model is generally evaluated using Adjusted R² or F value. How would you evaluate a logistic regression model?

We can use the following methods:

  • Since logistic regression is used to predict probabilities, we can use AUC-ROC (Area Under the Receiver Operating Characteristic curve) curve along with confusion matrix to determine its performance.The AUROC has several equivalent interpretations:The expectation that a uniformly drawn random positive is ranked before a uniformly drawn random negative.
    The expected proportion of positives ranked before a uniformly drawn random negative.
    The expected true positive rate if the ranking is split just before a uniformly drawn random negative.
    The expected proportion of negatives ranked after a uniformly drawn random positive.
    The expected false positive rate if the ranking is split just after a uniformly drawn random positive.Computing the AUROCAssume we have a probabilistic, binary classifier such as logistic regression.Before presenting the ROC curve (= Receiver Operating Characteristic curve), the concept of confusion matrix must be understood. When we make a binary prediction, there can be 4 types of outcomes:We predict 0 while we should have the class is actually 0: this is called a True Negative, i.e. we correctly predict that the class is negative (0). For example, an antivirus did not detect a harmless file as a virus .
    We predict 0 while we should have the class is actually 1: this is called a False Negative, i.e. we incorrectly predict that the class is negative (0). For example, an antivirus failed to detect a virus.
    We predict 1 while we should have the class is actually 0: this is called a False Positive, i.e. we incorrectly predict that the class is positive (1). For example, an antivirus considered a harmless file to be a virus.
    We predict 1 while we should have the class is actually 1: this is called a True Positive, i.e. we correctly predict that the class is positive (1). For example, an antivirus rightfully detected a virus.
    To get the confusion matrix, we go over all the predictions made by the model, and count how many times each of those 4 types of outcomes occur:lQ12T.png

In this example of a confusion matrix, among the 50 data points that are classified, 45 are correctly classified and the 5 are misclassified.

Since to compare two different models it is often more convenient to have a single metric rather than several ones, we compute two metrics from the confusion matrix, which we will later combine into one:

  • True positive rate (TPR), aka. sensitivity, hit rate, and recall, which is defined as TPTP+FNTPTP+FN. Intuitively this metric corresponds to the proportion of positive data points that are correctly considered as positive, with respect to all positive data points. In other words, the higher TPR, the fewer positive data points we will miss.
  • False positive rate (FPR), aka. fall-out, which is defined as FPFP+TNFPFP+TN. Intuitively this metric corresponds to the proportion of negative data points that are mistakenly considered as positive, with respect to all negative data points. In other words, the higher FPR, the more negative data points we will missclassified.

To combine the FPR and the TPR into one single metric, we first compute the two former metrics with many different threshold (for example 0.00;0.01,0.02,,1.000.00;0.01,0.02,…,1.00) for the logistic regression, then plot them on a single graph, with the FPR values on the abscissa and the TPR values on the ordinate. The resulting curve is called ROC curve, and the metric we consider is the AUC of this curve, which we call AUROC.

The following figure shows the AUROC graphically:

9NpXJ.pngIn this figure, the blue area corresponds to the Area Under the curve of the Receiver Operating Characteristic (AUROC). The dashed line in the diagonal we present the ROC curve of a random predictor: it has an AUROC of 0.5. The random predictor is commonly used as a baseline to see whether the model is useful. The maximum AUC is 1, which corresponds to a perfect classifier. Larger AUC values indicate better classifier performance.

  • Also, the analogous metric of adjusted R² in logistic regression is AIC (Akaike information criterion). AIC is the measure of fit which penalizes model for the number of model coefficients. Therefore, we always prefer model with minimum AIC value.
  • Null Deviance indicates the response predicted by a model with nothing but an intercept. Lower the value, better the model. Residual deviance indicates the response predicted by a model on adding independent variables. Lower the value, better the model.Deviance is a measure of goodness of fit of a model. Higher numbers always indicates bad fit.The null deviance shows how well the response variable is predicted by a model that includes only the intercept (grand mean) where as residual with inclusion of independent variables.

How to decide if a sample comes from a population with a specific distribution?

The Kolmogorov–Smirnov test (K–S test or KS test) is a nonparametric test of the equality of continuous, one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution (one-sample K–S test), or to compare two samples (two-sample K–S test). The Kolmogorov–Smirnov statistic quantifies a distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution, or between the empirical distribution functions of two samples.

Considering the long list of machine learning algorithm, given a data set, how do you decide which one to use?

You should say, the choice of machine learning algorithm solely depends of the type of data. If you are given a data set which is exhibits linearity, then linear regression would be the best algorithm to use. If you given to work on images, audios, then neural network would help you to build a robust model.

If the data comprises of non linear interactions, then a boosting or bagging algorithm should be the choice. If the business requirement is to build a model which can be deployed, then we’ll use regression or a decision tree model (easy to interpret and explain) instead of black box algorithms like SVM, GBM etc.

In short, there is no one master algorithm for all situations. We must be scrupulous enough to understand which algorithm to use.

Do you suggest that treating a categorical variable as continuous variable would result in a better predictive model?

For better predictions, categorical variable can be considered as a continuous variable only when the variable is ordinal in nature. An ordinal variable is a categorical variable for which the possible values are ordered.

When does regularization becomes necessary in Machine Learning?

Regularization becomes necessary when the model begins to ovefit / underfit. This technique introduces a cost term for bringing in more features with the objective function. Hence, it tries to push the coefficients for many variables to zero and hence reduce cost term. This helps to reduce model complexity so that the model can become better at predicting (generalizing).

What do you understand by Bias Variance trade off?

The error emerging from any model can be broken down into three components mathematically. Following are these component :


Bias error is useful to quantify how much on an average are the predicted values different from the actual value. A high bias error means we have a under-performing model which keeps on missing important trends. Variance on the other side quantifies how are the prediction made on same observation different from each other. A high variance model will over-fit on your training population and perform badly on any observation beyond training.

OLS is to linear regression. Maximum likelihood is to logistic regression. Explain the statement.

OLS and Maximum likelihood are the methods used by the respective regression methods to approximate the unknown parameter (coefficient) value. In simple words, Ordinary least square(OLS) is a method used in linear regression which approximates the parameters resulting in minimum distance between actual and predicted values. Maximum Likelihood helps in choosing the the values of parameters which maximizes the likelihood that the parameters are most likely to produce observed data.

How to verify that your data are normally distributed?

Normality Test Summary
Shapiro-Wilk Common normality test, but does not work well with duplicated data or large sample sizes.
Kolmogorov-Smirnov For testing Gaussian distributions with specific mean and variance.
Lilliefors Kolmogorov-Smirnov test with corrected P. Best for symmetrical distributions with small sample sizes.
Anderson-Darling Can give better results for some datasets than Kolmogorov-Smirnov.
D’Agostino’s K-Squared Based on transformations of sample kurtosis and skewness. Especially effective for “non-normal” values.
Chen-Shapiro Extends Shapiro-Wilk test without loss of power. Supports limited sample size (10 ≤ n ≤ 2000).

What is a latent variable?

Latent variables are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured). Mathematical models that aim to explain observed variables in terms of latent variables are called latent variable models.

Sometimes latent variables correspond to aspects of physical reality, which could in principle be measured, but may not be for practical reasons. In this situation, the term hidden variables is commonly used (reflecting the fact that the variables are “really there”, but hidden). Other times, latent variables correspond to abstract concepts, like categories, behavioral or mental states, or data structures.

One advantage of using latent variables is that it reduces the dimensionality of data. A large number of observable variables can be aggregated in a model to represent an underlying concept, making it easier to understand the data. In this sense, they serve a function similar to that of scientific theories.

Common methods for inferring latent variables

  • Hidden Markov models
  • Factor analysis
  • Latent semantic analysis and Probabilistic latent semantic analysis
  • EM algorithms

Expectation Maximisation EM algorithm:

EM is an iterative method to find maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter-estimates are then used to determine the distribution of the latent variables in the next E step.

L2 norm vs L1 norm

L2 norm or Least square is more sensitive to outliers. It is the sum of square errors while L1 is the sum of absolute error.


Bayesian Analysis: Advantages and Disadvantages

Bayesian methods and classical methods both have advantages and disadvantages, and there are some similarities. When the sample size is large, Bayesian inference often provides results for parametric models that are very similar to the results produced by frequentist methods. Some advantages to using Bayesian analysis include the following:

  • It provides a natural and principled way of combining prior information with data, within a solid decision theoretical framework. You can incorporate past information about a parameter and form a prior distribution for future analysis. When new observations become available, the previous posterior distribution can be used as a prior. All inferences logically follow from Bayes’ theorem.
  • It provides inferences that are conditional on the data and are exact, without reliance on asymptotic approximation. Small sample inference proceeds in the same manner as if one had a large sample. Bayesian analysis also can estimate any functions of parameters directly, without using the “plug-in” method (a way to estimate functionals by plugging the estimated parameters in the functionals).
  • It obeys the likelihood principle. If two distinct sampling designs yield proportional likelihood functions for , then all inferences about  should be identical from these two designs. Classical inference does not in general obey the likelihood principle.
  • It provides interpretable answers, such as “the true parameter  has a probability of 0.95 of falling in a 95% credible interval.”
  • It provides a convenient setting for a wide range of models, such as hierarchical models and missing data problems. MCMC, along with other numerical methods, makes computations tractable for virtually all parametric models.

There are also disadvantages to using Bayesian analysis:

  • It does not tell you how to select a prior. There is no correct way to choose a prior. Bayesian inferences require skills to translate subjective prior beliefs into a mathematically formulated prior. If you do not proceed with caution, you can generate misleading results.
  • It can produce posterior distributions that are heavily influenced by the priors. From a practical point of view, it might sometimes be difficult to convince subject matter experts who do not agree with the validity of the chosen prior.
  • It often comes with a high computational cost, especially in models with a large number of parameters. In addition, simulations provide slightly different answers unless the same random seed is used. Note that slight variations in simulation results do not contradict the early claim that Bayesian inferences are exact. The posterior distribution of a parameter is exact, given the likelihood function and the priors, while simulation-based estimates of posterior quantities can vary due to the random number generator used in the procedures.

Microsoft Research PhD Summer School 2016

I was lucky enough (thanks to Dr. Matteo Venanzi) to be invited to Microsoft’s summer school that takes place in Cambridge, UK, every year (4-8 July 2016). This is a one week school that aims to familiarise PhD students with the work at Microsoft. Moreover, students get the opportunity to present their work to other students, communication experts and Microsoft researchers. The benefit of this is twofold. First, students get feedback on their work and their presentation skills and style. Also, it is an opportunity to meet new people and measure the impact of your work in terms of the interest your poster attracts. At the same time, it is an opportunity to evaluate whether your work has any relevance to anything that Microsoft’s is currently working on. The school focuses on early PhD students but late students, like me, attended as well.

In terms of the everyday schedule, as I already mentioned, we had some coaching on how to communicate our research but we also had some more technical lectures as well. For example, I particularly liked Prof. Bishop’s presentation (“The Road to General Artificial Intelligence”). He focused on the advancements of AI in the recent years and that the potential is still greater.

Overall, I believe this yearly summer school is a great initiative from Microsoft Research and I am glad I was a part of it.


Group-pic-2-2 copy.jpg

AAAI (Association for Advancement of Artificial Intelligence) 2016 Conference

Phoenix, Arizona, USA, 2016

The Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) was held in February 12–17 at the Phoenix Convention Center, Phoenix, Arizona, USA. I was lucky enough to attend.

First of all, it was my first time in the USA and I have to admit that I was impressed. The buildings, the streets, the coffee, the food, the life!

Concerning the conference itself, it attracted a lot of people and celebrity academics working in the area of Artificial Intelligence. The first couple of days I attended interesting tutorials. From the traditional heuristic search to the trendy and hot topic of Deep Learning!! I will just mention that the deep learning tutorial was meant to be for students or people who were not familiar with the concepts. Even though they started from scratch, they managed to discuss about complex concepts by the end. So, the speakers presented what is included in their new book (check the book on Amazon) which is available for pre-orders at the time of writing this post.

On the days of the actual conference I had the opportunity to attend a number of talks about machine learning, security and multi-agent systems. I also attended the educational panel where Russell Stuart and Peter Norvig  announced the release of  the 4th edition of their famous book: Artificial Intelligence: A Modern Approach. The book will include material on Deep Learning and Monte Carlo Tree Search and meta-heuristics.

The evening of the first day of the conference I attended an invited talk from Andreas Krause who is currently leading the Learning & Adaptive Systems Group at the ETH University. He covered a lot of application themes as  the presentation’s title was From Proteins to Robots: Learning to Optimize with Confidence. The focus though was on Bayesian Optimization and how submodularity can be handy in such approaches in terms of providing guarantees and bounds on the solutions provided.

The next day, I attended another important talk. Demis Hassabis was there, the CEO of DeepMind which was acquired by Google for hundreds of millions back in 2014. There, it was announced that DeepMind’s AlphaGo will face the world champion in the game of Go in mid March (starting on March 9th 2016) and that it will be live streamed. For more details click here. Beside advertising and promoting DeepMind’s brain child, Demis went in to some details about the algorithms they used, giving us  intuition on how they approached the problem, why and how their algorithm works. Concretely, he said that contrary to people’s belief, not everything in Deepmind has to do with Deep Learning, but instead they rely on reinforcement learning which is an area inspired by behavioural psychology and our attempt to understand how brain works and ultimately how we learn. Put simply and briefly, reinforcement learning is about rewarding behaviour that is beneficial and penalising behaviour that is detrimental in some context.

My talk was just after lunch on the 3rd day of the conference. I was in the Computational Sustainability session and I only had a spotlight talk, which means I had to talk for 2 minutes in order to advertise my work and invite people to the poster session in the evening where I could present and talk more about my work. The session was interesting overall. The key speaker in my opinion was Pascal Van Henteryck. He and his students had 2 or 3 presentations in that session. It was about evacuation plans in cities that are vulnerable to flooding and natural disasters. Pascal is very well-known in the area and I personally have attended his online optimization course in Coursera.

The evening of that day I presented my poster in the allocated session. A lot of people stopped by and asked about my work and I am glad about it.

All in all, AAAI was a very good experience and I feel lucky to have the opportunity to be there! If you want to know more details about it or talk about specific papers please get in touch with me.