AI Wiki.

A beginner’s guide to important topics in AI.


Accuracy is the degree of closeness of the predicted values to the actual values. It is computed as Accuracy = (TP + TN) / (TP + FP + FN + TN).
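The formula above can be sketched directly in Python (the function name and example counts are illustrative):

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + FP + FN + TN)."""
    return (tp + tn) / (tp + fp + fn + tn)

# 40 true positives, 45 true negatives, 5 false positives, 10 false negatives
# -> 85 correct predictions out of 100
print(accuracy(40, 45, 5, 10))  # 0.85
```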


Activation functions are scalar-to-scalar functions that yield a neuron's activation. They act like gates that determine whether the features from the previous layer should pass through the nodes. Activation functions make complex decision boundaries for features by operating on a combination of weights and biases applied to the input data.
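As a minimal sketch, two common activation functions applied to a neuron's weighted sum (the weights, bias, and inputs are made-up values for illustration):

```python
import math

def sigmoid(x):
    # squashes any real input into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    # passes positive inputs through unchanged, gates negative ones to zero
    return max(0.0, x)

# A neuron's activation: weighted sum of inputs plus bias, then the function
weights, bias = [0.5, -0.25], 0.1
inputs = [1.0, 2.0]
z = sum(w * i for w, i in zip(weights, inputs)) + bias  # 0.5 - 0.5 + 0.1 = 0.1
print(relu(z))  # 0.1
```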


Adaptive Boosting is a machine learning meta-algorithm used in conjunction with other algorithms in order to improve their performance. It can be used for binary as well as multi-class classification without reducing to binary sub-problems. Adaptive Boosting is also used for feature selection and dimensionality reduction, thereby improving execution time.


AdaDelta is an extension of AdaGrad that keeps only the most recent gradient history rather than accumulating all past gradients for optimization.


AdaGrad is an algorithm for gradient-based optimization. It is an adaptive learning rate method that uses sub-gradient information to dynamically control the learning rate of an optimization algorithm. It never increases the learning rate beyond the base learning rate.
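A minimal sketch of the AdaGrad update on a toy one-parameter objective f(x) = x² (the function name and constants are illustrative): the accumulated squared gradients only grow, so the effective step size only shrinks, matching the property that the learning rate never exceeds the base rate.

```python
def adagrad(x, lr=0.5, steps=100, eps=1e-8):
    g2 = 0.0  # running sum of squared gradients (never forgotten)
    for _ in range(steps):
        g = 2 * x                         # gradient of f(x) = x**2
        g2 += g * g                       # accumulate squared gradient
        x -= lr * g / (g2 ** 0.5 + eps)   # adaptively scaled step
    return x

print(adagrad(5.0))  # approaches the minimizer at 0
```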


Adam is an algorithm for first-order gradient-based optimization in which per-parameter learning rates are derived from estimates of lower-order moments of the gradients.


The Adjusted Rand Index is preferred when the partitions being compared have different numbers of clusters. The adjusted Rand index has a maximum value of 1 when the clusterings are identical, and its expected value under random labeling is 0, independently of the number of clusters.


Area under the curve (AUC) can be interpreted as the probability that a classifier model will rank a randomly chosen positive example higher than a randomly chosen negative example. A model whose predictions are all correct has an AUC of 1.0, and a model whose predictions are all wrong has an AUC of 0.0.
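The ranking interpretation above can be computed directly by counting positive/negative pairs (a sketch; the scores and labels are made-up):

```python
def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # a tie between a positive and a negative counts as half a win
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Every positive is scored above every negative -> AUC 1.0
print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0
```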


Artificial intelligence is the science and engineering of making intelligent machines that can solve generic problems. It mainly aims at implementing human intelligence in machines and at creating expert systems.


Association rules are used to uncover relationships between seemingly unrelated data in a relational database. Association rules are if/then statements consisting of an antecedent and a consequent: the former is an item found in the data, and the latter is an item found in combination with the antecedent. Association rule mining is a procedure meant to find patterns, correlations, and associations in data from various kinds of databases. Association rules use criteria such as support and confidence to identify the important relationships among the data: support defines how frequently the if/then relationship appears in the database, and confidence indicates how often the relationship is found to be true.
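Support and confidence for a single rule can be sketched over a toy transaction database (the item names and transactions are invented for illustration):

```python
# Toy transaction database for the rule {bread} -> {butter}
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
]

def support(itemset):
    # fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # of the transactions containing the antecedent, the fraction that
    # also contain the consequent
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "butter"}))       # 0.5  (2 of 4 transactions)
print(confidence({"bread"}, {"butter"}))  # 2 of the 3 bread transactions
```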


An autoencoder neural network is an unsupervised learning algorithm used to learn a compressed representation of a dataset, typically for dimensionality reduction. The input is compressed into a latent-space representation, and the output is then reconstructed from this representation. Two common variants of autoencoders are compression autoencoders and denoising autoencoders.


Backpropagation is a method used in artificial neural networks to reduce the network's error. It is often used with a gradient descent optimization algorithm to adjust the weights of the neurons by calculating the gradient of the loss function. The error is calculated at the output and propagated backwards through the network layers.


Bag of words is a text pre-processing algorithm used in NLP and information retrieval. A text document is represented as an unordered collection of words, each mapped to an index of a sparse vector so that it can be processed by ML algorithms. The sparse vector has an index for every word in the vocabulary. Hence the bag-of-words representation of a text records the number of occurrences of each word in a document.
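A minimal bag-of-words sketch (the documents and function name are illustrative; real pipelines would also lowercase, strip punctuation, and use sparse storage):

```python
def bag_of_words(docs):
    # map every vocabulary word to a fixed index
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for d in docs:
        v = [0] * len(vocab)
        for w in d.split():
            v[index[w]] += 1  # count occurrences; word order is discarded
        vectors.append(v)
    return vocab, vectors

vocab, vectors = bag_of_words(["the cat sat", "the cat ate the fish"])
print(vocab)    # ['ate', 'cat', 'fish', 'sat', 'the']
print(vectors)  # [[0, 1, 0, 1, 1], [1, 1, 1, 0, 2]]
```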


Biases are scalar values attached to neurons and added to the input, used to adjust the output. They allow the network to try new interpretations or behaviors. Biases are modified throughout the learning process.


Bicubic interpolation is an extension of cubic interpolation for interpolating data points on a 2D grid. It guarantees continuous first derivatives as well as cross-derivatives, but the second derivative may be discontinuous. Bicubic interpolation is chosen over bilinear interpolation in image processing because the resampled images are smoother and have fewer artifacts.


It is a tool used to identify words that appear consecutively within a document. By calculating the frequency of words and their appearance in the context of other words, collocations are found and then filtered to obtain useful terms. Each n-gram of words is then scored according to an association measure to determine whether the n-gram is a collocation.


Bilinear interpolation is an extended form of linear interpolation for interpolating functions of two variables on a rectilinear 2D grid. Linear interpolation is first performed in one direction and then in the other direction. Bilinear interpolation is a re-sampling technique that finds application in image processing.


Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is a hierarchical clustering algorithm that performs clustering over large datasets. It can perform clustering incrementally, without being given the whole dataset in advance. It also has the advantage of efficiently using the available memory to derive the finest sub-clusters, thereby minimizing I/O costs. Clustering is achieved in three steps: building a clustering feature (CF) tree, clustering the CF tree leaves by hierarchical clustering, and using the learned model to cluster the observations.


A BK-tree is a metric tree specifically adapted to discrete metric spaces, which finds application in approximate string matching against a dictionary. The auto-correct features in various software are implemented based on this data structure. A BK-tree consists of nodes and edges. Every node in the BK-tree has at most one child for each edit distance, and every insertion into the BK-tree starts from the root node.


Bootstrap is a technique for model validation in which statistical accuracy is assessed. It is the process of randomly selecting datasets from the training data with replacement, such that each sample is the same size as the training set. This process is repeated until k bootstrap datasets are obtained. The model is then refitted against each bootstrap dataset and its performance examined.
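The resampling step can be sketched with the standard library (the data values and function name are illustrative; a fixed seed keeps the example reproducible):

```python
import random

def bootstrap_samples(data, k, seed=0):
    """Draw k bootstrap datasets, each the same size as `data`,
    sampled with replacement."""
    rng = random.Random(seed)
    return [[rng.choice(data) for _ in data] for _ in range(k)]

data = [2, 4, 6, 8, 10]
samples = bootstrap_samples(data, k=3)
print(len(samples), len(samples[0]))  # 3 5
# A model would then be refitted on each sample and its performance compared.
```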


Clustering Large Applications based upon Randomized Search (CLARANS) is an efficient and effective algorithm for spatial data mining. It is a medoid-based algorithm in which a representative item, or medoid, is chosen for each cluster rather than the mean of the items. CLARANS searches a certain graph to find the k medoids from n objects. A node in the graph represents a set of objects selected as medoids. In each iteration a neighboring node is randomly examined, and a local optimum is reached when the randomly chosen neighbors are all worse. CLARANS has two parameters, maxNeighbor and the number of local minima: the higher the value of maxNeighbor, the closer CLARANS is to PAM (Partitioning Around Medoids).


Multidimensional scaling (MDS) is a technique that visualizes the structure of distance-like data as a geometrical picture. It is also known as principal coordinates analysis. Given an input matrix of dissimilarities between pairs of items, MDS finds a set of points in a low-dimensional space that approximates the dissimilarities. When the input matrix is of Euclidean type, MDS is equivalent to PCA.


Classification is the algorithmic procedure of assigning an input object, represented by a feature vector, to a category or class. For example, a classification model can determine whether an email belongs to the spam or non-spam class.


Clustering is the task of grouping observations with similar features into a cluster or subset. It is a common technique for statistical data analysis used in many fields such as machine learning and pattern recognition.


Conditional generative adversarial networks (CGANs) use class label information, allowing them to conditionally generate data of a specific class. In a CGAN both the generator and the discriminator are conditioned on some data y, which can be a class label or data from some other modality.


Conditional random field (CRF) is a sequence modeling technique used to predict a sequence of labels for a sequence of inputs. It is a type of discriminative undirected probabilistic graphical model that extends the principle of logistic regression by applying feature functions to sequential inputs. CRFs are usually trained by maximum likelihood learning and are widely used in NLP.


A classification matrix, or confusion matrix, is an error matrix that summarizes the performance of a classification model. It tabulates the performance in two rows and two columns based on the counts of true positives, true negatives, false positives and false negatives.
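The four counts can be tallied directly from true and predicted labels (a sketch with made-up labels; 1 is the positive class, 0 the negative class):

```python
def confusion_matrix(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

print(confusion_matrix([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # (2, 1, 1, 1)
```

These counts feed directly into metrics such as accuracy, precision, recall and fallout defined elsewhere in this guide.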


The main objective of a convolutional neural network is to learn higher-order features in the data via convolution. The CNN transforms the input data through all connected layers into a set of class scores given by the output layer. A CNN comprises an input layer, feature extraction layers such as convolution and pooling layers, and a classification layer.


A cover tree is a type of data structure designed to facilitate nearest neighbor search, especially in spaces with small intrinsic dimension. A cover tree on a dataset is a leveled tree where each level is indexed by an integer scale that decreases as the tree is descended.


Cross validation is a model validation technique for assessing the ability of a model to predict on new data. The model is first trained using a subset of the dataset and then evaluated using the test set. Multiple rounds of cross validation are performed using different partitions of the sample data in order to reduce variability. The validation results are then averaged over the rounds to estimate the model's predictive performance.
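The partitioning step of k-fold cross validation can be sketched as follows (the function name is illustrative; each fold serves once as the test set while the remaining indices form the training set):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k (train, test) partitions."""
    folds = []
    for i in range(k):
        test = list(range(i * n // k, (i + 1) * n // k))
        train = [j for j in range(n) if j not in test]
        folds.append((train, test))
    return folds

for train, test in k_fold_indices(6, 3):
    print(train, test)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```

A model would be fitted on each `train` split, scored on the matching `test` split, and the scores averaged.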


Cubic spline interpolation is a special case of spline interpolation in which a smoother interpolating polynomial is obtained, with smaller error than many other interpolating polynomials. Spline interpolation is a form of interpolation in which the interpolant is a piecewise polynomial called a spline function. It uses low-degree polynomials for the spline to reduce the interpolation error.


It is an algorithm used for generating date and time features. Generally, date attributes are represented as long values, which are considered insignificant in data mining, so features such as year, month and day are extracted for analysis.


Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a density-based data clustering algorithm that finds core samples of high density and expands clusters from them. DBSCAN requires two parameters: the neighborhood radius and the minimum number of points required to form a cluster. It first analyses the neighborhood of a data point; if it meets the requirements, a cluster is formed and the neighborhood is added, otherwise the point is labeled as noise. The number of clusters does not need to be known a priori.


A decision tree builds classification and regression models in the form of a tree structure. It is learned by a process called recursive partitioning, in which the training set is split based on attribute value tests. The tree is built from a root node and involves partitioning the data into subsets with similar values. The process is complete when the subset at a node has the same value of the target variable. Decision trees can handle both categorical and numerical data, and hence tree methods are widely used for data-mining tasks.


Deconvolutional networks have multiple stacked deconvolutional layers, where each layer is trained on the output of the previous layer. A deconvolutional layer performs an inverse of the convolution operation. It maps features to pixels when modeling images, which enables us to generate images as the output of neural networks.


The denoising autoencoder is a variation of the autoencoder. It is a technique used for feature selection and extraction when the input is corrupted. The denoising autoencoder handles corrupted data by randomly setting input values to zero during training, so that it learns to reconstruct the data from a corrupted input.


DBNs comprise layers of Restricted Boltzmann Machines (RBMs), which extract higher-level features from raw input vectors, together with a feed-forward network. The RBM layers are stacked for the pre-training phase and trained in a greedy manner; a feed-forward network is then used for the fine-tuning phase. DBNs are used to recognize, cluster and generate images, video sequences and motion-capture data.


DCGAN is a variant of the generative adversarial network used to generate new content. The DCGAN architecture uses a CNN for the discriminative model, while in the generator convolutions are replaced with up-convolutions.


Deep learning is a subset of machine learning. It deals with a set of algorithms, a combination of math and code, that loosely resembles the structure and function of neurons in the human brain. Deep learning can achieve human-level accuracy on tasks like image recognition, voice recognition and predictive analytics; it is basically machine perception. Deep neural networks map inputs to outputs by finding correlations between the two sets of data through hidden layers between the input and output layers, governed by a set of hyper-parameters.


The DENsity CLUstering algorithm (DENCLUE) is a special case of kernel density estimation used for clustering. Data points are clustered based on the local maxima of the estimated density function. The parameters associated with DENCLUE are sigma, a smoothing parameter, and m, which specifies the number of samples used in each iteration. DENCLUE generally does not work well on high-dimensional data, since data in a high-dimensional space tends to look uniformly distributed.


Deterministic annealing is a technique used for the non-convex optimization problem of clustering. By extending soft clustering with an annealing process, it avoids local minima of the cost function. The annealing process starts at a high temperature, and in each iteration the centroid vectors are updated until convergence is reached. As the temperature is lowered the vectors split, and the number of phases corresponds to the number of clusters. Further decreasing the temperature beyond a critical value leads to more splitting until all the vectors are separate.


Dimensionality reduction is also known as feature extraction. It can be defined as the process of transforming data from a high-dimensional space into a space of fewer dimensions. Dimensionality reduction techniques can be linear as well as non-linear.


DropConnect is a regularization technique in which randomly selected weights within the network are set to zero, so that a network with better generalization capability is obtained.


Dropout is a regularization technique used to improve the training of neural networks. A randomly selected subset of activations is set to zero within each layer during training, so that the network becomes less sensitive to the specific weights of individual neurons. This results in a network with better generalization that is less likely to overfit the training data.


Feature selection in machine learning can be done by integrating ensemble learning methods such as random forests, gradient boosted trees and AdaBoost. An ensemble combines the outputs of multiple models, and the importance method returns feature selection scores for which the higher the score, the better the feature. Among other feature selection methods, it has the great advantage of being able to handle stability issues.


Fallout, or false positive rate, corresponds to the Type I error. It is the ratio of incorrect positive predictions to the total number of negatives: FPR = FP/N = FP/(FP+TN). It can also be calculated as 1 - specificity. The best FPR is 0.0 and the worst value is 1.0.


The false discovery rate can be defined as the expected proportion of Type I errors. It is the ratio of false positives to the sum of false and true positives: FDR = FP/(FP+TP).


Features are individual measurable properties of an event, represented as a numeric feature vector. Feature engineering is an essential part of building intelligent systems. It is the process of creating feature vectors, using domain knowledge of the data, that makes the algorithms work.


Feature selection is the process of selecting a subset of relevant features for building a model. It enhances the performance of machine learning models by avoiding the curse of dimensionality and improving generalization by reducing overfitting. Feature selection algorithms can be categorized as feature ranking and subset selection. Feature ranking selects features based on a score, whereas subset selection produces an optimal subset of features. For larger feature sets, subset selection can be achieved by implementing genetic algorithms.


A feed-forward neural network is an artificial neural network in which the input signals flow in only one direction, such that the connections between nodes do not form a cycle.


Fisher's linear discriminant is a linear classifier that maximizes the ratio of between-class to within-class variance for class labeling. It finds a linear combination of features that can be used for dimensionality reduction before later classification. This method projects high-dimensional data onto a line, and classification is performed in this one-dimensional space.


Frequent itemsets play a vital role in data mining tasks that find interesting patterns in databases. Frequent itemset mining finds all the common sets of items, defined as those itemsets that occur at least a minimum number of times. There are several algorithms that find frequent itemsets by building prefix trees; one such algorithm is the FP-growth algorithm, which is based on a recursive elimination scheme.


The traditional F-score (F1) is the harmonic mean of precision and recall. The F1 score reaches its best value at 1 and its worst at 0. A positive real β can be included to weight the relative importance of recall versus precision, giving the Fβ score.
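The Fβ formula can be sketched directly (the function name and example values are illustrative; β = 1 recovers the ordinary F1 score):

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Perfect recall but only half the positive predictions correct
print(f_beta(0.5, 1.0))  # F1 = 2/3
```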


A Gaussian process is a stochastic process used for regression. It consists of a collection of random variables, indexed by time or space, with a normal distribution. A Gaussian process is an extension of the multivariate Gaussian to an infinite collection of real-valued variables. The reliability of the regression depends upon the covariance function. It can also be used as a prior probability distribution over functions in Bayesian inference for regression analysis. A machine learning algorithm using Gaussian processes employs a kernel function to measure the similarity between points, and matrix algebra is used to compute predictions via the technique of kriging.


The generalized Hebbian algorithm is a linear feed-forward neural network model for unsupervised learning. It is an adaptive method to find the eigenvectors of the covariance matrix corresponding to the k largest eigenvalues, and it finds application in principal component analysis. The learning is a single-layer process in which the change in synaptic weight depends on the inputs and outputs of that layer. It has a predictable trade-off between learning speed and accuracy of convergence, set by the learning rate η; in practice the learning rate is set to a small constant value.


In GANs, the generative network generates data, using a special kind of layer called a deconvolutional layer, while the discriminator network evaluates and discriminates between instances of the data. It is the discriminator that decides whether an instance of data belongs to the actual training dataset.


The genetic algorithm is a heuristic optimization method that generates solutions using techniques inspired by natural evolution. It involves fitness assignment, selection, recombination and mutation for each individual, and the best features are selected based on the value of the selection error. It is one of the most advanced methods for selecting the most useful features from a large set, but it requires a lot of computation.


G-means is another extended variation of k-means in which the number of clusters is determined based on a normality test. It takes a hierarchical approach to detecting the number of clusters: G-means runs k-means repeatedly with increasing values of k, testing whether the data in the neighborhood of a cluster centroid looks Gaussian, and if not, the cluster is split.


Gradient boosting is a machine learning technique for classification as well as regression that produces a prediction model in the form of an ensemble of weak prediction models. It is typically used in conjunction with decision trees. Gradient boosting involves a loss function to be optimized, a weak learner to make predictions, and an additive model that adds weak learners to minimize the loss function. The model's generalization capability can be improved by regularization techniques such as choosing an optimal number of boosting iterations, the shrinkage parameter that controls the learning rate, and the sampling rate for stochastic tree boosting.


Gradient descent is a first-order iterative algorithm used to update the parameters of a model. It finds the values of the parameters that minimize the loss function.
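A minimal sketch on a toy one-parameter loss f(x) = (x - 3)², whose minimum is at x = 3 (the function name, learning rate and step count are illustrative):

```python
def gradient_descent(x, lr=0.1, steps=200):
    for _ in range(steps):
        grad = 2 * (x - 3)  # derivative of the loss (x - 3)**2
        x -= lr * grad      # step in the direction opposite to the gradient
    return x

print(round(gradient_descent(0.0), 4))  # 3.0
```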


The growing neural gas is an incremental algorithm in which the number of clusters need not be provided a priori. It is capable of continuous learning, and unlike neural gas, its parameters do not change over time. It can add or delete nodes during execution based on local error measurements. It produces a graph describing the topology of the training data, in which each vertex corresponds to a neuron to which data has been mapped.


Hard tanh is another type of activation function, in which inputs of magnitude greater than 1 are mapped to 1 and inputs less than -1 to -1. The function can be mathematically expressed as f(x) = 1 if x > 1, -1 if x < -1, and x otherwise.


A hidden Markov model (HMM) is similar to a dynamic Bayesian network. It models the system as a Markov process with hidden states. Unlike in an ordinary Markov model, the state is not directly visible in an HMM, but the output, which depends on the state, is visible. Hence the sequence of output tokens obtained from an HMM gives information about the sequence of states. It finds application in reinforcement learning and pattern recognition.


Hierarchical clustering is a clustering algorithm that builds a hierarchy of clusters, like a tree, by successively splitting or merging them. Agglomerative clustering performs clustering in a bottom-up approach, in which each observation starts in its own cluster and clusters are then successively merged based on a linkage criterion.


Hyperparameters are the variables that govern the training process. Hyperparameter optimization, or tuning, is the process of choosing a set of optimal hyperparameters for a learning algorithm so as to minimize the loss function.


Interpolation is the process of estimating the value of a function at an intermediate value of the independent variable. Generally, it is the approximation of a complicated function by a simple function.


Isometric mapping (Isomap) is one of the earliest approaches to manifold learning. Isomap is a widely used low-dimensional embedding method that extends multidimensional scaling by incorporating the geodesic distances induced by a neighborhood graph. Geodesic distance is the sum of edge weights along the shortest path between two nodes. The Isomap algorithm comprises determining the neighbors of each point, constructing a neighborhood graph, computing the shortest paths between nodes, and then computing the lower-dimensional embedding. The connectivity of each data point in the neighborhood graph is defined by its k nearest Euclidean neighbors in the high-dimensional space.


Kruskal's multidimensional scaling is a non-metric MDS in which a non-parametric monotonic relationship between the dissimilarities and the Euclidean distances between items, together with the location of each item in the low-dimensional space, is found using isotonic regression. The non-metric MDS algorithm comprises finding the optimal monotonic transformation of the proximities and optimally arranging the configuration.


Principal component analysis can be extended to non-linear mappings of the data by using the kernel trick. Since a large dataset would yield a large kernel matrix, the data are clustered and the kernel is populated with the means of those clusters. Even this may yield a large kernel, so only the top P eigenvalues and eigenvectors of the kernel are computed.


A k-dimensional (KD) tree is a data structure used for organizing points in a space with k dimensions. It finds application in range search and nearest neighbor search. A KD tree is a binary search tree in which each node stores a k-dimensional point, recursively partitioning the parameter space along the data axes. A non-leaf node splits the space into two half-spaces: points with a smaller value than the node along the splitting axis are placed in the left subtree and larger ones in the right subtree. The process is repeated until the leaves contain only one element.


Keyword extraction is the process of extracting the relevant keywords from a document that best describe its subject. Such algorithms often rely on co-occurrence statistical information.


K-means is an unsupervised learning algorithm that finds application in clustering problems. The algorithm works iteratively to assign each data point to one of k clusters based on its features; observations nearest to the same mean belong to the same cluster. The number of clusters is set a priori, and new data can be labeled based on the cluster centroids, which hold the collection of feature values. Finding an exact solution to the k-means problem is NP-hard, so a standard approach that finds an approximate solution is usually employed.
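The standard iterative approximation (Lloyd's algorithm) can be sketched in one dimension; the points, starting centroids and function name are illustrative, and the centroids are seeded by hand for determinism:

```python
def k_means(points, centroids, iterations=10):
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:  # assignment step: each point joins its nearest centroid
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # update step: recompute each centroid as the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

# Two well-separated groups of points converge to centroids near 1 and 9
print(k_means([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], centroids=[0.0, 10.0]))
```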


Kriging interpolation is a method of interpolation in which interpolated values are modeled using the Gauss-Markov theorem, based on assumptions about covariances. It is applied to data points that are irregularly distributed in space. Kriging can be either an interpolation or a fitting method. The main objective of kriging is to estimate the value of an unknown real-valued function.


L1 regularization adds a penalty equal to the sum of the absolute values of the coefficients, shrinking the network parameters toward zero and preventing them from getting too big in any one dimension. L2 regularization adds a penalty equal to the sum of the squared values of the coefficients, forcing the parameters to be relatively small. By adding a regularization term, the network's generalization ability is improved, thereby preventing overfitting.
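The two penalty terms can be sketched directly (the weights and the strength lam, standing in for the usual λ, are made-up values; the penalty would be added to the training loss):

```python
def l1_penalty(weights, lam):
    # lambda * sum of absolute values of the coefficients
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    # lambda * sum of squared values of the coefficients
    return lam * sum(w * w for w in weights)

weights = [3.0, -4.0]
print(l1_penalty(weights, 0.5))  # 0.5 * (3 + 4)  = 3.5
print(l2_penalty(weights, 0.5))  # 0.5 * (9 + 16) = 12.5
```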


The Lancaster stemmer is a stemming algorithm in which a single table of rules is used for stemming suffixes. Each rule specifies the removal or replacement of an ending. The Lancaster stemmer is very strong and aggressive, and its implementation allows the user to use customized rules.


Laplace interpolation is a specialized interpolation method for retrieving missing data on a 2D grid. It generates the smoothest interpolant by solving a very sparse system of linear equations.


The Laplacian eigenmap uses a spectral technique to obtain a low-dimensional representation of a dataset, preserving local neighborhood information in a certain sense. It is insensitive to outliers and noise. The representation map can be viewed as a discrete approximation to a continuous map arising from the geometry of the manifold. Because the Laplacian eigenmap uses only local distances, it is not prone to short-circuiting.


The Least Absolute Shrinkage and Selection Operator (LASSO) is a penalized regression method used in machine learning for selecting a subset of variables. As the name indicates, it is a shrinkage and variable selection method for linear regression models. The objective is to obtain the subset of predictors that minimizes the prediction error. This is achieved by imposing a constraint on the model parameters that causes the regression coefficients of some variables to shrink towards zero; only the variables with non-zero regression coefficients remain associated with the response variable.


Layer size is defined as the number of neurons in a given layer. For the input layer, the number of neurons equals the number of features in the input vector; for the output layer, it is either a single neuron or a number of neurons matching the number of predicted classes.


The issue of the function being zero when the input is less than 0 is mitigated by the leaky ReLU, which has a small negative slope of around 0.01. It can be expressed as f(x) = x if x > 0, and 0.01x otherwise.


The learning rate is the rate at which the parameters are adjusted during optimization in order to minimize the error of the neural network's predictions.


Leave-one-out cross validation (LOOCV) is a model validation technique in which a single observation from the original dataset is used for validation and the remaining observations form the training set. The process is repeated until each observation has been used once for validation. LOOCV is computationally expensive, as it requires fitting as many models as there are observations in the training dataset.


The linear activation is basically the identity function, f(x) = x. It implies that the dependent variable has a direct, proportional relationship with the independent variable, so the input signal passes through the function unchanged.


LDA is based on Bayes' decision rule and is preferred when classification is to be done among multiple classes. It assumes a linear relationship between the dependent and independent variables, with the independent variables having equal variance in each class. LDA is commonly used as a dimensionality reduction technique in the pre-processing step of pattern recognition. The objective is to project the feature space onto a lower-dimensional space with good class separability in order to reduce computational costs.


Linear interpolation is the method of constructing new data points within the range of a discrete set of known data points using linear polynomials. It is also known as lerp. Lerp is quick and easy but not precise, and the interpolant is not differentiable at the control points.
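A minimal lerp sketch between two known points (the function name and sample points are illustrative):

```python
def lerp(x, x0, y0, x1, y1):
    """Linearly interpolate between (x0, y0) and (x1, y1) at x."""
    t = (x - x0) / (x1 - x0)   # fractional position between the two points
    return y0 + t * (y1 - y0)

# Halfway between (2, 4) and (3, 8) -> 6
print(lerp(2.5, 2.0, 4.0, 3.0, 8.0))  # 6.0
```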


Linear regression is a linear approach to modeling the relation between dependent and independent variables using a linear predictor function. Ordinary least squares (OLS) is a method for determining the unknown parameters of a linear regression model. The model specification is that the dependent variable is a linear combination of the parameters. OLS chooses the parameters by the principle of least squares, minimizing the sum of squared differences between the values predicted by the model and the true values of the dependent variable. The OLS technique can be applied to different frameworks depending on the nature of the data and the task to be performed. Once the model is constructed, the goodness of fit of the model and the significance of the estimated parameters are confirmed.
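For a single independent variable, the least-squares parameters have a closed form, sketched below (the function name and data points are illustrative):

```python
def ols_fit(xs, ys):
    """Closed-form OLS slope and intercept for simple linear regression."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx  # (slope, intercept)

# Points lying exactly on y = 2x + 1 recover those parameters
slope, intercept = ols_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(slope, intercept)  # 2.0 1.0
```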


Locally linear embedding (LLE) has faster optimization than the Isomap algorithm. It projects the data to a lower-dimensional space, preserving distances within local neighborhoods. LLE comprises stages such as computing the neighbors of each data point and constructing the weight matrix; the embedding is encoded in the eigenvectors corresponding to the largest eigenvalues.


Locality-sensitive hashing is an algorithm that finds application in nearest neighbor search and is used to identify duplicate or similar documents. It performs probabilistic dimension reduction of the data by hashing the input data points into buckets such that nearby data points are mapped into the same bucket while data points that are far from each other are likely to land in different buckets, making it easier to identify observations with various degrees of similarity.


Logistic regression is a generalized linear model mainly used for binomial regression. The odds of a certain event occurring are predicted by converting a dependent variable into a logit variable, and the logit of the probability of success is then fitted to the predictors. This can be used for categorical prediction by setting a cutoff value.
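
A minimal sketch with scikit-learn (the 1-D toy data is invented for illustration; the default cutoff of 0.5 is what `predict` applies):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 1-D data: class 0 for small x, class 1 for large x
X = np.array([[0.5], [1.0], [1.5], [3.5], [4.0], [4.5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.0], [4.2]]))   # hard labels via the 0.5 cutoff
print(clf.predict_proba([[2.5]]))    # fitted probabilities for each class
```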


A loss function is the measure of how well a prediction model is able to predict the expected outcome. It is the aggregate of the difference between the actual and predicted outputs over the entire dataset.


LSTM networks are recurrent neural networks composed of LSTM units. An LSTM unit consists of a cell, an input gate, an output gate and a forget gate, where the cell remembers values over arbitrary time intervals and the gates regulate the flow of information into and out of the cell. They are suited for classifying, processing and making predictions based on time series data.


Machine learning is an application of Artificial Intelligence which aims at giving machines the ability to learn or generalize from experience without being explicitly programmed. The basic premise is to build a general model that produces sufficiently accurate predictions on new cases. Machine learning tasks:
➢ Supervised learning: A set of features and their labels is fed in, and the objective is to learn the mapping from features to labels.
➢ Unsupervised learning: A set of features without any labels is fed in, and the goal is to learn from the features and discover hidden patterns in the data.
➢ Reinforcement learning: Learning by interacting with the environment in which a goal is pursued.


Manifold Learning is a non-linear dimensionality reduction technique used to uncover the manifold structure in datasets in order to find a low-dimensional representation of the data. The idea behind these algorithms is that the dimensionality of many datasets is only artificially high. Some prominent approaches to manifold learning are LLE, Laplacian eigenmaps, LTSA, etc.


The MaxEnt classifier is a discriminative classifier widely used in Natural Language Processing. It is a technique based on the principle of Maximum Entropy for learning probability distributions of data. As it does not make many assumptions about the features, it can be used when no prior information about the data is available. In NLP applications the document is represented by a sparse array indicating which words occur in it, and the classifier assigns the document to a given class.


Minimum Entropy Clustering is an iterative algorithm in which the clustering criterion is based on the conditional entropy H(C|x). Conforming to Fano's inequality, the cluster label C can be estimated with minimum probability of error if the conditional entropy is small. The criterion can be generalized by replacing it with Havrda-Charvat's structural α-entropy.


MAD (mean absolute deviation) is the average of the absolute residuals. It gives the average distance between each data point and the mean of the dataset.
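
A direct translation of the definition (the function name and example values are illustrative):

```python
import numpy as np

def mean_absolute_deviation(x):
    """Average absolute distance of each point from the mean of the data."""
    x = np.asarray(x, dtype=float)
    return np.mean(np.abs(x - x.mean()))

print(mean_absolute_deviation([2, 4, 6, 8]))  # mean is 5, so (3+1+1+3)/4 = 2.0
```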


It is the sum of the squares of the differences between the predicted and actual target variable over all data points, divided by the number of data points.
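
The definition in code (function name and toy values are illustrative):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((y_true - y_pred) ** 2)

print(mse([1, 2, 3], [1, 2, 5]))  # (0 + 0 + 4) / 3 ≈ 1.333
</```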


Model validation is a set of processes intended to evaluate the goodness of fit of the model; the ultimate goal is to produce an accurate and credible model.


It is a gradient descent algorithm in which the update depends on the gradients of the current and preceding steps. It computes the update direction as an exponentially weighted average of past gradients.
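
A sketch of the momentum update on a toy objective (the learning rate, decay factor and objective f(x) = x² are chosen only for illustration):

```python
# Momentum update minimizing f(x) = x**2, whose gradient is 2x.
lr, beta = 0.1, 0.9          # learning rate and momentum decay (illustrative)
x, v = 5.0, 0.0              # starting point and initial velocity
for _ in range(200):
    grad = 2 * x
    v = beta * v + (1 - beta) * grad   # exponentially weighted gradient average
    x -= lr * v

print(x)  # x ends very close to 0, the minimum of f
```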


Multidimensional scaling is a set of related ordination techniques used to visualize the similarities and dissimilarities in a dataset. It is a form of non-linear dimensionality reduction in which each item in the matrix that holds the item-item similarities is mapped to a low-dimensional space. The major types of MDS algorithms include classical, metric, non-metric and generalized multidimensional scaling.


A Multilayer Perceptron Neural Network is an interconnected web of nodes, called neurons (mathematical functions), and the edges that join them together. A neural network's main function is to receive a set of inputs, perform progressively complex calculations, and then use the output to solve a problem. Neural networks develop algorithms modelled on the processing of the human brain and build models for complex pattern and prediction problems. The behavior of an artificial neural network is determined by both its weights and its input-output function, the activation function, which can be linear, threshold or sigmoid. A neural network with differentiable activation functions can be trained using backpropagation, which adjusts the weights to reduce the error.


The mutual information score between two clusterings is a measure of the similarity between two labelings of the same data, i.e., it measures the dependency between two random variables.
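
A quick sketch with scikit-learn (the label vectors are invented; note that identical partitions score maximally even when the label names differ):

```python
from sklearn.metrics import mutual_info_score, normalized_mutual_info_score

labels_a = [0, 0, 1, 1, 2, 2]   # one clustering of six points
labels_b = [1, 1, 0, 0, 2, 2]   # the same partition with permuted label names

print(mutual_info_score(labels_a, labels_b))             # maximal MI here
print(normalized_mutual_info_score(labels_a, labels_b))  # 1.0 for identical partitions
```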


Naive Bayes is a probabilistic classifier. It is based on applying Bayes' theorem with strong independence assumptions among the features. Naive Bayes often outperforms other sophisticated classification methods when the dimensionality of the inputs is high. As a general-purpose classifier that makes no assumptions about the variable distribution, it lets the user fit data with various distribution classes. A Naive Bayes classifier can be used for document classification in NLP by setting up either a multinomial model or a Bernoulli model.


NLP is concerned with the development of applications that can process and manipulate large amounts of natural language data. It is a component of artificial intelligence. Natural language processing includes speech recognition, natural language understanding, natural language generation and so on.


It keeps track of the previously accumulated gradients and uses them for the update. It first applies the accumulated gradient, then computes the gradient at the resulting look-ahead point and makes a correction, speeding up SGD.


The Neural Gas clustering algorithm is similar to the Self-Organizing Map and can be used for clustering related data based on feature vectors. It is an artificial neural network composed of N neurons in which, during training, the neurons tend to move around abruptly according to the distance of their reference vectors to the input signal, hence the name neural gas. The adaptation step of neural gas can be interpreted as gradient descent on a cost function.


Deep Learning is all about Neural Networks. The structure of a neural network is like any other kind of network: there is an interconnected web of nodes, called neurons (mathematical functions), and the edges that join them together. A neural network's main function is to receive a set of inputs, perform progressively complex calculations, and then use the output to solve a problem. Neural networks develop algorithms modelled on the processing of the human brain and build models for complex pattern and prediction problems. Neural networks are used for lots of different applications.


Normalization is the process of scaling data to a standardized range. It is usually performed in the data pre-processing phase. Data normalization re-scales the attributes to the range 0 to 1; it is useful when the data has varying scales. Data standardization re-scales the attributes so that they have mean 0 and standard deviation 1.
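
Both re-scalings in a few lines of NumPy (the example values are illustrative):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

# Min-max normalization: re-scale to the range 0..1
normalized = (x - x.min()) / (x.max() - x.min())
print(normalized)            # [0.  0.333...  0.666...  1.]

# Standardization: re-scale to mean 0 and standard deviation 1
standardized = (x - x.mean()) / x.std()
print(standardized.mean(), standardized.std())  # ~0.0 and 1.0
```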


One Hot Encoding is the process by which categorical integer features are converted into a sparse matrix that machine learning algorithms can work with.
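
A minimal NumPy sketch (the label values are illustrative); each row contains a single 1 in the column of its class:

```python
import numpy as np

labels = np.array([0, 2, 1, 2])       # categorical integer features
n_classes = labels.max() + 1

one_hot = np.eye(n_classes)[labels]   # row i selects the labels[i]-th identity row
print(one_hot)
```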


It is a linear model used for binary classification, consisting of a step function with a threshold value that outputs a single binary value depending on the input and its associated weights.


An algorithm that takes a collection of sentences and extracts all n-gram phrases up to the MaxNGramSize.
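
A simple sketch of n-gram extraction for a single sentence (the function name and whitespace tokenization are assumptions of this illustration):

```python
def extract_ngrams(sentence, max_ngram_size):
    """Return all n-gram phrases up to max_ngram_size from a sentence."""
    words = sentence.split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_ngram_size + 1)
            for i in range(len(words) - n + 1)]

print(extract_ngrams("the quick brown fox", 2))
# unigrams first, then bigrams: ..., 'the quick', 'quick brown', 'brown fox'
```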


An algorithm used for stemming words for information retrieval. Stemming is the process of reducing derived words to their word stem, in which suffixes are removed automatically. It reduces the size and complexity of the data, which is always advantageous. The Porter stemmer algorithm is applied sequentially in five phases; in each phase, conditions are tested and a suffix is removed as its rule fires.


Part-of-speech tagging is an algorithm that marks up each word in a text as corresponding to a particular part of speech, based on its relationship with adjacent and related words in the corpus.


It is defined as the ratio of the number of correct positive predictions to the total number of positive predictions. It is also known as positive predictive value or PPV: PPV = TP/(TP+FP). Precision reaches its best value at 1.0 and its worst value at 0.0.
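
The formula in code (the counts are invented for illustration):

```python
def precision(tp, fp):
    """Positive predictive value: TP / (TP + FP)."""
    return tp / (tp + fp)

# 8 correct positive predictions out of 10 positive predictions
print(precision(tp=8, fp=2))  # 0.8
```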


Principal component analysis is a linear technique used for dimensionality reduction. It linearly transforms the data to a lower-dimensional space, reconstructing a set of correlated variables as a smaller set of uncorrelated variables called principal components, based on a least-squares criterion. It can also serve the purpose of data compression and can identify potential clusters in the data.
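
A PCA sketch via SVD of the centered data (the synthetic dataset, with a deliberately redundant third column, is an assumption of this illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + X[:, 1]          # third column is perfectly correlated

Xc = X - X.mean(axis=0)              # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt[:2]                  # top-2 principal directions
projected = Xc @ components.T        # data in the reduced space
print(projected.shape)               # (100, 2)
print(S)                             # third singular value is ~0 (redundancy)
```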


Probabilistic PCA is a technique used for dimensionality reduction using a latent variable model with linear relationships. The PPCA algorithm is preferred for handling missing data. It can be expressed as the maximum likelihood solution of a probabilistic latent variable model.


Quadratic Discriminant Analysis is a classifier with a quadratic decision surface. QDA models the conditional probability density functions as Gaussian distributions. The Gaussian parameters can be estimated using maximum likelihood estimation; the posterior distribution can then be used to obtain the class of the given data. In QDA the covariances of the classes need not be identical.


An RBF network uses radial basis functions as activation functions and mainly acts as a function approximator. It can be represented as an approximating function that is the sum of radial basis functions associated with different centers and weight coefficients. RBF networks can be trained with a two-step algorithm: first choosing the center vectors of the RBF functions, then fitting a linear model with coefficients to the hidden layer's outputs. They can also be used for time series prediction and control.


Random Forest is an ensemble learning method for classification and regression. It consists of many decision trees that build classification or regression models, and the output is obtained by taking the majority vote of the individual decision trees. The training set is randomly sampled for growing each tree. The bagging idea and the random selection of features, used to construct a collection of trees with controlled variance, improve the stability and accuracy of the algorithm.


The Rand Index is a measure of the similarity between two data clusterings. It is defined as the number of pairs of samples on which the clusterings agree (placed in the same cluster in both, or in different clusters in both) divided by the total number of pairs of samples. Its value ranges from 0 to 1, where 0 indicates the two clusterings do not agree on any pair of points and 1 indicates the clusterings are exactly the same.
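
The pairwise-agreement definition translated directly (the function name and label vectors are illustrative):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs)
    return agree / len(pairs)

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0 — identical partitions
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # partial agreement
```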


Random projection is a simple and less error-prone technique used for dimensionality reduction. The idea behind random projection is that points in a high-dimensional vector space can be projected into a lower-dimensional space without greatly distorting the distances between the points. Since the projected dimension may still be too high, the dimension is reduced again in the case of mixtures of Gaussians; it is therefore a promising dimensionality reduction technique for learning mixtures of Gaussians.


A Radial Basis Function is a primary tool for interpolating multidimensional scattered data. An RBF is a real-valued function whose value depends on the distance from the origin. RBF interpolation can be represented as an approximating function that is the sum of N radial basis functions associated with different centers and weight coefficients. Commonly used radial basis functions are the Gaussian, multiquadric, inverse quadratic, inverse multiquadric, polyharmonic spline and thin plate spline.


Recall is the ratio of the number of relevant instances that have been retrieved to the total number of relevant instances. In information retrieval, sensitivity is called recall.


The receiver operating characteristic is a graphical representation created by plotting the true positive rate against the false positive rate at different threshold values. It illustrates the performance of a binary classification model at different classification thresholds.


It belongs to the family of neural networks in which information is passed across time-steps. An RNN draws each vector from a sequence of input vectors and models them one at a time, allowing the network to retain its state while modelling. RNNs can be used to produce predictive results for sequential data, where information cycles through a loop and decisions are made considering the current inputs and the information obtained from previously learned inputs.


A recursive autoencoder takes a sequence of representation vectors and reconstructs the input so that a reduced-dimensional representation of the sequence is obtained. Recursive autoencoders can be used to split sentences into segments for NLP.


A recursive neural network is composed of a shared-weight matrix and a binary tree structure that allow the network to learn varying sequences of text or images. It can be implemented as a sentence and scene parser.


A recursive neural tensor network is a supervised neural network that computes the supervised objective at each node of the tree. The tensor is used to calculate the gradient using a matrix of three or more dimensions. Recursive neural tensor networks can be used to break up an image into its composing objects and label the objects semantically.


Regression models are used to make predictions from data by understanding the relations between the data and some observed, continuous-valued response. Regression is about predicting a continuous quantity by approximating a mapping function from the input variables to a continuous output variable. It is used in applications like stock price prediction.


Like decision trees, regression trees can be learned by performing recursive partitioning. A regression tree is a decision tree for regression. The main problem associated with regression and classification trees is their high variance, but they can handle both numerical and categorical data.


Regularization is a technique that modifies the loss or gradient to minimize overfitting, with its strength controlled by a hyperparameter. It is a measure taken against overfitting so that the neural network can generalize well to new inputs.


In LDA and FLD the small eigenvalues are sensitive to the exact choice of training data. This gives rise to the need for RDA, which regularizes the covariance matrix of each class and allows the covariances of QDA to shrink towards a common variance as in LDA. The regularization factor α determines the complexity of the model: when α is one, RDA is equivalent to QDA, and when it is zero, RDA is equivalent to LDA.


A relevance ranking algorithm, which finds application in text indexing and retrieval, is a method used to order the results list so that the most relevant records are listed first.


An activation function in which the neuron is activated when the input is above a threshold; above zero the independent variable has a linear relationship with the dependent variable, while below zero the output is zero. Mathematically it is expressed as f(x) = x if x > 0, and 0 otherwise.
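
The piecewise definition in one line of NumPy (the example inputs are illustrative):

```python
import numpy as np

def relu(x):
    """Rectified linear unit: x for positive inputs, 0 otherwise."""
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # [0.  0.  0.  1.5 3. ]
```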


The residual sum of squares is a statistical measure of the variance in a dataset which is not explained by the regression model. It can also be defined as a measure of the discrepancy between the regression function and the dataset.


Restricted Boltzmann machines are mainly used in deep learning for feature extraction and dimensionality reduction. An RBM consists of a visible layer and a hidden layer connected by weighted connections, with no units of the same layer connected to each other. RBMs are used for pretraining layers in large networks to reconstruct the original data from a limited set of samples.


Ridge Regression is a technique for analyzing multiple regression data that exhibit multicollinearity, in which one predictor variable can be linearly predicted from the others with considerable accuracy. It provides a regularization method for ill-posed problems by shrinking the coefficients, adding a degree of bias to the regression estimates.
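
The shrinkage is visible in the closed-form solution (the toy data encodes y = 2x; penalizing the intercept too, as done here, is a simplifying assumption of this sketch):

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution: (X^T X + alpha * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # intercept column + x
y = np.array([2.0, 4.0, 6.0])

print(ridge_fit(X, y, alpha=0.0))   # alpha = 0 recovers plain OLS: [0. 2.]
print(ridge_fit(X, y, alpha=1.0))   # coefficients are shrunk towards zero
```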


RMSprop is a gradient-based optimization algorithm which resolves the issue of diminishing learning rates in AdaGrad. It divides the learning rate by an exponentially decaying average of squared gradients.


Root mean square error is the square root of the mean squared error. It aggregates the magnitudes of the errors of the predictions into a single measure of predictive power. RMSE is useful when large errors are particularly undesirable.


Sammon's mapping is an iterative algorithm used for multidimensional scaling. It projects a high-dimensional space into a low-dimensional space while retaining the structure of the inter-point distances of the high-dimensional space. Sammon's mapping can be used to isometrically project an object into a lower-dimensional space, or to project it down while minimizing the distortion in the inter-point distances, thereby limiting the change in the topology of the object.


Sensitivity or true positive rate is the ratio of correct positives to the total number of positives: SN = TP/(TP+FN). The best sensitivity is 1.0 and the worst is 0.0. It is a statistical measure of the performance of a binary classification test.


In NLP tasks the input text has to be divided into sentences. This algorithm is a simple sentence splitter for English that identifies the boundaries of sentences and returns the text as a list of strings, where each string corresponds to a sentence.


The Sequential Information Bottleneck algorithm is a technique used to cluster co-occurrence data such as text documents vs. words. It randomly draws a document from a cluster and finds a new cluster for it by minimizing a merging criterion. It is preferred to employ the unweighted Jensen-Shannon divergence as the criterion.


It is a special case of normalized radial basis function interpolation, developed for interpolation of arbitrarily spaced discrete bivariate data. Shepard interpolation is widely used because of its simplicity; it is fast, simple and efficient for quick-and-dirty applications.


An activation function which is a special case of the logistic function, in which extreme values and outliers in the data can be reduced without removing them. The sigmoid function converts independent variables of near-infinite range into simple probabilities between 0 and 1. The function can be expressed as f(x) = 1/(1+exp(-x)).


It is defined as the ratio of signal strength to noise, where signal strength is characterized by the difference in class-conditional means and noise by the sum of the class-conditional standard deviations, i.e., SNR = |μ1 - μ2| / (σ1 + σ2), where μ1 and μ2 are the mean values of the variable in class 1 and class 2, and σ1 and σ2 are their standard deviations. SNR is a feature ranking metric which can be used as a benchmark for feature selection in binary classification. The larger the value of SNR, the better the feature for classification.


The softmax function is an activation function which can be applied to continuous data. It can represent multiple decision boundaries and can handle multinomial labeling systems. The softmax function is often used at the output layer of a classifier. It can be represented as f(x_i) = exp(x_i) / ∑_(j=0 to k) exp(x_j), for i = 0, 1, 2, …, k.
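
A small NumPy sketch (the input vector is illustrative; shifting by the maximum before exponentiating is a standard numerical-stability trick, not part of the formula itself):

```python
import numpy as np

def softmax(x):
    """Softmax with a max-shift for numerical stability."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p)          # probabilities, largest for the largest input
print(p.sum())    # 1.0
```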


An activation function which is a smooth version of ReLU. It overcomes the dying-ReLU issue by being differentiable everywhere, and it causes less saturation overall. Mathematically it is expressed as f(x) = log(1+e^x).


An activation function for neural networks used as an alternative to the hyperbolic tangent. The softsign function converges polynomially whereas tanh converges exponentially. It can be mathematically expressed as f(x) = x / (1+|x|).


Specificity (SP) or true negative rate evaluates the performance of binary classification by calculating the ratio of correct negatives to the total number of negatives: SP = TN/(TN+FP). Specificity has its best value at 1.0 and worst at 0.0.


Spectral Clustering is a data clustering algorithm which makes use of the eigenvalues of the similarity matrix of the data to perform dimensionality reduction. The main objective of spectral clustering is to cluster connected data. The algorithm works by obtaining a representation of the data in a low-dimensional space that can be easily clustered.


It is a stochastic approximation of gradient descent optimization that optimizes the loss function by assuming a batch size of one. As it is a stochastic approximation, a single example is selected randomly to calculate the gradient at each iteration.


The sum squares ratio can be used as a feature selection criterion for multi-class problems. It is a univariate feature ranking metric defined as the ratio of the between-groups sum of squares to the within-groups sum of squares.


Support Vector Machines are algorithms used for classification analysis. An SVM performs linear classification by choosing a hyperplane that acts as a margin between two classes. In addition, it can perform non-linear classification by using the kernel trick, which allows the algorithm to fit the maximum-margin hyperplane in a high-dimensional feature space. The effectiveness of an SVM depends on the selection of the kernel function, its parameters, and the soft margin penalty parameter. A multi-class SVM can be created by reducing the problem to multiple binary classifications.
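
A kernel-trick sketch with scikit-learn (the XOR toy data, gamma and C values are assumptions chosen for illustration; XOR is not linearly separable, so the RBF kernel is needed):

```python
import numpy as np
from sklearn.svm import SVC

# XOR-like data: no single hyperplane separates the two classes in 2-D
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

clf = SVC(kernel="rbf", gamma=2.0, C=10.0).fit(X, y)
print(clf.predict(X))   # the RBF kernel recovers the XOR labels
```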


Support Vector Regression is a regression method that uses the same principles as SVM classification. It is an effective tool for estimating a real-valued function by setting a margin of tolerance ε. Like SVM it uses the kernel trick for implicit mapping, and the potency of the model also depends on ε, the loss function error threshold.


It is an activation function used in deep learning. Tanh is a hyperbolic trigonometric function defined by f(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x)). Tanh has a normalized range between -1 and 1 and can therefore deal with negative numbers.


Assuming the text has been split into sentences, the tokenizer segments each sentence into individual tokens, i.e., it receives a stream of characters and outputs a stream of tokens.


The t-distributed stochastic neighbor embedding is a non-linear dimensionality reduction technique in which data can be embedded into a lower-dimensional (2D or 3D) space, which can then be visualized in a scatter plot. The dimensionality reduction is done in such a way that similar objects are modeled by nearby points and dissimilar objects by distant points. The t-SNE algorithm comprises two stages: constructing a probability distribution over pairs of high-dimensional objects, and then defining a similar probability distribution in the low-dimensional map.


A variational autoencoder consists of an encoder, a decoder and a loss function, where the encoder and decoder are neural nets and the loss function is the negative log-likelihood with a regularizer. Their use of a variational approach for latent representation learning makes them useful for generative modelling.


Data can be loaded using a data pre-processing library, which makes building data pipelines easier. It converts the loaded data into a format that neural networks can understand. It is designed to support all major types of input data, whether text, CSV, audio, image or video.


Weights are coefficients that scale the input signal to a given neuron in the network. A weight can also be defined as a coefficient of a feature in a linear model or an edge in a deep network. The value of a weight determines whether and how much the feature contributes to the model.


X-Means clustering is a variation of the K-Means clustering algorithm in which the number of clusters is determined automatically based on the Bayesian Information Criterion (BIC) score. The clustering process starts with a single cluster which is repeatedly partitioned, keeping the optimal resulting splits, until the criterion is reached.