Introduction to RNNs and LSTMs

This post explores the basics of vanilla recurrent neural networks (RNNs) and long short-term memory networks (LSTMs). What neural architecture can effectively handle sequences? Suppose we want a neural network to perform a sequence-processing task, such as generating human names, image captioning, or sentiment classification of text. What kind of architecture would be suited for these kinds of problems? The problem with dense neural networks: [diagram of a dense neural network for reference]...

2021-10-28
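To make "recurrent" concrete, here is a minimal NumPy sketch of a vanilla RNN cell stepping through a toy sequence. The weight names (`W_xh`, `W_hh`, `b_h`) and the dimensions are illustrative assumptions, not the post's own notation:

```python
import numpy as np

# A minimal vanilla RNN cell, shown as an illustrative sketch.
# At each time step the hidden state is updated from the previous state
# and the current input: h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b_h).
rng = np.random.default_rng(0)

input_dim, hidden_dim, seq_len = 4, 8, 5
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input-to-hidden weights (assumed names)
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # recurrent hidden-to-hidden weights
b_h = np.zeros(hidden_dim)

xs = rng.normal(size=(seq_len, input_dim))  # a toy input sequence of length 5
h = np.zeros(hidden_dim)                    # initial hidden state

for x_t in xs:
    # The same weights are reused at every step, which is what lets the
    # network process sequences of arbitrary length.
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)

print(h.shape)  # (8,) -- the final hidden state summarizes the whole sequence
```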

Entropy-Based Loss Functions

Entropy-based loss functions, such as cross-entropy and Kullback-Leibler divergence, are commonly used in deep learning algorithms. This post is a summary of my understanding of them. What is information? Since the definition of entropy involves the concept of information (in the information-theoretic sense introduced by Claude Shannon), I found it helpful to first understand what information is in order to better understand entropy. Information can be interpreted as a measure of “surprise”....

2021-10-16
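For reference, the quantities the excerpt alludes to can be written in Shannon's standard form (generic notation, which may differ from the post's):

$$I(x) = -\log P(x), \qquad H(P) = -\sum_{x} P(x)\log P(x)$$

$$H(P, Q) = -\sum_{x} P(x)\log Q(x), \qquad D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x)\log\frac{P(x)}{Q(x)} = H(P, Q) - H(P)$$

A low-probability outcome carries more "surprise"; entropy is the expected surprise under $P$, and cross-entropy and KL divergence measure, respectively, the cost and the mismatch of modeling data from $P$ with a distribution $Q$.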

Central Limit Theorem Proof

I am documenting my understanding of the Central Limit Theorem because it seems foundational for statistics, which in turn is foundational for machine learning theory. For example, the theorem is the reason we can assume that noise/error is normally distributed in many situations. This post assumes some knowledge of elementary calculus and statistics, but I will try to remind you of the important points from those topics to aid understanding, in case they have been forgotten....

2021-09-28
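For context, one common statement of the theorem (in generic notation; the post's own statement may differ): if $X_1, \dots, X_n$ are independent and identically distributed with mean $\mu$ and finite variance $\sigma^2$, then the standardized sample mean converges in distribution to a standard normal:

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{\;d\;} \mathcal{N}(0, 1) \quad \text{as } n \to \infty$$

This is what justifies treating aggregated noise/error as approximately normal in many situations, as the excerpt notes.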

Mean-Squared Error Justifications

Mean-squared error (MSE) is commonly used in loss functions for regression problems. What justifies its usage? What are we actually doing when we minimize the MSE? We will address these questions from a few different perspectives. $MSE = \frac{1}{2N}\sum_{i=1}^{N}(y_{i} - \hat{y_{i}})^2$. A maximum likelihood perspective. Givens and assumptions: $X$ is the set of inputs $\{x_{i}\}$ for all $i$; $Y$ is the set of observed outputs $\{y_{i}\}$ for all $i$; and $y_{i} = h(x_{i}) + e_{i}$, where $h$ is some hypothesis function and $e_{i}$ is normally distributed noise. Aside: Why can we assume that noise is normally distributed?...

2021-08-31
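As a sketch of the maximum likelihood argument the excerpt sets up, assuming independent Gaussian noise $e_{i} \sim \mathcal{N}(0, \sigma^2)$ with fixed variance: the likelihood of the observed outputs factorizes, and maximizing its logarithm is equivalent to minimizing the sum of squared errors:

$$p(Y \mid X) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_{i} - h(x_{i}))^2}{2\sigma^2}\right)$$

$$\log p(Y \mid X) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\big(y_{i} - h(x_{i})\big)^2$$

Since the first term and the factor $1/(2\sigma^2)$ do not depend on $h$, maximizing the log-likelihood over $h$ is the same as minimizing $\sum_{i}(y_{i} - h(x_{i}))^2$, i.e. the MSE up to a constant factor.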