📊 Key Concepts & Foundations
Algorithm Complexity
Big-O notation gives an asymptotic upper bound on growth, most often applied to worst-case running time: \(O(1)\) constant, \(O(\log n)\) logarithmic, \(O(n)\) linear, \(O(n \log n)\) linearithmic, \(O(n^2)\) quadratic, \(O(2^n)\) exponential. Crucial for building scalable systems.
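To get a feel for how quickly these classes separate, here is a minimal Python sketch that tabulates rough operation counts per class (the sample sizes are arbitrary illustrations):

```python
import math

# Rough operation counts for each complexity class at input size n.
def growth(n):
    return {
        "O(1)": 1,
        "O(log n)": math.log2(n),
        "O(n)": n,
        "O(n log n)": n * math.log2(n),
        "O(n^2)": n ** 2,
    }

for n in (10, 1000, 10**6):
    print(n, {k: f"{v:.3g}" for k, v in growth(n).items()})
```

At \(n = 10^6\) the gap between \(O(n \log n)\) and \(O(n^2)\) is already a factor of about 50,000, which is why the choice of algorithm dominates constant-factor tuning.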
Machine Learning Fundamentals
Supervised learning minimizes a loss function: \(L(\theta) = \sum \ell(y_i, f(x_i;\,\theta))\). Gradient descent updates: \(\theta \leftarrow \theta - \alpha\,\nabla L\). Bias-variance tradeoff governs generalization.
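The update rule \(\theta \leftarrow \theta - \alpha\,\nabla L\) can be sketched on a toy one-parameter least-squares problem (the data and learning rate below are arbitrary illustrations):

```python
# Gradient descent on a toy squared-error loss L(theta) = sum (y_i - theta*x_i)^2.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]  # roughly y = 2x plus noise

theta, alpha = 0.0, 0.01
for _ in range(500):
    # dL/dtheta = sum -2 * x_i * (y_i - theta * x_i)
    grad = sum(-2 * x * (y - theta * x) for x, y in zip(xs, ys))
    theta -= alpha * grad

print(theta)  # ~2.04, close to the slope that generated the data
```

For this quadratic loss the iterates contract toward the closed-form minimizer \(\sum x_i y_i / \sum x_i^2\); in practice the learning rate \(\alpha\) must be small enough that each step actually decreases the loss.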
Neural Networks
Layers of neurons: \(\mathbf{z} = W\mathbf{x} + \mathbf{b}\), \(\mathbf{a} = \sigma(\mathbf{z})\). Backpropagation computes \(\partial L / \partial W\) via the chain rule. Universal approximation theorem: a single hidden layer with enough neurons can approximate any continuous function on a compact domain to arbitrary accuracy.
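The forward pass \(\mathbf{z} = W\mathbf{x} + \mathbf{b}\), \(\mathbf{a} = \sigma(\mathbf{z})\) for one dense layer can be written directly (the weights and inputs below are made-up examples):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One dense layer: z = W x + b, then the elementwise activation a = sigma(z).
def forward(W, x, b):
    z = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    return [sigmoid(z_i) for z_i in z]

W = [[0.5, -0.3], [0.2, 0.8]]
b = [0.1, -0.1]
print(forward(W, [1.0, 2.0], b))
```

Backpropagation then runs the same structure in reverse, multiplying local derivatives (\(\sigma'(z) = \sigma(z)(1-\sigma(z))\) for the sigmoid) along the chain rule.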
Fourier Transforms
$$F(\omega) = \int_{-\infty}^{\infty} f(t)\,e^{-i\omega t}\,dt$$ Decomposes signals into frequency components. DFT for discrete data: \(O(n^2)\); FFT algorithm achieves \(O(n \log n)\).
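The discrete version can be implemented naively in \(O(n^2)\) straight from the definition \(X_k = \sum_n x_n e^{-2\pi i k n / N}\); a minimal sketch (the test signal, a pure tone at bin 3, is an arbitrary choice):

```python
import cmath
import math

# Naive O(n^2) DFT: X_k = sum_n x_n * exp(-2*pi*i*k*n/N).
def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N))
            for k in range(N)]

# A real cosine at bin 3 concentrates its energy at k = 3 (and its mirror N - 3).
N = 16
signal = [math.cos(2 * math.pi * 3 * n / N) for n in range(N)]
X = dft(signal)
print(max(range(N // 2), key=lambda k: abs(X[k])))  # prints 3
```

An FFT computes the same \(X_k\) by recursively splitting into even- and odd-indexed subsequences, which is where the \(O(n \log n)\) cost comes from.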
Numerical Methods
Newton's method: \(x_{n+1} = x_n - \dfrac{f(x_n)}{f'(x_n)}\). Euler's method for ODEs: \(y_{n+1} = y_n + h\,f(t_n, y_n)\). Simpson's rule for integration. Stability and convergence are paramount.
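The Newton iteration is a few lines of code; as a sketch, here it is applied to \(f(x) = x^2 - 2\) (an arbitrary example whose root is \(\sqrt{2}\)):

```python
# Newton's method: x_{n+1} = x_n - f(x_n) / f'(x_n).
def newton(f, fprime, x0, iters=10):
    x = x0
    for _ in range(iters):
        x -= f(x) / fprime(x)
    return x

# Root of x^2 - 2, i.e. sqrt(2), starting from x0 = 1.
root = newton(lambda x: x * x - 2, lambda x: 2 * x, 1.0)
print(root)
```

Near a simple root the error roughly squares at each step (quadratic convergence), so a handful of iterations reaches machine precision; a poor starting point or \(f'(x_n) \approx 0\) can still make the iteration diverge.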
Monte Carlo Simulation
Use random sampling to estimate quantities: \(\pi \approx 4 \times \dfrac{\text{points in circle}}{\text{total points}}\). Law of large numbers guarantees convergence. Error scales as \(1/\sqrt{N}\).
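The \(\pi\) estimator above is a few lines of Python (seeded here only to make the run reproducible):

```python
import random

# Estimate pi from the fraction of uniform points inside the unit quarter circle.
def estimate_pi(n, seed=0):
    rng = random.Random(seed)
    inside = sum(1 for _ in range(n)
                 if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * inside / n

print(estimate_pi(100_000))  # close to pi; error shrinks like 1/sqrt(N)
```

Because the error scales as \(1/\sqrt{N}\), each extra decimal digit of accuracy costs roughly 100x more samples.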
Bayesian Inference
$$P(\theta|D) = \frac{P(D|\theta)\,P(\theta)}{P(D)}$$ Prior beliefs updated by data likelihood to form posterior. Conjugate priors simplify computation. MCMC for complex posteriors.
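A conjugate-prior update needs no integration at all. As a sketch, the standard Beta-Binomial pair: a \(\text{Beta}(a, b)\) prior on a coin's heads probability, updated by \(k\) heads in \(n\) flips, gives a \(\text{Beta}(a + k,\, b + n - k)\) posterior (the counts below are made up):

```python
# Conjugate update: Beta(a, b) prior + Binomial data -> Beta(a + k, b + n - k).
def beta_binomial_update(a, b, successes, trials):
    return a + successes, b + trials - successes

# Uniform Beta(1, 1) prior; observe 7 heads in 10 flips.
a, b = beta_binomial_update(1, 1, 7, 10)
posterior_mean = a / (a + b)
print(a, b, round(posterior_mean, 3))  # prints 8 4 0.667
```

When no such closed form exists, MCMC methods draw samples from the posterior instead.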
Information Theory
Entropy: $$H(X) = -\sum p(x)\,\log_2 p(x)\;\text{bits}$$ Mutual information: \(I(X;Y) = H(X) - H(X|Y)\). KL divergence measures distribution difference. Shannon's channel capacity theorem.
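The entropy formula translates directly to code; a minimal sketch, using the dyadic distribution \(\{0.5, 0.25, 0.125, 0.125\}\) as the example:

```python
import math

# Shannon entropy in bits: H(X) = -sum p(x) * log2 p(x), skipping zero terms.
def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.25, 0.125, 0.125]))  # prints 1.75
```

Because every probability here is a power of 1/2, the entropy is exactly 1.75 bits, and an optimal prefix code can meet it with no rounding loss.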
✎ Problems
You have three sorting algorithms: Bubble Sort \(O(n^2)\), Merge Sort \(O(n \log n)\), and Radix Sort \(O(nk)\). For a dataset of \(n = 1{,}000{,}000\) integers each with at most \(k = 7\) digits, estimate the number of operations each algorithm requires. Which is fastest? At what value of \(n\) would Bubble Sort take approximately 1 year on a computer performing \(10^9\) operations per second?
A binary classifier has the following confusion matrix on 1000 test samples: True Positives = 180, False Positives = 40, True Negatives = 710, False Negatives = 70. Calculate accuracy, precision, recall, F1-score, and the false positive rate. If the cost of a false negative is 10x that of a false positive, should you adjust the decision threshold up or down?
Compute the entropy of a source that emits symbols \(\{A, B, C, D\}\) with probabilities \(\{0.5, 0.25, 0.125, 0.125\}\). Design an optimal prefix-free binary code (Huffman code) for this source. What is the average code length? Compare it to the entropy. Can you explain why they match exactly in this case?
Monte Carlo Estimation: You want to estimate the integral \(I = \displaystyle\int_0^1 e^{-x^2}\,dx\) using Monte Carlo sampling with \(N\) random points uniform on \([0,1]\). Write the Monte Carlo estimator. If \(N = 10{,}000\) samples yield a mean of \(0.7462\) with standard deviation \(0.1283\), construct a 95% confidence interval. How many samples would you need to reduce the margin of error to \(0.001\)?
Use Newton's method to find the root of \(f(x) = x^3 - 2x - 5\) starting from \(x_0 = 2\). Perform three iterations. What is the order of convergence? Compare the error at each step to verify quadratic convergence. The exact root is approximately \(2.09455148\).
Bayesian Reasoning: A medical test for a rare disease (prevalence 1 in 1000) has 99% sensitivity (true positive rate) and 99% specificity (true negative rate). If a patient tests positive, what is the probability they actually have the disease? Explain why this result surprises most people. What prevalence would make a positive test result 50% reliable?
A single-layer neural network (perceptron) with sigmoid activation \(\sigma(z) = \dfrac{1}{1+e^{-z}}\) is trained on 2D data with weights \(\mathbf{w} = [0.5,\,-0.3]\) and bias \(b = 0.1\). For input \(\mathbf{x} = [2, 3]\), compute the forward pass output. Then, if the true label is \(y = 1\) and we use binary cross-entropy loss \(L = -[y\ln(\hat{y}) + (1-y)\ln(1-\hat{y})]\), compute \(\partial L/\partial w_1\), \(\partial L/\partial w_2\), and \(\partial L/\partial b\) using backpropagation.
Fourier Analysis: A signal \(f(t) = 3\sin(2\pi \times 50\,t) + \sin(2\pi \times 120\,t)\) is sampled at \(1000\) Hz for 1 second (\(N = 1000\) points). What frequencies will appear in the DFT? What is the frequency resolution? If the sampling rate were reduced to \(200\) Hz, what happens to the 120 Hz component? Explain aliasing and the Nyquist theorem in this context.
Compare Euler's method, the midpoint method (RK2), and the classical Runge-Kutta (RK4) method for solving \(dy/dx = -2y\) with \(y(0) = 1\) (exact solution: \(y = e^{-2x}\)). Use step size \(h = 0.1\) and compute \(y(0.5)\). Calculate the absolute error for each method. How does the error scale with \(h\) for each method?
Bias-Variance Tradeoff: A polynomial regression model of degree \(d\) is fit to noisy data generated from a true quadratic function \(y = 2x^2 - x + 1 + \varepsilon\) where \(\varepsilon \sim \mathcal{N}(0, 0.5^2)\). Qualitatively describe the bias and variance as \(d\) increases from 1 to 20. At \(d = 1\), what is the nature of the error? At \(d = 15\) with 20 data points? What criterion (AIC, BIC, or cross-validation) would you use to select the optimal \(d\), and why?
KL Divergence & Information: Consider two probability distributions: \(P = \{0.3, 0.4, 0.2, 0.1\}\) and \(Q = \{0.25, 0.25, 0.25, 0.25\}\) (uniform). Compute the KL divergence \(D_{\text{KL}}(P \| Q)\) and \(D_{\text{KL}}(Q \| P)\). Show that KL divergence is not symmetric. Then compute the mutual information \(I(X; Y)\) for two binary variables where \(P(X=0,Y=0) = 0.4\), \(P(X=0,Y=1) = 0.1\), \(P(X=1,Y=0) = 0.2\), \(P(X=1,Y=1) = 0.3\).
MCMC & Bayesian Computation: You observe data \(x_1 = 3.2\), \(x_2 = 2.8\), \(x_3 = 3.5\), \(x_4 = 3.1\) from a normal distribution \(\mathcal{N}(\mu, 1)\). Using a prior \(\mu \sim \mathcal{N}(0, 10^2)\), derive the posterior distribution analytically. Then explain how the Metropolis-Hastings algorithm could be used to sample from the posterior when no closed-form solution exists. What is the acceptance probability formula? Why is a "burn-in" period necessary?