Using Python, Pandas, Scikit-learn and Bokeh, I will explore a 5+ GB data set of publicly available balance sheet data covering all Federal Deposit Insurance Corporation (FDIC) regulated U.S. banks from Q1 1992 through Q4 2015. The objective is to understand how to measure the size of a bank and how to model the distribution of bank size.
Specifically, I will develop a few different measures of bank size and then assess the statistical support for Zipf's Law (i.e., a power law distribution with a scaling exponent of roughly α = 2) as an appropriate model for the upper tail of the size distribution of U.S. banks. Although I will find statistically significant departures from Zipf's Law for most measures of bank size in most years, a power law distribution with α = 1.9 outperforms other plausible heavy-tailed alternative distributions.
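To make the scaling exponent concrete, here is a minimal sketch of one common way to estimate α for the upper tail: the continuous maximum-likelihood (Hill-type) estimator, α̂ = 1 + n / Σ ln(xᵢ / x_min). It assumes the density convention p(x) ∝ x^(−α) for x ≥ x_min, under which Zipf's Law corresponds to α = 2. The function name `fit_power_law_alpha`, the fixed `x_min`, and the synthetic sample are illustrative assumptions, not the FDIC data or necessarily the estimation procedure used in this analysis.

```python
import numpy as np


def fit_power_law_alpha(sizes, x_min):
    """Estimate the power law exponent of the tail sizes >= x_min.

    Hypothetical helper: uses the continuous maximum-likelihood estimator
    alpha_hat = 1 + n / sum(ln(x_i / x_min)), with the asymptotic
    standard error (alpha_hat - 1) / sqrt(n).
    """
    sizes = np.asarray(sizes, dtype=float)
    tail = sizes[sizes >= x_min]  # keep only the upper tail
    n = tail.size
    if n < 2:
        raise ValueError("need at least two observations at or above x_min")
    alpha_hat = 1.0 + n / np.log(tail / x_min).sum()
    std_err = (alpha_hat - 1.0) / np.sqrt(n)
    return alpha_hat, std_err, n


# Synthetic "bank sizes" drawn from an exact power law with alpha = 2,
# using inverse-transform sampling: x = x_min * (1 - u) ** (-1 / (alpha - 1)).
rng = np.random.default_rng(42)
true_alpha = 2.0
x_min = 1.0
u = rng.random(50_000)
synthetic_sizes = x_min * (1.0 - u) ** (-1.0 / (true_alpha - 1.0))

alpha_hat, std_err, n_tail = fit_power_law_alpha(synthetic_sizes, x_min)
print(f"alpha_hat = {alpha_hat:.3f} +/- {std_err:.3f} (n = {n_tail})")
```

In practice `x_min` is not known in advance and has to be chosen from the data, and the fitted power law still needs to be compared against heavy-tailed alternatives before drawing conclusions.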