
# Benford’s distribution and the Central Limit Theorem (CLT)

According to most statistics textbooks, the usual rule of thumb is that the CLT is reasonably accurate if n > 30, unless the data are quite skewed. This rule is probably wishful thinking dating back to the pre-computer age (Chihara and Hesterberg 2019). We can generally obtain better approximations, and visualize them, using simulation-based methods such as simulating the MAD levels a thousand times or more. In our view, the best technical definition of the central limit theorem (Tilde, 2016) is: “When the distribution of any population has a finite variance, then the distribution of the arithmetic mean of random samples is approximately normal if the sample size is sufficiently large.”

Given the central limit theorem, the sampling distribution obtained from the invoice samples should be approximately normal; its mean can be taken as the estimate, and its variance can be used to provide a confidence interval for the Mean Absolute Deviation (MAD). Goodfellow, Bengio, and Courville (2016) note that the central limit theorem allows us to estimate the confidence intervals of an estimate using the cumulative distribution of the normal density.
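The confidence-interval idea above can be sketched in a few lines. This is a minimal illustration, not the procedure used for the results in this chapter: the function name `normal_confidence_interval`, the 95% level, and the placeholder MAD values are all assumptions made for the example.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def normal_confidence_interval(values, level=0.95):
    """Normal-approximation confidence interval for the mean of `values`."""
    k = len(values)
    center = mean(values)
    standard_error = stdev(values) / math.sqrt(k)
    # Two-sided critical value from the inverse normal CDF (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return center - z * standard_error, center + z * standard_error


# Placeholder data: 1,000 made-up MAD values, NOT results from this study.
rng = random.Random(0)
simulated_mads = [rng.gauss(0.008, 0.001) for _ in range(1000)]

low, high = normal_confidence_interval(simulated_mads)
print(f"95% CI for the mean MAD: [{low:.5f}, {high:.5f}]")
```

The interval narrows as the number of simulated values grows, because the standard error is divided by the square root of that number.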

This chapter investigates the central limit theorem together with Benford's distribution in experimental tests on up to 10,000 invoices. The purpose is to examine how the Central Limit Theorem applies to a large dataset generated to follow Benford's distribution. We then state Benford's distribution, the CLT, and the law of large numbers (LLN) side by side. Finally, we describe the testing method used to assess accuracy and goodness of fit.

The findings are presented in histograms for n = 1, n = 10, and n = 100, with k = 4000. The results confirm that with only one sample (n = 1) we do not get a normal distribution but a slightly skewed one, due to the high mean deviation. However, if we take ten or even 100 random samples, the distribution becomes approximately normal. This is the Central Limit Theorem (CLT) at work: the distribution approaches normality as the samples grow larger, while the law of large numbers ensures that the estimates settle around their expected value, even though our population is not normally distributed.

The theorem helps us understand the relationship between a large dataset and the normal distribution of the samples drawn from it.

Drawing inferences from Benford's sample data, we can see that where individual datasets may fail to meet the criteria of Benford's law, integrating the different datasets may produce a new series whose behavior is characteristically nearer to Benford's law.
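A small simulation can illustrate this pooling effect. The sketch below is an assumption-laden toy example, not the invoice data: it builds three synthetic datasets that are uniform on (0, 2], (0, 50], and (0, 900] (arbitrary limits chosen for illustration), and compares each dataset's first-digit MAD against the MAD of the pooled data. The helper names `leading_digit` and `mad` are hypothetical.

```python
import math
import random

# Benford's expected share for each first digit 1..9.
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def leading_digit(x):
    """First significant digit of a positive number."""
    return int(f"{x:.17e}"[0])


def mad(values):
    """Mean absolute deviation between observed first-digit shares and Benford's."""
    counts = [0] * 9
    for v in values:
        counts[leading_digit(v) - 1] += 1
    return sum(abs(c / len(values) - p) for c, p in zip(counts, BENFORD)) / 9


rng = random.Random(42)
# 1.0 - rng.random() lies in (0, 1], so every value is strictly positive.
datasets = [
    [upper * (1.0 - rng.random()) for _ in range(3000)] for upper in (2, 50, 900)
]

individual_mads = [mad(d) for d in datasets]
pooled_mad = mad([v for d in datasets for v in d])

print("individual:", [round(m, 4) for m in individual_mads])
print("pooled:    ", round(pooled_mad, 4))
```

In this toy setting, none of the three uniform datasets is close to Benford's distribution on its own, while the pooled data show a noticeably smaller MAD.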

On the other hand, the importance of the Central Limit Theorem (CLT) cannot be overstated. It contributes vitally to gaining insights into the ubiquity of Benford's law. With datasets designed in line with Benford’s Law, it is easier for the researcher to focus on situations in which the values are not concentrated in a small interval (Peng, 2019). The CLT therefore allows the researcher to work with a dataset that is not normally distributed, provided the sample size is sufficiently large. The CLT is a result of probability theory: the appropriately normalized sum of independent random variables tends toward a normal distribution even if the individual variables are not normally distributed.

In this regard, the CLT recognizes that the values of a variable can follow different distributions in a population, such as normal, left-skewed, right-skewed, or uniform, among others. The theorem states that the sampling distribution of the mean will nevertheless approximate the normal distribution, whatever the population distribution, provided the variance is finite and the sample is sufficiently large.
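The claim that sample means become more symmetric as the sample grows can be checked with a short simulation. The sketch below is illustrative only: the exponential population, the 4,000 repetitions, and the helper name `skewness` are assumptions chosen for the example, not part of the study.

```python
import random


def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    center = sum(values) / n
    second = sum((v - center) ** 2 for v in values) / n
    third = sum((v - center) ** 3 for v in values) / n
    return third / second ** 1.5


rng = random.Random(1)
simulations = 4000  # number of sample means per sample size (assumed)
skew_by_n = {}

for n in (1, 10, 100):
    # Each entry is the mean of n draws from a right-skewed exponential population.
    sample_means = [
        sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(simulations)
    ]
    skew_by_n[n] = skewness(sample_means)
    print(f"n = {n:>3}: skewness of sample means = {skew_by_n[n]:.3f}")
```

For an exponential population the skewness of the sample mean is 2 divided by the square root of n in theory, so the printed values should shrink toward zero as n increases.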

The CLT also implies that the sampling error decreases as the sample size increases. Hence, the theorem works on three components: the population's successes, the sample size, and the population distribution. Under this approach, the sample size is crucial for determining the sampling distribution (Peng, 2019).
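How quickly the sampling error shrinks can be made concrete: for independent draws, the standard deviation of the sample mean is the population standard deviation divided by the square root of n. The sketch below compares that formula with simulated values. The lognormal population of 10,000 values and the 2,000 repetitions are assumptions for illustration only.

```python
import math
import random
from statistics import pstdev, stdev

rng = random.Random(5)
# Synthetic, right-skewed stand-in population (not the invoice data).
population = [rng.lognormvariate(0.0, 1.0) for _ in range(10_000)]
sigma = pstdev(population)

spread_by_n = {}
for n in (1, 10, 100):
    # Standard deviation of 2,000 sample means, each from n values drawn
    # without replacement from the population.
    sample_means = [sum(rng.sample(population, n)) / n for _ in range(2000)]
    spread_by_n[n] = stdev(sample_means)
    print(
        f"n = {n:>3}: observed {spread_by_n[n]:.3f}, "
        f"sigma / sqrt(n) = {sigma / math.sqrt(n):.3f}"
    )
```

Multiplying the sample size by 100 reduces the spread of the sample mean by roughly a factor of 10, which is why gains in precision slow down as samples get larger.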

The scope of application is apparent in almost all disciplines, including physics, chemistry, astronomy, economics, engineering, and other fields that rely on mathematical and statistical expressions. The result should manifest the quantitative skewness associated with the Benford phenomenon in the natural sciences and in real-life data. The CLT helps us understand that almost all datasets span a high order of magnitude or show enough variability; it is therefore rare to find a dataset that is an exception to this rule.

Hence, following the mathematical formula of the multiplicative CLT, one obtains two results. The first is a dramatic increase in skewness, which is in line with Benford behavior. The second is an increasing order of magnitude, which is also an essential criterion for Benford behavior.
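The multiplicative effect can be illustrated by multiplying independent random factors and checking how close the first digits of the products come to Benford's distribution. This is a sketch under stated assumptions: the factors are drawn from Uniform(1, 10), each setting uses 5,000 products, and the helper names are hypothetical.

```python
import math
import random

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def leading_digit(x):
    """First significant digit of a positive number."""
    return int(f"{x:.17e}"[0])


def mad(values):
    """Mean absolute deviation between observed first-digit shares and Benford's."""
    counts = [0] * 9
    for v in values:
        counts[leading_digit(v) - 1] += 1
    return sum(abs(c / len(values) - p) for c, p in zip(counts, BENFORD)) / 9


def product_of_uniforms(rng, factors):
    """Product of `factors` independent draws from Uniform(1, 10)."""
    value = 1.0
    for _ in range(factors):
        value *= rng.uniform(1.0, 10.0)
    return value


rng = random.Random(7)
mad_by_factors = {
    factors: mad([product_of_uniforms(rng, factors) for _ in range(5000)])
    for factors in (1, 2, 10)
}
for factors, value in mad_by_factors.items():
    print(f"{factors:>2} factor(s): MAD = {value:.4f}")
```

A single uniform factor has roughly equal first-digit shares and therefore a large MAD, while the product of ten factors spreads over many orders of magnitude and its first digits fall close to Benford's distribution.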

The law of large numbers (LLN) underpins the frequentist definition of probability, in which the probability of an event E depends on the number of occurrences of that event. The probability is defined as the long-run relative frequency of the event over many trials: if E occurs m times in N trials, the ratio m/N approaches the probability of E as N grows.
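The long-run frequency idea can be demonstrated with the event "the leading digit is 1", whose Benford probability is log10(2), about 0.301. The sketch below is an illustration only; the number of trials and the checkpoints are arbitrary choices.

```python
import math
import random

rng = random.Random(3)
target = math.log10(2)  # Benford probability of a leading digit of 1

hits = 0
running_frequency = {}
for trial in range(1, 100_001):
    # A number 10**U with U uniform on [0, 1) has Benford-distributed leading
    # digits; its leading digit is 1 exactly when U < log10(2).
    if rng.random() < target:
        hits += 1
    if trial in (10, 100, 1_000, 10_000, 100_000):
        running_frequency[trial] = hits / trial

for trial, frequency in running_frequency.items():
    print(f"{trial:>7} trials: relative frequency = {frequency:.4f} (target {target:.4f})")
```

The relative frequency fluctuates for small numbers of trials and settles near the target as the number of trials grows.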

In the following figures, we randomly select n invoices from the population of 10,000 invoices, which is not normally distributed. In the first figure, we take one sample (n = 1); in the second, ten samples (n = 10); and in the last, 100 samples (n = 100).

With only one sample (n = 1), we do not get a normal distribution but a slightly skewed one. However, if we take ten or even 100 random samples, the distribution becomes approximately normal. This is the Central Limit Theorem (CLT) at work: the distribution approaches normality as the samples grow larger, even though our population is not normally distributed.
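A simulation of this kind can be sketched as follows. The exact procedure behind the figures is not spelled out here, so everything in the sketch is an assumption: the population is a synthetic set of log-uniform amounts rather than real invoices, the sample sizes of 100 and 1,000 invoices are arbitrary, and the function names are hypothetical. It only shows the general pattern of drawing random samples, computing the first-digit MAD of each, and repeating the draw many times.

```python
import math
import random
from statistics import mean, stdev

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def leading_digit(x):
    """First significant digit of a positive number."""
    return int(f"{x:.17e}"[0])


def mad(values):
    """Mean absolute deviation between observed first-digit shares and Benford's."""
    counts = [0] * 9
    for v in values:
        counts[leading_digit(v) - 1] += 1
    return sum(abs(c / len(values) - p) for c, p in zip(counts, BENFORD)) / 9


def simulate_mads(population, invoices_per_sample, simulations, rng):
    """MAD of each of `simulations` random samples drawn without replacement."""
    return [
        mad(rng.sample(population, invoices_per_sample)) for _ in range(simulations)
    ]


rng = random.Random(2020)
# Synthetic stand-in for the invoice population: amounts between 1 and 100,000
# whose logarithms are uniformly distributed, which yields Benford-like digits.
population = [10 ** rng.uniform(0.0, 5.0) for _ in range(10_000)]

mads_by_size = {
    size: simulate_mads(population, size, 200, rng) for size in (100, 1000)
}
for size, mads in mads_by_size.items():
    print(f"{size:>4} invoices per sample: mean MAD = {mean(mads):.4f}, "
          f"spread = {stdev(mads):.4f}")
```

The resulting lists of MAD values can be plotted as histograms; in this sketch, larger samples give both a smaller average MAD and a tighter spread.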

In Figure 1 below, a single sample was drawn from the population of 4,000 invoices. Across more than 200 simulations with n = 1, the Mean Absolute Deviation (MAD) showed a slightly right-skewed distribution rather than a normal one. With a single sample from a large invoice dataset, the distribution is less likely to be normal because of the large MAD values, which ranged between 0.0205 and 0.0220 across the simulations.

It can further be seen that the data points in the simulation form a slightly positively skewed histogram because the data point values deviate from their MAD values. As the number of simulations increases, the distance between the data points and the MAD also increases, owing to the high variation and the low probability of the same number appearing again in the simulations.

The same point can be seen from the distance between adjacent data points in the diagram, such as 0.0205, 0.0210, 0.0215, and 0.0220. The CLT also accounts for the skewness: the symmetric normal distribution is the eventual distribution that develops only after many additions of random variables. Small samples are far from that eventual shape, as the histogram does not show a nice curve around the center and falls off unevenly on one side.

Figure 1: Benford’s law and the Central Limit Theorem with n = 1. Source: Franco Arda (2020).

Moreover, in Figure 2 below, a sample of 10 numbers was drawn from the population of 4,000 invoices. Across more than 200 simulations with n = 10, the Mean Absolute Deviation (MAD) showed an approximately normal distribution. The distribution confirms that with a sample of 10 numbers from a large invoice dataset, the distribution is more likely to be normal: the frequency of the MAD values rises to a peak and then falls away again, with values ranging between 0.017 and 0.020.

It can be seen that the data points in the simulation move back toward a normal distribution because the data point values deviate less from their MAD values. As the number of simulations increases, the distance between the data points and the MAD decreases when the samples comprise more numbers.

This behavior can be explained by the reduced variation and the higher probability of the same number appearing again in the simulations. The same point can be seen from the distance between adjacent data points in the diagram, such as 0.017, 0.018, 0.019, 0.020, and 0.021. The figure strongly suggests that the data have begun to spread evenly toward both edges, resembling a normal distribution of the MAD.

Figure 2: Benford’s law and the Central Limit Theorem with n = 10. Source: Franco Arda (2020).

As with the results for n = 10, in Figure 3 below a sample of 100 numbers was drawn from the population of 4,000 invoices. Across more than 300 simulations with n = 100, the Mean Absolute Deviation (MAD) showed an approximately normal distribution. The distribution confirms that with a sample of 100 numbers from a large invoice dataset, the distribution is more likely to be normal: the frequency of the MAD values rises to a peak and then falls away again.

This behavior can be explained by the CLT: with larger samples, the deviation among the sampled invoice values decreases, and the MAD values here range from 0.006 to 0.010. Similarly, the data points in the simulation form a histogram that is very close to normal because the data point values deviate only minimally from their MAD values.

As the number of simulations increases, the distance between the data points and the MAD decreases and becomes negligible. This is because of the reduced variation and the increased probability of the same number appearing again in the simulations. The same point can be seen from the distance between adjacent data points in the diagram, such as 0.006, 0.007, 0.008, 0.009, and 0.010.

Applying the CLT to these figures shows how, as the sample size increases, the histogram tends toward its eventual shape: a nice curve around the center that falls off almost evenly toward both edges.

It is clear from the quantitative configuration of the data how the CLT relates to the criteria for Benford behavior. With the large sample size, skewness did not increase; instead, the values concentrated more strongly around the center. Likewise, the order of magnitude did not increase beyond the existing maximum order of magnitude (Kossovsky, 2019).

Figure 3: Benford’s law and the Central Limit Theorem with n = 100. Source: Franco Arda (2020).

These figures help explain the application of Benford’s Law and the CLT to a large dataset of invoices in this research.

However, the spread can be examined in more detail by applying the theoretical assumptions and mathematical statements of the two approaches in the next section of the report.

In this section, the spread of the distribution of the invoice dataset is analyzed to answer several questions using Benford’s Law and the Central Limit Theorem together. All three figures have confirmed the effectiveness of using the CLT on a dataset that satisfies the conditions of Benford’s Law.

As the sample size increases, the sampling distribution tends toward the normal distribution even when the individual datasets are not normally distributed. This can be further explained by answering several questions.

First, the researcher addressed why the datasets behave similarly in some respects and differently in others. The similarity can be explained with the help of Benford's law: similarities in the leading (first) digits of the numbers shape the similarities between the datasets. In the detailed analysis of the histograms above, Figures 2 and 3 show similar behavior, as these datasets, with n = 10 and n = 100, were spread out on a logarithmic plot over several orders of magnitude.

The normal distribution, also known as the Gaussian distribution, is one of the most significant components of probability theory and helps in understanding real-valued random variables. Its significance cannot be overstated, as it is used to represent real-valued random variables in the natural and social sciences. The importance of the normal distribution is associated with the central limit theorem.

Physical quantities that are expected to be the sum of many independent processes often have distributions that are nearly normal. The unique properties of Gaussian distributions include the fact that any linear combination of a fixed collection of normal deviates is itself a normal deviate. The spread analysis of the invoices in this research also rests on this mathematical support: under some conditions, the average of many samples of a random variable with finite mean and variance is itself a random variable whose distribution converges to a normal distribution as the number of samples increases.

Another useful property is that many results and methods, such as propagation of uncertainty and least-squares parameter fitting, can be derived analytically in explicit form when the relevant variables are normally distributed. Hence, a normal distribution with its bell curve is helpful in understanding the situation.

The investigation has confirmed that with only one sample (n = 1), we do not get a normal distribution but a slightly skewed one, due to the high mean deviation. However, if we take ten or even 100 random samples, the distribution becomes approximately normal. This is the Central Limit Theorem (CLT) at work: the distribution approaches normality as the samples grow larger, even though our population is not normally distributed.

The study has provided a detailed analysis of the theory behind the normal distribution as it relates to many datasets, each of which can be positively or negatively skewed. Understanding the integration of the CLT and Benford’s law is the real essence of the current research. The mathematical statements of both the theorem and the law explain significant randomization processes associated with real-life data.

As a final note, the discussion in this chapter has explained how effective the CLT is when invoices are sampled randomly from the population. These findings can be used in the next chapter to quantify normality using a statistical method called the Shapiro-Wilk normality test.