What is the "dirty secret about data"?

Bob · April 27, 2020, 3:24am

The “dirty secret about data” is that the data is always dirty.

The FCC data is self-reported by service providers, who have incentives to over-report.
The Census data comes from field surveys that cover much, but not all, of every locality, and is subject to how respondents interpret each question. For example, there’s a question about whether there is a laptop or desktop computer in the home. Does this include laptops sent home from a school in a one-for-one program? We can’t tell from how the question was worded. In addition, the Census aggregates data in specific ways, so it’s hard to answer questions about other combinations.
M-Lab speed test data is also self-reported.
Almost all data sets have a geographic level at which they are accurate. These differences become limitations when combining the data into a single view. For example, most Census data is available for Census tracts, sometimes for Census block groups, but not for Census blocks. The FCC data does go down to the Census block level, but the closest proxy for demographic data in a block is its block group.
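
To make the block versus block group mismatch concrete, here is a minimal Python sketch of one way block-level records could be matched to block-group-level demographics. It relies on the standard Census GEOID layout, in which a 15-digit block GEOID begins with the 12-digit GEOID of its block group. The function names, field names, and sample values are hypothetical and chosen only for illustration; this is not necessarily how the tool itself does the join.

```python
def block_to_block_group(block_geoid: str) -> str:
    """Return the 12-digit block group GEOID that contains a 15-digit block GEOID.

    GEOID layout: state (2) + county (3) + tract (6) + block (4).
    The first digit of the block code is the block group number.
    """
    if len(block_geoid) != 15 or not block_geoid.isdigit():
        raise ValueError(f"expected a 15-digit block GEOID, got {block_geoid!r}")
    return block_geoid[:12]


def attach_block_group_stats(block_rows, stats_by_block_group):
    """Attach block-group demographics to block-level rows (hypothetical field names).

    Every joined row is flagged as approximate, because the demographics
    describe the whole block group rather than the individual block.
    """
    joined = []
    for row in block_rows:
        bg = block_to_block_group(row["block_geoid"])
        joined.append({
            **row,
            "block_group_geoid": bg,
            "demographics": stats_by_block_group.get(bg),  # None if no match
            "demographics_are_approximate": True,
        })
    return joined


# Made-up sample values, for illustration only.
fcc_rows = [{"block_geoid": "010010201001000", "providers": 2}]
census_stats = {"010010201001": {"households": 250}}
result = attach_block_group_stats(fcc_rows, census_stats)
```

Flagging the joined rows as approximate keeps the caveat attached to the data: the block group figure is a proxy for the block, not a measurement of it.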

One of the goals of this tool is to remind you, the user, of where the data is approximate, and where heuristics are being applied. Nothing substitutes for your good judgement.