Data organization


Teaching data literacy: Module 2

Data organization process

[Diagram: the four-step data organization process]

Data organization has four or five steps, depending on whether you treat sourcing and collection as one step or two. At larger scale, each step tends to become more complex. In most classroom applications, you can treat sourcing and collection together, as I have done here. The four steps are:

  1. Data identification
  2. Data storage
  3. Data sourcing and collection
  4. Data preparation

The process of organizing data is analogous to the preparatory work a restaurant does in its kitchen before it opens to customers. The work will strike some students as boring and tedious, and they won't be wrong in that assessment; but it is an absolutely vital part of the process. The computer science aphorism "garbage in, garbage out" refers specifically to the quality of the data used in statistical analysis: organization must be done correctly for any project to work. In an organizational setting, each stage of the process requires specialized skills. To show students that even tedious work has its rewards, we begin this module with a brief video that describes career opportunities in the rapidly growing field of data science. It may not be relevant for elementary school students, but it is valuable to share with secondary school students who may consider a career as a data scientist as they begin to think about their plans for higher education. Here is the video.

Finally, to understand the magnitude of the problem we face as a society in raising students' understanding of data, please read the following academic article by Dan Zalles of SRI International's Center for Technology and Learning that reports on assessments of students' data literacy.

For more articles, visit the Course library.

Step 1: Data identification

Learning outcome
Groups identify data required for analysis
Data comes in many forms and formats. While much of it is quantitative, just as much or more begins as something physical. In many instances, the quantitative data used in statistical analysis comprises measurements of physical objects or subjective judgments of experience. For example, paleontologists who study dinosaur fossils must perform many physical and chemical analyses of those fossils to produce quantitative data that they can then analyze statistically. Opinion surveys that ask consumers to rate a shopping experience, a service, or a vacation may ask those consumers to convert their feelings to a number: "On a scale of 1-10, with 10 being the best, how would you rate the service you received at a restaurant?" So while most data eventually ends up in quantitative form, much of it does not start out that way.
Task (A) immediately below contains an abbreviated list of data types that students might encounter in an academic setting. (B) and (C) explain what they should do to document each item they need for a project.
A. Identify all the data types required for each problem
1. Quantitative data
a. collected by polls, surveys, or interviews
b. downloaded from online sources
c. computed from other data
d. live feed (advanced students only)
2. Textual data
a. paper documents
b. electronic files
3. Image data
a. physical
b. electronic files
B. Describe each item of data required for each step of a problem in its Problem Process Table.
C. List the format of each item of data to be collected, from the choices in (A) above, in the Problem Process Table.
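The Problem Process Table described in (B) and (C) can be modeled very simply in code. The sketch below is illustrative only: the field names and sample data items are assumptions, not a format prescribed by this module.

```python
# A minimal sketch of a "Problem Process Table" as a list of dictionaries.
# Each row records one data item, its type, and its collection format,
# drawn from the choices in (A) above. Sample items are hypothetical.
problem_process_table = [
    {"step": 1, "item": "survey responses", "data_type": "quantitative",
     "format": "collected by polls, surveys, or interviews"},
    {"step": 1, "item": "news articles", "data_type": "textual",
     "format": "electronic files"},
    {"step": 2, "item": "field photos", "data_type": "image",
     "format": "physical"},
]

def items_by_type(table, data_type):
    """Return the names of all data items of a given type."""
    return [row["item"] for row in table if row["data_type"] == data_type]

print(items_by_type(problem_process_table, "quantitative"))  # ['survey responses']
```

Even this simple structure lets students check that every item needed for a problem has a documented type and format before collection begins.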

Step 2: Data storage

Learning outcome
Groups decide how they will store their data
Before it can be analyzed, data must be stored. The method of storage depends on several factors listed below. The storage decision also affects how each data item is used in the analysis covered in Module 3 of this mini-course. For example, if the final result of the analysis is a set of standard charts and graphs, it may make sense to store data in a spreadsheet capable of producing those graphs. If, on the other hand, the purpose of the analysis is to produce a standard set of reports on a schedule, with data that changes over time, then a database is the best option. For simple exercises that one might conduct with elementary school students, a notebook or plain paper may be sufficient.
A. Determine variables affecting storage medium
1. Quantity of data
2. Form in which data will be stored
a. paper
b. electronic file
c. database
3. Duration of usage
4. Expertise of students with different storage options
B. Record medium for each data item in Problem Process Table
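The factors in (A) can be combined into a rough decision rule. The function below is a sketch under assumed thresholds (the row count of 50, for instance, is arbitrary); the module itself does not prescribe cutoffs, so treat this as one possible way to make the trade-offs concrete for students.

```python
def suggest_storage(n_rows, needs_recurring_reports, students_know_spreadsheets):
    """Suggest a storage medium from the factors listed above.

    Thresholds and priorities here are illustrative assumptions,
    not rules from the module.
    """
    # Recurring, scheduled reports over changing data favor a database.
    if needs_recurring_reports:
        return "database"
    # Very small exercises can live on paper or in a notebook.
    if n_rows < 50:
        return "paper"
    # Otherwise match the medium to student expertise.
    if students_know_spreadsheets:
        return "spreadsheet"
    return "electronic file (e.g. CSV)"

print(suggest_storage(n_rows=200, needs_recurring_reports=False,
                      students_know_spreadsheets=True))  # spreadsheet
```

Students can then record the suggested medium for each item in their Problem Process Table, overriding it where their judgment differs.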

Step 3: Sourcing and collection

Learning outcome
Groups identify sources of data and collect it
Just as data comes in many forms, so too does it come from many sources. Whether a project requires analysts to collect data only once or repeatedly, it is critical to identify the source and verify its reliability. This is especially important in situations of repeated use. For example, imagine a sports reporter who must report the results of daily sporting events shortly after they end. Reporters now rely on electronic sources for updated statistics on the athletes they cover, and they expect those statistics to be updated almost instantaneously. A much more serious application is the air traffic control system. Travelers who must change flights en route to a destination are monitored by their airlines to determine whether they will arrive in time to make their connections. Airlines use that real-time location data to decide whether passengers will make their connecting flights or whether the airline must re-book those passengers on later flights and resell their original seats. Sourcing data may seem boring, but it is absolutely critical to the entire process.
A. Identify data sources by type:
1. collected by students
2. non-electronic public sources
3. electronic public sources
4. private sources
B. Record data source for each data item in Problem Process Table
C. Collect and store data
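Task (C), collecting and storing data, can be demonstrated with Python's standard `csv` module. The sketch below uses an in-memory buffer and hypothetical temperature readings; a real classroom project would write to an actual file path instead.

```python
import csv
import io

# Hypothetical data collected by students (source type 1 above):
# daily temperature readings as (date, degrees Celsius) pairs.
collected = [("2024-03-01", 12.5), ("2024-03-02", 13.1)]

# Store the data as CSV. An in-memory buffer stands in for a real file.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["date", "temp_c"])  # header row documents each column
writer.writerows(collected)

# Re-read the stored data to confirm it round-trips intact.
buf.seek(0)
rows = list(csv.reader(buf))
print(rows[0])  # ['date', 'temp_c']
```

Writing and then immediately re-reading the data is a simple habit that catches storage mistakes at collection time rather than during analysis.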

Step 4: Data preparation

Learning outcome
Students clean and verify data
The final step in the process of organizing data is to verify it and, if necessary, repair it. Data, especially when it is quantitative and delivered electronically, seems like it must be fine. Often it is not. In the field of data science, bad data is called "corrupt." While much data is generated by computer programs that make few errors, much is still entered into systems by human beings, who make many mistakes. Data entry is neither interesting nor fun, but often it is the only way to put information into a form in which it can be analyzed. Even machine-generated data can become corrupted when it passes from one system to another: the receiving database must know exactly how to interpret what it receives from the sending database, and often this does not work perfectly. K-12 data science projects, with their finite and relatively small data sets, give future data scientists a good opportunity to practice the skills required to identify and repair corrupt data.
A. Review data for errors
1. missing data in time series
2. letters mixed in with numbers
3. symbols mixed up with alphanumeric characters
4. data types that will not work with available analytical tools, e.g. old file formats no longer supported by current technology
B. Correct errors
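The checks in (A) can be automated for small data sets. The function below is a minimal sketch covering two items from the checklist, missing entries and non-numeric characters mixed into numeric data; the sample readings (including the letter "O" masquerading as a zero) are invented for illustration.

```python
def find_errors(values):
    """Flag two corruption patterns from the checklist above:
    missing entries, and letters or symbols mixed into numeric data."""
    errors = []
    for i, v in enumerate(values):
        if v is None or str(v).strip() == "":
            errors.append((i, "missing"))
        else:
            try:
                float(v)  # valid numeric data converts cleanly
            except ValueError:
                errors.append((i, "non-numeric"))
    return errors

# Hypothetical temperature readings: "13.O" contains the letter O, not a zero.
readings = ["12.5", "", "13.O", "14.2", None]
print(find_errors(readings))  # [(1, 'missing'), (2, 'non-numeric'), (4, 'missing')]
```

Having students write or run a check like this makes the point that error detection can be systematic, even though correcting each flagged value (task B) still requires human judgment.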

Assessing Module 2

Data organization is a difficult but necessary set of practices that data scientists must master and incorporate into their skill sets. Most if not all of it draws on knowledge and understanding, the lower categories of Bloom's taxonomy of learning. It is mentally taxing because it demands close attention to minute details: in most instances errors are rare and therefore difficult to identify unless they are glaring. Many tools exist to verify and clean data, but even these require some expertise to operate. Assessments of students' understanding of these activities should focus on determining their knowledge of potential problems with data and the ways to fix them.

Go back to Module 1 of Teaching data literacy: Idea formation & abstraction
Continue to Module 3 of Teaching data literacy
Jump to Module 4 of Teaching data literacy: Dynamic Data Analysis
Return to home page of Teaching data literacy