“An expert is one who knows more and more about less and less until he knows absolutely everything about nothing.”


Whether the study database consists of one or many tables and whether it uses spreadsheet, statistical, or database management software, a mechanism for populating the data tables (entering the data) is required.

Keyboard Transcription

Historically, the common method for populating a study database has been to first collect data on paper forms. In clinical trials, a paper data collection form corresponding to a specific subject is commonly called a case report form or CRF. The investigator or a member of the research team may fill out the paper form or, in some cases, the subject himself fills it out. Study personnel can then transcribe the data via keyboard from the paper forms into the computer tables. Transcription can occur directly into the data tables (e.g., the response to question 3 on subject 10 goes into the cell at row 10, column 3) or via on-screen forms designed to make data entry easier and including automatic data validation checks. Transcription should occur as soon as possible after data collection, so that the subject and interviewer or data collector are still available if responses are found to be missing or out of range. Also, as discussed later in this chapter, monitoring for data problems (e.g., outlier values) and preliminary analyses can only occur once the data are in the computer database.
If transcribing from paper forms, the investigator may consider double data entry to ensure the fidelity of the transcription. The database program compares the two values entered for each variable and presents a list of values that do not match. Discrepant entries are then checked on the original forms and corrected. Double data entry identifies data entry errors at the cost of doubling the time required for transcription. An alternative is to double-enter a random sample of the data. If the error rate is acceptably low, double data entry is unlikely to be worth the effort and cost for the remaining data.
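The comparison step in double data entry can be sketched in a few lines of code. This is a hypothetical illustration, not part of the text; the file layout and field names (e.g., `subject_id`) are invented, and a real database program would do this comparison internally.

```python
# Hypothetical sketch: compare two independent transcriptions of the same
# records and list every field whose values disagree, so the discrepancies
# can be checked against the original paper forms.
import csv

def find_discrepancies(path_a, path_b, key="subject_id"):
    """Return (key, field, value_a, value_b) for every mismatched cell."""
    with open(path_a, newline="") as fa, open(path_b, newline="") as fb:
        rows_a = {r[key]: r for r in csv.DictReader(fa)}
        rows_b = {r[key]: r for r in csv.DictReader(fb)}
    mismatches = []
    for k, row_a in rows_a.items():
        row_b = rows_b.get(k)
        if row_b is None:
            continue  # record missing from the second entry; handle separately
        for field, value_a in row_a.items():
            if row_b.get(field) != value_a:
                mismatches.append((k, field, value_a, row_b.get(field)))
    return mismatches
```

The same function could be run on a random sample of double-entered records to estimate the transcription error rate before deciding whether full double entry is worthwhile.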

Distributed Data Entry

If data collection occurs at multiple sites, the sites can e-mail or fax paper forms to a central location for transcription into the computer database, but this practice is increasingly rare. More commonly, the data are transcribed at the sites directly into the study database via online forms. If Internet connectivity is a problem, data are stored on a local computer at the site and transmitted online or via a portable memory device such as a USB drive. Government regulations require that electronic health information be either de-identified or transmitted securely (e.g., encrypted and password-protected).

Electronic Data Capture

Primary data collection on paper will always have its place in clinical research; pen and paper remain a fast, user-friendly way to capture data on a nonvolatile medium. However, handwriting data onto a paper form is increasingly rare. In general, research studies should collect data primarily using online forms. In clinical trials, electronic forms are called electronic case report forms (eCRFs). Data entry via online forms has many advantages:
  • The data are keyed directly into the data tables without a second transcription step, removing that source of error.
  • The computer form can include validation checks and provide immediate feedback when an entered value is out of range.
  • The computer form can also incorporate skip logic. For example, a question about packs per day appears only if the subject answered “yes” to a question about cigarette smoking.
  • The form may be viewed and data entered on portable wireless devices such as a tablet (iPad), smartphone, or notebook computer.
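The first three advantages above can be sketched in code. This is a hypothetical illustration only; the field names (`smokes_cigarettes`, `packs_per_day`) and range limits are invented, and a real eCRF system would implement these behaviors in its form builder.

```python
# Hypothetical sketch of two online-form behaviors: a range check that
# gives immediate feedback, and skip logic that reveals the packs-per-day
# question only after a "yes" answer to the smoking question.
def validate_range(value, low, high):
    """Return an error message if the value is out of range, else None."""
    if not (low <= value <= high):
        return f"Value {value} is outside the allowed range {low}-{high}"
    return None

def visible_questions(answers):
    """Apply skip logic: ask about packs per day only for smokers."""
    questions = ["smokes_cigarettes"]
    if answers.get("smokes_cigarettes") == "yes":
        questions.append("packs_per_day")
    return questions
```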
When using online forms for electronic data capture, it sometimes makes sense to print out a paper record of the data immediately after collection. This is analogous to printing out a receipt after a transaction at an automated teller machine. The printout is a paper “snapshot” of the record immediately after data collection and may be used as the original or source document if a paper version is required.

Coded Responses Versus Free Text

Defining a variable or field in a data table includes specifying its range of allowed values. For subsequent analysis, it is preferable to limit responses to a range of coded values rather than allowing free text responses. This is the same as the distinction made in Chapter 15 between “closed-ended” and “open-ended” questions. If the range of possible responses is unclear, initial data collection during the pretesting of the study can allow free text responses that will subsequently be used to develop coded response options.
The set of response options to a question should be exhaustive (all possible options are provided) and mutually exclusive (no two options can both be correct). A set of mutually exclusive response options can always be made collectively exhaustive by adding an “other” response. Online data collection forms provide three possible formats for displaying the mutually exclusive and collectively exhaustive response options: drop-down list, pick list (field list), or option group (Figure 16.5). These formats will be familiar to any research subject or data entry person who has worked with an online form. Note that the drop-down list saves screen space but will not work if the screen form will be printed to paper for data collection, because the response options will not be visible.

Figure 16.5 Formats for entering from a mutually exclusive, collectively exhaustive list of responses
A question with a set of mutually exclusive responses corresponds to a single field in the data table. In contrast, the responses to an “All that apply” question are not mutually exclusive. They correspond to as many yes/no fields as there are possible responses. By convention, response options for “All that apply” questions use square check boxes rather than the round radio buttons used for option groups with mutually exclusive responses. As discussed in Chapter 15, we discourage “All that apply” questions and prefer to require a yes or no response to each item. Otherwise an unmarked response could either mean “does not apply” or “not answered.” In coding yes/no (dichotomous) variables, make 0 represent no or absent, and 1 represent yes or present. With this coding, the average value of the variable is interpretable as the proportion with the attribute.
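The convenience of 0/1 coding is easy to demonstrate: the mean of the coded variable is the proportion of subjects with the attribute. A small illustrative example (the data are invented):

```python
# With no = 0 and yes = 1, the mean of the variable equals the
# proportion of subjects with the attribute.
smoker = [0, 1, 0, 0, 1]              # 2 of 5 subjects smoke
proportion = sum(smoker) / len(smoker)  # 0.4, i.e., 40% smokers
```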

Importing Measurements and Laboratory Results

Much study information, such as baseline demographic information in the hospital registration system, lab results in the laboratory’s computer system, and measurements made by dual energy x-ray absorptiometry (DEXA) scanners and Holter monitors, is already in digital electronic format. Where possible, these data should be imported directly into the study database to avoid the labor and potential transcription errors involved in re-entering data. For example, in the study of infant jaundice, the demographic data and contact information are obtained from the hospital database. Computer systems can almost always produce tab-delimited or fixed-column-width text files that the database software can import. In clinical trials, this type of batch-uploaded information is referred to as “non-CRF (case report form) data” (1).
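Importing a tab-delimited export is straightforward in most environments. As a hypothetical sketch (the file name and column headers are invented; real laboratory exports will differ), reading such a file into rows ready for loading into a study table might look like this:

```python
# Hypothetical sketch: parse a tab-delimited export from a laboratory
# system into a list of dictionaries, one per record, keyed by the
# column headers in the file's first line.
import csv

def read_lab_export(path):
    """Read a tab-delimited text file into a list of row dictionaries."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f, delimiter="\t"))
```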

Data Management Software

Now that we have discussed data tables and data entry, we can make the distinction between the study database’s back end and front end. The back end consists of the data tables themselves. The front end or “interface” consists of the online forms used for entering, viewing, and editing the data. Table 16.1 lists some software applications used in data management for clinical research. Simple study databases consisting of a single data table can use spreadsheet or statistical software for the back-end data table and the study personnel can enter data directly into the data table’s cells, obviating the need for front-end data collection forms. More complex study databases consisting of multiple data tables require relational database software to maintain the back-end data tables. If the data are collected first on paper forms, entering the data will require transcription into online forms.
Table 16.1 Some Software Used in Research Data Management
As discussed in Chapter 15, several tools, including SurveyMonkey, Zoomerang, and Qualtrics, exist for developing online surveys that will be e-mailed to study participants or posted on the study website. All of these tools provide multiple question format options, skip logic, and the capability to aggregate, report on, and export survey results.
Some statistical packages, such as SAS, have developed data entry modules. Integrated desktop database programs, such as Microsoft Access and Filemaker Pro, also provide extensive tools for the development of on-screen forms.
Research studies increasingly use integrated, Web-enabled, research data management platforms. REDCap (Research Electronic Data Capture) is a Web-based research data collection system developed by an academic consortium based at Vanderbilt University. It enables researchers to build data entry forms, surveys, and surveys with attached data entry forms. REDCap is made available to academic investigators only and must be hosted at the investigator’s institution. This is an outstanding “do-it-yourself” tool for beginning academic investigators that allows rapid development of surveys and on-screen data collection forms. It also provides access to a repository of downloadable data collection instruments. As with all do-it-yourself Web development tools, options for customization and advanced functionality are limited. A REDCap database consists of a single table containing one row for each of a fixed number of user-defined “events” per study subject. It does not permit detailed tracking of a large and variable number of repeated measurements per study subject, such as lab results, vital signs, medications, or call logs. REDCap also cannot do sophisticated data validation, querying (see later in this chapter), or reporting, but does make export into statistical packages easy.
Full-featured, Web-based research data management platforms such as QuesGen, MediData RAVE, or Oracle InForm can accommodate complex data structures and provide sophisticated data validation, querying, and reporting. The companies that provide these tools also provide support and configuration assistance. While there may be some additional cost involved, these solutions are worth considering when the do-it-yourself tools lack the sophistication to meet the study’s requirements.


Once the database has been created and data entered, the investigator will want to organize, sort, filter, and view (“query”) the data. Queries are used for monitoring data entry, reporting study progress, and ultimately analyzing the results. The standard language for manipulating data in a relational database is called Structured Query Language or SQL (pronounced “sequel”). All relational database software systems use one or another variant of SQL, but most provide a graphical interface for building queries that makes it unnecessary for the clinical researcher to learn SQL.
A query can join data from two or more tables, display only selected fields, and filter for records that meet certain criteria. Queries can also calculate values based on raw data fields from the tables. Figure 16.6 shows the results of a query on our infant jaundice database that filters for boys examined in February and calculates age in months (from birth date to date of exam) and BMI (from weight and height). The query also uses a sophisticated table-lookup function to calculate growth curve percentile values for the child’s BMI. Note that the result of a query that joins two tables, displays only certain fields, selects rows based on special criteria, and calculates certain values still looks like a table in datasheet view. One of the tenets of the relational database model is that operations on tables produce table-like results. The data in Figure 16.6 are easily exported to a statistical analysis package. Note that no personal identifiers are included in the query.
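A query of the kind just described can be sketched using Python’s built-in SQLite engine. This is a hypothetical illustration only: the table and column names are invented, the data are dummy values, and a real study database would be larger and use its own schema.

```python
# Hypothetical sketch: a SQL query that joins two tables, filters for
# boys examined in February, and calculates BMI from weight and height.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subjects (subject_id INTEGER PRIMARY KEY, sex TEXT);
    CREATE TABLE exams (subject_id INTEGER, exam_date TEXT,
                        weight_kg REAL, height_cm REAL);
    INSERT INTO subjects VALUES (10, 'M'), (11, 'F');
    INSERT INTO exams VALUES (10, '2012-02-15', 18.0, 100.0),
                             (11, '2012-02-20', 17.5, 104.0);
""")
# The join, filter, and calculated field all happen in one SQL statement;
# the result still looks like a table and includes no personal identifiers.
rows = conn.execute("""
    SELECT s.subject_id,
           e.weight_kg / (e.height_cm / 100.0) / (e.height_cm / 100.0) AS bmi
    FROM subjects s JOIN exams e ON s.subject_id = e.subject_id
    WHERE s.sex = 'M' AND strftime('%m', e.exam_date) = '02'
""").fetchall()
```

In practice a graphical query builder generates statements like this behind the scenes, which is why most clinical researchers never need to write SQL by hand.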
Identifying and Correcting Errors in the Data
The first step toward avoiding errors in the data is testing the data collection and management system as part of the overall pretesting for the study. The entire system (data tables, data entry forms, and queries) should be tested using dummy data. For clinical trials that will be used in an FDA submission, this is a regulatory requirement under Code of Federal Regulations, Chapter 21, Part 11 (21 CFR 11) (9).
We have discussed ways to enhance the fidelity of keyboard transcription or electronic data capture once data collection begins. Values that are outside the permissible range should not get past the data entry process. However, the database should also be queried for missing values and outliers (extreme values that are nevertheless within the range of allowed values). For example, a weight of 35 kg might be within the range of allowed values for a 5-year-old, but if it is 5 kg greater than any other weight in the data set, it bears investigation. Many data entry systems are incapable of doing cross-field validation, which means that the data tables may contain field values that are within the allowed ranges but inconsistent with one another. For example, it would be highly unlikely for a 35 kg 5-year-old to have a height of 100 cm. While the weight and height values are both within the allowed ranges, the weight (extremely high for a 5-year-old) is inconsistent with the height.
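The two checks described above can be sketched in code. This is a hypothetical illustration: the gap threshold and the plausible BMI range are invented for the example, and a real study would use age-specific reference values.

```python
# Hypothetical sketch of two data-quality checks: flagging an outlier
# (a value within the allowed range but far above the rest of the data)
# and a cross-field consistency check on weight versus height.
def flag_outliers(values, gap=5):
    """Flag the maximum value if it exceeds the next-highest by > gap."""
    s = sorted(values)
    return [s[-1]] if len(s) > 1 and s[-1] - s[-2] > gap else []

def weight_height_inconsistent(weight_kg, height_cm):
    """Crude cross-field check: flag a BMI outside a plausible range."""
    bmi = weight_kg / (height_cm / 100) ** 2
    return bmi < 10 or bmi > 30
```

For the example in the text, a 35 kg child with a height of 100 cm has a BMI of 35, which the cross-field check flags even though each value passes its own range check.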

