Digital Writing: Assessment and Evaluation

Chapter 1

Making Digital Writing Assessment Fair for Diverse Writers

Mya Poe

ABSTRACT

The move to large-scale digital writing assessment brings about considerations related to validity, reliability, and fairness. In this chapter, I discuss the issue of fairness in digital writing assessment and suggest possible ways that considerations of fairness may be brought into the design and use of digital writing assessment. Although there are many sources where we may look for guidance on fairness in assessing digital writing, a useful starting point is the Standards for Educational and Psychological Testing. The Standards offer agreed-upon principles for the design, administration, and interpretation of educational and psychological tests and have been adopted by writing assessment scholars. In this chapter, I explain several of the 12 fairness guidelines offered in the Standards and explain their significance for digital writing assessment. In the end, I argue that digital writing assessment must go beyond the inductive design of rubrics to include theoretically informed fairness inquiries to ensure that we are working to make digital writing assessment equitable for all students.

Studies by the Pew Internet and American Life Project (Lenhart, Arafeh, Smith, & MacGill, 2008) as well as projects such as the Stanford Study of Writing (2008) remind us that digital composing is omnipresent in the lives of many young people today. As a consequence, composition studies scholars—well aware of the need to bridge digital literacies and academic literacies—have explored numerous ways to integrate multimodal, networked, and other digital technologies in writing classrooms (see, for instance, Kimme Hea, 2009; Selfe, 2007; Wysocki, Johnson-Eilola, Selfe, & Sirc, 2004). While digital technologies have undoubtedly brought enormous potential in expanding the teaching of writing (NCTE, 2008; New London Group, 1996; Selber, 2004; Selfe, 1999), how to assess digital writing has raised new questions, such as how digital media change the possibilities of composing and, thus, the construct of writing (Yancey, 2004); how readers interact with digital texts (Takayoshi, 1996); and how we can link digital writing assessment to student learning (Huot, 1996a). Digital writing assessment also raises issues related to fairness, especially in assessing the digital writing of diverse groups of students.

Issues of fairness are particularly important in large-scale digital writing assessment, assessments in which large numbers of students are assessed across departments, institutions, states, or countries often for diagnostic or proficiency purposes (Whithaus, 2005).¹ Large-scale assessments, like placement or proficiency testing, often dictate student ability to access educational resources or advance through the higher educational system. Moreover, as Michael Neal (2011) pointed out, states are moving to include digital writing in state-level testing programs:

Digital technologies are embedded within the fabric of many composition outcomes and state writing standards that steer instruction and assessment of student writing. In my home state of Florida, the Sunshine State Standards affirm that a student “selects and use a variety of electronic media, such as the Internet, information services, and desk-top publishing software programs, to create, revise, retrieve, and verify information.” (9–12 Grade Language Arts, LA.B.2.4.4) (p. 4)

Similarly, the Common Core State Standards for English Language Arts, upon which the Florida Sunshine State Standards were based, include digital writing, defining college and career readiness not only in terms of student ability to analyze audience and use evidence, but also in the strategic and capable use of technology and digital media:

Students employ technology thoughtfully to enhance their reading, writing, speaking, listening, and language use. They tailor their searches online to acquire useful information efficiently, and they integrate what they learn using technology with what they learn offline. They are familiar with the strengths and limitations of various technological tools and mediums and can select and use those best suited to their communication goals. (p. 7)

As such examples illustrate, digital writing assessment is both becoming commonplace and laden with a range of ideological values. As a result, making digital writing assessment fair means attending to the construct and consequences of digital writing.

Yet, although composition scholars have investigated digital writing assessment issues related to mode, collaboration, disciplinary knowledge, and development of grading criteria (Adsanatham, 2012; Murray, Sheets, & Williams, 2010; Remley, 2011, 2012; Sorapure, 2006), and eportfolio scholarship includes a range of studies related to student learning (Cambridge, Cambridge, & Yancey, 2009), we have few validation studies of digital writing assessment to tell us about the impact of those assessments on students of color, working class students, and students with disabilities. Instead, research on digital writing assessment has conventionally adopted a color-blind, homogenizing approach. For example, the January/March 2012 issue of Technical Communication Quarterly features four approaches on the assessment of digital writing, none of which engage with the topic of diversity. Instead, the TCQ contributors focus on the development of assessment criteria from spaces in which few guidelines are available; for example, Matt Morain and Jason Swarts (2012) used descriptive data “from a constant comparative study of user-rated YouTube videos” (p. 13) to develop a scoring rubric that includes physical, cognitive, and affective dimensions. This bottom-up approach to rubric design is correct in that it seeks to tie assessment to what is valued in a specific assignment, but it flattens user identity. Notably missing in the Morain and Swarts article is a discussion of the raced assumptions in such tutorials—for instance, who is the author of instructional content?

Scholars, such as Kathleen Blake Yancey, who have studied the potential consequences of large-scale digital writing assessment for diverse students offer compelling findings. Yancey (2009) wrote that research conducted by the Inter/National Coalition for Electronic Portfolio Research group showed higher levels of student engagement when reflection was included in eportfolios; at LaGuardia Community College, Yancey noted “engagement is translating into higher course completion and retention rates” (p. 13). That’s good news for institutions like LCC, whose student body is 50% Latino/a, more than 20% Asian, and almost 20% black (Office of Institutional Research, 2011).

My critique here is not to suggest that digital writing scholars are unaware of how access, use, and attitudes toward technology shape student interactions with digital technology and ultimately the kinds of writing they produce (Banks, 2006; Redd, 2003) or that they are remiss in thinking about issues of diversity and access (Grabill, 2003; Powell, 2007). Indeed, digital writing scholars have certainly been enormously active in considering the ethical issues involved in researching and teaching with digital media (Coley, 2012; DeVoss & Porter, 2006; McKee, 2008; McKee & DeVoss, 2007). However, our understanding of these issues has not translated into research on digital writing assessment, especially validity studies of large-scale digital writing assessment. That disjuncture gives me pause as an assessment researcher; we cannot design fair assessments and make meaningful decisions about assessment outcomes if we do not have research to support the assumptions we are making about digital literacies.

In this chapter, I discuss large-scale digital writing assessment of diverse student populations through the framework of fairness.² In what follows, I first describe two ways that assessment and technology have been theorized. I then discuss several important shifts in contemporary writing assessment theory concerning validity as found in the measurement community (O’Neill, 2011; Moss, 1992; Weigle, 2002; Williamson & Huot, 1993). I also detail how fairness is integral to validity, meaning that no assessment can be considered usable unless it is also deemed fair. Although this connection to educational measurement theory may initially seem odd given our field’s contentious history with measurement, it is important to recognize that certain shared theoretical perspectives are now common in writing assessment scholarship (Behizadeh & Engelhard, 2011; Moss, 1992). Those shared theoretical perspectives, especially on validity and fairness, are paramount to acknowledge in composition research on digital writing assessment. Finally, drawing on the Standards for Educational and Psychological Testing, I identify specific approaches that we can use in gathering and interpreting evidence from digital writing assessments that will help inform fair, valid decisions about those assessment outcomes.

In the following discussion, many of my examples include ethnically diverse populations; however, I do not wish to suggest that racial or ethnic diversity is the only form of diversity that we should attend to in digital writing assessment research. If we are to embrace the interplay of construct, consequence, and fairness in digital writing assessment, then we must attend to the specific forms of diversity found within a specific context and pay special attention to those groups most likely to suffer from poor decisions made from writing assessments.

WRITING ASSESSMENT AND TECHNOLOGIES

In Writing Assessment and the Revolution in Digital Texts and Technologies, Michael Neal (2011) traced two themes: (1) writing assessment as technology, and (2) writing assessment with technology. Writing assessment as technology has been a popular framework in assessment scholarship for two decades. Composition scholars like Brian Huot (1996b), Peggy O’Neill (1998), and Asao Inoue (2009) draw from George Madaus (1993) and F. Allan Hanson (1993) in framing assessment as a technology. Madaus explained that testing fits even “very simple definitions of technology—the simplest being something put together for a purpose, to satisfy a pressing and immediate need, or to solve a problem” (pp. 12–13). Unlike other technologies that are accessible to the public, however, assessment technologies may be widely applied but not widely transparent because of the testing community’s closed membership and specialized vocabulary. Although the testing community purports objectivity, Madaus argued that tests are “culturally constructed realities” (p. 223). Those realities, Hanson (1993) wrote, are the realities of test designers—the tests “act to transform, mold, and even to create what they supposedly measure” and the testing situations “entail the application of power over the subjects of tests” (p. 52). O’Neill wrote that writing assessment, like other assessment technologies, is a “technical craft [that] is embedded in sociotechnical systems” and “shares the same power and biases inherent in all technology” (p. 7). In Race, Rhetoric, and Technology, Adam Banks (2006) echoed this theme:

All technologies come packaged with a set of politics: if those technologies are not inherently political, the conditions in which they are created and in which they circulate into a society are political and influence their uses in that society (Winner, 1996), and those politics can profoundly changes the spaces in which messages are created, received, and used. (p. 23)

The scholarship on assessment as technology is clearly indebted to critical theories of technology, which posit, as Andrew Feenberg (2005) argued, that “technologies have distinctive features… while also exhibiting biases derived from their place in society” (p. 47). When technologies are placed outside of local control, when they are given “operational autonomy,” then those for whom such technologies were developed can “reproduce the conditions of their own supremacy at each iteration of the technologies they command” (Feenberg, p. 53). In writing assessment, for example, when we uphold advancements in assessments as technological accomplishments and not constructed realities, we end up with “an uncritical acceptance of technological changes so that unintended, or undesirable, results are not exposed” (O’Neill, 1998, p. 9).

This relationship between technology and its implications for how we assess writing is central to Neal’s (2011) second theme—“writing assessment with technology.” In tracing the digital technologies of assessment, Neal identified both computerized scoring of writing and assessment of digital writing. In scholarship on automated essay scoring (AES), for example, scholars have shown that digital technologies for assessing writing have brought new problems as well as new possibilities (Ericsson & Haswell, 2006; Herrington & Moran, 2001; Elliot & Williamson, 2013). With regard to diversity, the literature on AES is mixed. Sara Weigle (2013), for example, showed that e-rater is as reliable as human graders for scoring the writing of multilingual students, and Andrew Klobucar, Norbert Elliot, Perry Dees, Oleksandr Rudniy, and Joshi Kamal (2013) showed that Criterion could successfully be used to identify “at-risk” students for first-year writing courses. Anne Herrington and Sarah Stanley (2012), on the other hand, questioned the raced assumptions in AES programs like Criterion:

What message does Criterion® send to teachers and students alike about the relationships among language, race, and teaching? . . .What unacknowledged ideologies are implicit in our beliefs about language and our students, and what ideologies do we promote using the technology of Criterion®—not only ideologies about language, but also about the function of response to student writing? (p. 48)

To answer those questions, Herrington and Stanley used Criterion to “respond” to a student essay, an essay that included markers of African American English. Using Criterion’s responses—including inaccurate advice—to that essay, they argued that programs like Criterion promote de facto racial ideologies through a narrow standard of correctness. Herrington and Stanley’s conclusions remind us that the assumptions built into automated essay scoring systems such as Criterion and other digital assessment technologies are the raced ideologies of their designers. Unlike human graders who can be trained easily to recognize such discursive variety (Smitherman, 1993) or differences in dialect features (Johnson & Van Brackle, 2011; Ochsner & Fowler, 2012), machines lack that ability. And although machines may score as reliably as humans, they cannot determine whether decisions made from test outcomes are good ones. In other words, machines cannot conduct validation studies.

VALIDITY, RELIABILITY, AND FAIRNESS

In composition studies today, most writing assessment is based on a contextualized model of assessment, in which assessments are designed, delivered, and interpreted locally (see Condon, 2011, for a review). Such “shared evaluation” or “organic” procedures have helped define writing assessment in a theoretically informed manner and in ways meaningful to writing teachers because they use teacher expertise to guide assessment decisions (Broad et al., 2009; Royer & Gilles, 2003).

Contextualized views of writing assessment are consistent with newer conceptions of validity as found in the measurement literature (Moss, 1992). Validity—“the degree to which evidence and theory support the interpretation of test scores entailed by the proposed uses of tests”—is “the most fundamental consideration” in developing, designing, and using assessments (AERA/APA/NCME, 1999, p. 9). Contemporary validity theory is indebted to the work of the late ETS senior researcher Samuel Messick (1990), who argued that validity was not a series of individual empirical components, but, rather, a unified concept based on empirical evidence and theoretical rationales, which “support the adequacy and appropriateness of interpretations and actions based on test scores or other modes of assessment” (p. 1). Messick (1994) wrote that in moving from a task-based conception of assessment to a construct-based conception of assessment, we “begin by asking what complex of knowledge, skills, or other attributes should be assessed, presumably because they are tied to explicit or implicit objectives of instruction or are otherwise valued by society” (p. 16). We then inquire what behaviors or performances demonstrate that construct.

But validity is not simply designing and delivering a test; validity is an ongoing process. In the validation process, we consider how test scores will be used and gather a variety of evidence to help us make a determination if those interpretations are meaningful. In other words, we validate the decisions we make about test outcomes, not necessarily the test itself. In making validation decisions, we can consider:

the relationship between the content of the test and the construct it is designed to measure;

the response processes of test-takers and raters;

the relationship of test scores to external measures, such as other assessments supposed to measure the same construct;

the intended and unintended uses of test scores; for example, in the case where a test yields unequal outcomes because it contains construct-irrelevant materials (“the extent to which test scores are influenced by factors that are irrelevant to the construct that the test is intended to measure”) or suffers from construct underrepresentation (“the extent to which a test fails to capture important aspects of the construct that the test is intended to measure”; AERA/APA/NCME, 1999, pp. 173–174).

Messick (1989, 1994) thought that construct-irrelevant variance and construct underrepresentation were especially likely to be sources of negative consequences for women and racial/ethnic groups, and he wrote frequently about the importance of attending to issues of validity in ensuring that tests were fair.

NCTE-WPA White Paper on Writing Assessment in Colleges and Universities

The Fair use of writing assessment is crucial, since it can be used to make important decisions about individuals. A concern for fairness should guard against any disproportionate social effects on any language minority group. Writing assessments that are used to make important decisions about individuals and the material and educational conditions that affect these individuals should provide an equal opportunity for students to understand the expectations, roles, and purposes of the assessment... (emphasis added)

The second concept integral in educational measurement theory is reliability—“the consistency of [tasks and scoring procedures] when the testing procedure is repeated on a population of individuals or groups” (AERA/APA/NCME, 1999, p. 25). Reliability, as Yancey (1999) has pointed out, was at the center of much writing assessment scholarship in the 20th century, as designers worked to find evaluation methods and scoring procedures that could be repeated with consistency.

Beyond validity and reliability is a third fundamental component of assessment: fairness. As ETS researcher Doug Baldwin (2012) explained, in the measurement community, fairness means “assessment procedures that measure the same thing for all test takers regardless of their membership in an identified subgroup” (p. 328). In Educational Measurement, Gregory Camilli (2006) noted that:

Concerns about fairness arise from the intended and unintended consequences of testing. Fairness is thus not a property of a test per se, and for this reason, investigations of fairness are framed by test use. (p. 251)

Fairness, thus, is tightly linked to validity and always a matter of degree because no one test can be exactly the same thing for every test-taker. As a result, the important issues in educational measurement in terms of fairness are about ensuring lack of bias and ensuring that assessments are accessible (see Berk, 1982; Camilli & Shepard, 1994). In the composition community, fairness is also linked to validity and appropriate use of test results. As detailed in the NCTE-WPA (2008) paper on Writing Assessment in Colleges and Universities, fairness also means attending to the social effects of assessing: “a concern for fairness should guard against any disproportionate social effects on any language minority group.” Although the broad interpretation of fairness found in composition studies is useful, writing program administrators, researchers, and teachers need specific advice on ensuring that digital writing assessment is fair. In the following section, I suggest how we can draw on the measurement literature to develop guidelines for fair digital writing assessment.

FAIRNESS AND DIGITAL WRITING ASSESSMENT

Although there are many sources where we may look for guidance on developing fairness guidelines in assessing digital writing, such as the ETS Standards for Quality and Fairness (2002), a useful starting point is the Standards for Educational and Psychological Testing (1999). The Standards, set forth by the American Educational Research Association, American Psychological Association, and the National Council on Measurement in Education, offer agreed-upon standards for the design, administration, and interpretation of educational and psychological tests. Writing assessment researchers such as Brian Huot (1990), Peggy O’Neill (2003), Chris Gallagher (2009), Diane Kelly-Riley (2006, 2011), and Inoue and Poe (2012) have embraced the Standards because they help bridge the fields of composition and educational measurement.

The Standards are an important articulation of assessment principles for test designers and practitioners, covering such topics as test construction, fairness in testing, and testing applications. Unlike other publications, such as Educational Measurement, which charts current theoretical positions in the field of measurement, or articles in journals such as Assessing Writing or Educational Measurement: Issues and Practice that present current empirical research, the Standards function more like an agreed-upon position statement. Thus, the goal of the Standards is to “promote the sound and ethical use of tests and [provide] a basis for evaluating the quality of testing practices” (AERA/APA/NCME, 1999, p. 1).

According to the Standards, fairness may be defined as lack of bias, equitable treatment in the testing process, equality in outcomes of testing, and opportunity to learn. Although fairness and validity are currently found in separate sections of the Standards, fairness will be integrated in discussions of validity in the forthcoming revision to the Standards. As explained in a 2010 presentation to the American Educational Research Association, the rationale for this change is as follows:

Fairness in testing cannot be separated from accessibility.
Individuals should be able to understand and respond without performance being influenced by construct-irrelevant characteristics.
All examinees that the test is intended for should have an unobstructed opportunity to demonstrate their standing on the construct(s) being measured by the assessment. (Lane, Plake, Herman, Cook, & Worrell, 2010)

What’s important here for digital writing assessment, and which I will explain in more detail later, is that accessibility and opportunity to demonstrate ability are integral to any digital assessment design and the use of assessment results. Fair digital writing assessment, I contend, means that we can make valid, ethical conclusions from assessment results so that we may provide all students the opportunity to learn. Fairness, in other words, is not about making better assessments; it’s about making better decisions—decisions that have a meaningful, positive impact on student learning.

THE FAIRNESS GUIDELINES AND DIGITAL WRITING ASSESSMENT

The twelve guidelines for achieving fairness in testing in the Standards for Educational and Psychological Testing include a range of considerations (see Appendix for full list). Seven of the guidelines cover topics related to test interpretation and use while five others address specific issues related to statistical bias or “sensitivity.” In the following discussion, I point to a subset of those recommendations and suggest how they can be used in assessing digital writing, with particular concern for their implications for assessing the digital writing of diverse students. Note that I am not suggesting in my discussion that we should rely solely on educational measurement theory, but, rather, that we can build on the insights of that field as we develop our digital writing standards for fair assessment. I also do not mean to suggest that the subset of guidelines that I discuss here is more important than those not discussed. It is simply beyond the scope of this chapter to address every fairness standard.

Design Assessment

Standard 7.4
Test developers should strive to identify and eliminate language, symbols, words, phrases, and content that are generally regarded as offensive by members of racial, ethnic, gender, or other groups, except when judged to be necessary for adequate representation of the domain.

Fair assessment begins with thoughtful design. Design includes how we theorize the construct “digital writing” and ensure that we have a meaningful assessment process in place. Although many composition researchers are well aware of considerations such as those detailed in Standard 7.4—that every test must be checked for “offensive” content—I find that many writing assessment designs do not include a clear process from the student point of view. Every student needs not only to take a test in a comparable context but also to understand the purpose of the test, how they will be scored, and the possibilities for appealing the results.

Standard 7.12
The testing or assessment process should be carried out so that test takers receive comparable and equitable treatment during all phases of the testing or assessment process.

Often students are told what test results mean (e.g., a score above 5 means “pass”), but they are not told that their eportfolios or placement exams may be re-read or that they can contact anyone about the test process. Standard 7.12 reminds us that fairness must extend through the entire assessment process. In this regard, the advice from the Standards echoes the NCTE white paper on assessment, which states that students need an “equal opportunity to understand the expectations, roles, and purposes of the assessment.”

Interpreting Results Begins with Collecting Data

It is impossible to determine if an assessment is fair if no information has been gathered on the students who have taken the assessment. The fairness guidelines in the Standards assume the collection of test-taker demographic data. In fact, as Camilli (2006) wrote, “in typical fairness analysis, the first step is to define groups and then to compare differences between groups R [reference or base] and F [focal] in terms of percent selected (say for a job or college position), or average scores of an item on a test” (p. 225). Unfortunately, although many writing programs may know the demographics of entering students, they do not report placement exam scores or other large-scale assessment results with that demographic information. Programs often simply sort students into categories of basic writing, first-year writing, honors, and exempt from writing without identifying international students, students of color, students with disabilities, and so on. As a result, writing program administrators are often at a loss to determine how well their writing courses are helping subgroups of students.
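The first step that Camilli describes, comparing a reference group and a focal group in terms of percent selected, can be sketched in a few lines of Python. This is only an illustrative sketch: the group labels, the counts, and the helper name compare_selection are hypothetical and do not come from any assessment discussed in this chapter.

```python
from dataclasses import dataclass


@dataclass
class GroupOutcome:
    """Counts for one group of test takers (all values hypothetical)."""
    name: str
    n_tested: int
    n_selected: int

    @property
    def selection_rate(self) -> float:
        # Share of the group's test takers who were selected.
        return self.n_selected / self.n_tested


def compare_selection(reference: GroupOutcome, focal: GroupOutcome) -> dict:
    """Compare the focal group's selection rate with the reference group's."""
    return {
        "rate_difference": focal.selection_rate - reference.selection_rate,
        "rate_ratio": focal.selection_rate / reference.selection_rate,
    }


# Hypothetical counts: 300 of 400 reference-group students selected,
# 50 of 100 focal-group students selected.
reference = GroupOutcome("R", n_tested=400, n_selected=300)
focal = GroupOutcome("F", n_tested=100, n_selected=50)
result = compare_selection(reference, focal)
print(result)
```

A difference or ratio of this kind is only a starting point for inquiry. It shows that outcomes differ between groups, not why they differ or whether the decisions based on them are sound.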

But what data should be collected? Gather data that will help answer questions about the specific students at an institution, because as Camilli (2006) wrote, “fairness issues are inevitably shaped by the particular social context in which they are embedded” (p. 221). At some institutions data may include race as well as language or ethnicity. Other indicators—like a student’s age and military status—are also relatively easy to gather on a survey. Socioeconomic status is trickier because students may not know parent wealth or income, and parent educational levels are useful but not always reflective of class. Often in my research, when I inquire about socioeconomic status, I work with an institutional research office to gather a variety of indicators, including zip codes.

One consideration in data gathering is linking writing program data to institutional data. As Asao Inoue and I have explained (2012), although we like the term racial formation (Omi & Winant, 1994) because it points to the constructed nature of identity in the U.S. with all the history surrounding that construction, most institutions gather racial/ethnic data based on some configuration of U.S. census categories. Although such categories are not always “pretty” for analysis because they don’t fit our humanistic sense of shifting identities, raced terms like “white, non-Hispanic” are identifiable across institutional data sets whereas categories like Euro-American are not.³

It is also worth pointing out that not all identity categories are consistent over time. For example, Ed White and Leon Thomas (1981) looked at student performance on two college entrance tests for the California higher-education system—the Test of Standard Written English (TSWE), a selected-response test with only questions on English language usage, and the English Placement Test (EPT), a test that included selected-response questions and a constructed-response essay. (Selected-response tests include multiple choice and other kinds of tests that offer test-takers a limited range of options, while constructed-response tests offer students a chance to write a response.) Students’ different scores on the two tests cast doubt on the TSWE’s validity for measuring writing ability, and the test was eventually rejected. Asao Inoue and I (2012) looked at the test that was kept—the EPT—almost 30 years later to see how it held up over time: How were student scores on the EPT today compared to 30 years ago? One of our findings (Table 1) was that Asian Pacific Islander (API) student scores dropped between 1978 and 2008. Moreover, in 2008, their scores were the lowest of all students who took the test.

Group	N for EPT Total	Avg. EPT Total	SD for EPT Total
Fall 1978
White	5,246	152.5	5.6
Black	585	140.8	9.7
Mexican-American	449	145.7	8.6
Asian Pacific Islander*	617	146.3	10.0
Native American**	N/A	N/A	N/A
All	10,719	150.1	7.9

Fall 2008
White	578	148.0	6.7
Black	219	140.5	7.9
Latino/a	884	141.6	8.1
Asian Pacific Islander	450	138.7	8.8
Native American	15	146.6	6.0
All	2,251	142.7	8.6

* Labeled "Asian-American" in White and Thomas.
** Not reported in White and Thomas.

Table 1. EPT Total Results Reported in White and Thomas (1981) and CSUF's Office of Institutional Research (2008)

Why the change in test scores? After looking into various factors, we posited that the drop in scores had to do with the changing Asian demographics on the Fresno campus, namely that Cal State Fresno had an influx of Hmong students during the 1980s that shifted the demographic composition of the category “Asian.” Those Hmong students had linguistic markers in their writing that EPT essay scorers did not find acceptable (Inoue & Poe, 2012). In other words, in 30 years, the Asian Pacific Islander group label stayed the same, but the individual members of that group in the California system, especially at places like Fresno, changed dramatically. Ignoring that demographic change would have led to misinterpretation of longitudinal assessment results. (David Gillborn and Heidi Safia Mirza, 2000, also have an excellent longitudinal analysis of test-takers in the UK.)
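The comparison drawn from Table 1 can be reproduced with a short Python sketch. The group means are transcribed from the table; the function name mean_changes is a hypothetical helper introduced only for illustration. Note that the sketch matches groups by label alone, which is precisely the limitation described above: the same label in 1978 and 2008 does not guarantee the same population.

```python
# Average EPT Total scores by group, transcribed from Table 1.
ept_1978 = {
    "White": 152.5,
    "Black": 140.8,
    "Mexican-American": 145.7,
    "Asian Pacific Islander": 146.3,
}
ept_2008 = {
    "White": 148.0,
    "Black": 140.5,
    "Latino/a": 141.6,
    "Asian Pacific Islander": 138.7,
    "Native American": 146.6,
}


def mean_changes(earlier: dict, later: dict) -> dict:
    """Change in average score for group labels that appear in both years."""
    shared = earlier.keys() & later.keys()
    return {group: round(later[group] - earlier[group], 1) for group in sorted(shared)}


changes = mean_changes(ept_1978, ept_2008)

# Group label with the lowest average score in 2008.
lowest_2008 = min(ept_2008, key=ept_2008.get)

print(changes)
print(lowest_2008)
```

Labels that changed between the two reports ("Mexican-American" and "Latino/a") drop out of the comparison entirely, a small reminder that longitudinal analysis depends on how categories are defined at each point in time.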

One way to identify differences in local student populations over time is to supplement data with additional qualifiers, such as language or national citizenship. In the following placement exam example, only 5% of incoming students are placed into basic writing, but Asian students make up 50% of those basic writing students. A little more demographic data reveals that those students are not international students but primarily U.S. citizens (that is, local Hmong students). Asking students a few simple questions (ideally after they have taken the exam, to avoid raising student fears about racism or bias) would have revealed this simple but compelling information.

It could be argued that the same information could be gathered by asking teachers about the students in their classrooms, but that qualitative impression of student demographics may not be accurate and isn’t sufficient to conduct a statistical analysis of test results. The take-away message here for assessing digital literacy is that we need to collect locally sensitive data that inform our interpretation of assessment results.
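To make the value of supplementary qualifiers concrete, here is a minimal Python sketch of the kind of disaggregation described above. Every record is hypothetical: the cohort of 200 students is constructed only so that it mirrors the 5% and 50% figures in the example, and the field names (`group`, `citizenship`, `placement`) and helper functions are assumptions made for illustration.

```python
def make_records():
    """Build a hypothetical cohort of 200 students (illustrative data only)."""
    records = []
    # 10 of the 200 students (5%) are placed into basic writing.
    records += [{"group": "Asian", "citizenship": "U.S.", "placement": "basic"}] * 4
    records += [{"group": "Asian", "citizenship": "international", "placement": "basic"}] * 1
    records += [{"group": "Other", "citizenship": "U.S.", "placement": "basic"}] * 5
    # The remaining 190 students are placed into first-year writing.
    records += [{"group": "Asian", "citizenship": "U.S.", "placement": "first-year"}] * 20
    records += [{"group": "Other", "citizenship": "U.S.", "placement": "first-year"}] * 170
    return records


def share(records, key, value):
    """Fraction of records whose `key` field equals `value`."""
    if not records:
        return 0.0
    return sum(1 for r in records if r[key] == value) / len(records)


records = make_records()
basic = [r for r in records if r["placement"] == "basic"]
asian_basic = [r for r in basic if r["group"] == "Asian"]

placed_basic = share(records, "placement", "basic")          # share of the cohort
asian_share_of_basic = share(basic, "group", "Asian")        # share of basic writers
citizens_among_asian_basic = share(asian_basic, "citizenship", "U.S.")

print(f"Placed into basic writing: {placed_basic:.0%}")
print(f"Asian students among basic writers: {asian_share_of_basic:.0%}")
print(f"U.S. citizens among those students: {citizens_among_asian_basic:.0%}")
```

The third figure is available only because citizenship was recorded alongside the racial category, which is the point of collecting locally sensitive data.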

Collecting Evidence about Students' Digital Identities

In addition to demographic information, we also need information about student digital identities (Warschauer & Matuchniak, 2010). Digital access is certainly an obvious place to start in asking about students’ digital identities, and the CCCC (2004) Position Statement on Teaching, Learning, and Assessing Writing in Digital Environments calls for programs to “assess students’ access to hardware, software and access tools used in the course, as well as students’ previous experience with those tools.” However, access to technology is not solely about a binary “digital divide” between digital “haves” and “have-nots” (Grabill, 2003; Hoffman & Novak, 1998). The digital divide is about “differential access to, contact with, and use of [information and computer technologies] cross-nationally... as well as between social and demographic groups within individual nations” (Jones, Johnson-Yale, Millermaier, & Seoane Pérez, 2009, p. 244).

An expanded definition of the digital divide means that in collecting information about student digital identities, we need to ask more than simple questions about access to technology. As Mark Warschauer (2003) argued, if we only focus on the digital divide, then we are likely to assume that better assessment outcomes will come with better access to technology. A more fine-tuned approach, in which we collect evidence as the Pew Internet and American Life Project has done (Lenhart, Arafeh, Smith, & MacGill, 2008; Lenhart, 2012; Lenhart, Purcell, Smith, & Zickuhr, 2010; Madden, Cortesi, Gasser, Lenhart, & Duggan, 2012; Madden, Lenhart, Duggan, Cortesi, & Gasser, 2013) on frequency and conditions of access, type and place of access, attitudes toward digital technology, prior experiences and parental influences, and the kinds of devices that individuals use, will likely yield evidence that will help us make more informed decisions about our assessments.

Drawing on the research literature can help us craft survey and other data collection instruments. For example, research has pointed to broad racial and gender differences in technology use (Compaine, 2001; Livingston, 2010; Wei & Hindman, 2011). Evidence about the relationship between racial identity and technology has also been shown in studies, for example, of where students have learned technologies (Jones, Johnson-Yale, Millermaier, & Seoane Pérez, 2009) and of email use (Jackson, Ervin, Gardner, & Schmitt, 2001). As Barbara Monroe (2004) showed in Crossing the Digital Divide: Race, Writing, and Technology in the Classroom, ethnography is also a valuable research tool in understanding digital literacy. Monroe wrote that community histories shape student interactions with technology and public audiences and reminded us that many of our assumptions about how people use digital technology are based on academic frames of interpretation, not local meanings.

Of course, getting good self-reported data about digital literacies is not always easy. One reason is that students do not recognize many digital ways of composing as “writing.” For example, they may not consider games and other out-of-school ways of interacting with technology as prior experiences with technology worth reporting in a survey associated with a writing test or eportfolio. Students may also overestimate their technological abilities. Just because students report having frequently used a slideshow-creation application, for example, does not mean they are entirely proficient in that technology in a way that will be expected in academic contexts. Finally, while some schools provide students remarkably rich learning environments in which technology is used in creative, smart ways to teach writing (Herrington, Hodgson, & Moran, 2009), other schools have much more limited uses of technology in classrooms. In such cases, technology may be used in service of a narrowly defined writing outcome, such as the ability to produce a certain number of words within an allotted time, or there may be more restrictions on the kinds of digital products students can produce (e.g., file size, available software, etc.; Monroe, 2004). Thus, simply querying students about past digital writing experiences without asking about the nature of those interactions can lead to misconceptions about the kinds of digital academic literacies that students possess.

In the end, research tells us that student relationships to technology are complex; thus, we likely need multiple measures to understand student digital identities and how those identities inform their interactions with digital writing assessment. Through surveys, ethnographic research, and other approaches, we need not simply to gather data but to attend to local meanings of technology use that inform how students come to digital writing assessment.

Interpreting the Evidence in Context

As Norbert Elliot and Les Perelman (2012) wrote, “the point of demarcation between the educational measurement community and the writing assessment community has been on qualitative practice because it is there that classroom assessment is situated” (p. 3). Until recently, the measurement community has subordinated classroom assessment to assessment beyond the classroom (e.g., state and national-level testing). Yet, in composition studies, the connection between large-scale assessment and classroom teaching has been fundamental to the ways that we think about assessing writing (Elbow & Belanoff, 2009; Huot, 2002; Peckham, 2012; White, 1994). The decisions we make about digital writing assessment, composition researchers tell us, should be framed within the contexts in which those results will be used. The point of collecting information about students is so that we can make better decisions about how to serve those students, not simply to produce tables of outcomes measured. (See Inoue, 2012, and Kelly-Riley, 2011, for two analyses of program-level interpretations.)

Standard 7.1
When credible research reports that test scores differ in meaning across examinee subgroups for the type of test in question, then to the extent feasible, the same forms of validity evidence collected for the examinee population should also be collected for each relevant subgroup. Subgroups may be found to differ with respect to appropriateness of test content, internal structure of test responses, the relation of test scores to other variables, or the response processes employed by individual examinees. Any such findings should receive due consideration in the interpretation and use of scores as well as in subsequent test revisions.

Once data are collected and we begin to analyze results, we may find differences in test scores. Standard 7.1 states that if we have evidence that groups are performing differently on the test, then we need to explore those differences in relation to validity. Simply put, it’s not enough to conclude that certain students have different outcomes. If Latina students performed differently on our eportfolio than other subgroups, we need to gather the same kinds of validity evidence for every student subgroup that submitted eportfolios. We then need to figure out why those differences occurred. The goal here is that once groups have been identified and we have obtained scores, “the important task is to distinguish a genuine group of differences in proficiency from one that arises from a distorted measurement process” (Camilli, 2006, p. 225). It is not necessarily a problem if some students come more poorly prepared for the kinds of digital writing expected in college writing classrooms; the problem comes when our curriculum doesn’t do something about that disparity, and we perpetuate those inequalities through the combination of testing and curricular interventions.⁴ Likewise, if we find that students responded differently, it may have been the result of construct irrelevance. For example, in a placement exam question about Facebook, we assumed all students knew the workings of Facebook; however, the Chinese students at our institution were familiar with Renren. Was knowledge of Facebook central to the construct of digital literacy that we were trying to measure on the placement exam?

Knowledge of software is an obvious source of potential construct irrelevance. In my opinion, the issue of construct irrelevance is also worth considering around assumptions about network speed, privacy, and platform. For students used to slower network speeds, their understanding of a timed digital writing task will be quite different from that of students accustomed to working on devices with faster network connections. And international students may have quite different responses to a digital writing task that focuses on a theme about networks based on their experiences with censorship in their home countries. Finally, students primarily accustomed to working on mobile devices may find certain kinds of tasks harder than students familiar with desktop and laptop technologies.

Construct irrelevance is also an important consideration beyond software and hardware: it extends to the kinds of collaborative tasks we expect students to be able to complete digitally, the kinds of questions we expect students to be able to pose and answer in digital environments, and the evaluation of multimedia texts. Ideally, these are all questions to be worked out in the design of digital writing assessments; yet, we are also likely to find these issues as we analyze results because there has been so little work on student response processes to digital writing tasks.
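One hedged way to begin the analysis that Standard 7.1 calls for is to describe how large a subgroup difference is before asking why it occurred. The Python sketch below computes a standardized mean difference (Cohen's d with a pooled standard deviation) for two sets of hypothetical eportfolio scores. The scores, the group names, and the function name `cohens_d` are invented for illustration; the Standards do not prescribe this statistic, and a large value raises a question about validity rather than answering it.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(scores_a, scores_b):
    """Standardized mean difference using a pooled sample standard deviation."""
    n_a, n_b = len(scores_a), len(scores_b)
    pooled_variance = (
        (n_a - 1) * stdev(scores_a) ** 2 + (n_b - 1) * stdev(scores_b) ** 2
    ) / (n_a + n_b - 2)
    return (mean(scores_a) - mean(scores_b)) / sqrt(pooled_variance)


# Hypothetical eportfolio scores on a 1-6 scale for two subgroups.
subgroup_a = [4, 5, 3, 4, 5, 4]
subgroup_b = [3, 3, 4, 2, 3, 3]

d = cohens_d(subgroup_a, subgroup_b)
print(f"Standardized difference between subgroups: {d:.2f}")
```

A difference flagged this way would then be examined against the possible sources discussed above, such as assumed knowledge of a particular platform, before any decision is made about using the scores.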

Once differences in assessment outcomes have been analyzed, then we have to decide what to do.

Standard 7.2
When credible research reports differences in the effects of construct-irrelevant variance across subgroups of test takers on performance on some part of the test, the test should be used if at all only for those subgroups for which evidence indicates that valid inferences can be drawn from test scores.

Will we keep those placement exam scores on the Facebook prompt? Will our basic writing class best serve those students as that class is currently delivered? Standard 7.2 states that we should only use digital writing assessment results if we can make sure that they permit us to make good decisions. This guideline reminds us that no one prompt can fully capture the construct of writing as it is theorized today in the composition literature. Most large-scale assessments, such as eportfolios, include more than one task so that we have multiple measures of student writing.

In the end, I want to make one other point about interpreting assessment results: that is, the fairness guidelines state nothing about statistical significance. Again, Camilli (2006) noted that:

many unfair test conditions may not have a clear statistical signature, for example, a test may include items that are offensive or culturally insensitive to some examinees. Quantitative analyses infrequently detect such items; rather such analyses tend to focus on the narrower issues of whether a measurement or prediction model is the same for two or more groups or examinees. (p. 221)

Standard 7.8
When scores are disaggregated and publicly reported for groups identified by characteristics such as gender, ethnicity, age, language proficiency, or disability, cautionary statements should be included whenever credible research reports that test scores may not have comparable meaning across groups.

Statistical analyses are enormously important in many instances of determining fairness of individual assessments and for predicting future performance (Elliot, Dees, Rudniy, & Joshi, 2012, provide an eloquent example of how statistical analysis revealed predictive flaws in one writing assessment). However, statistical analysis is not an absolute requirement in interpreting assessment results. In analyzing digital writing assessment data, a researcher may well not have a large enough sample size to make claims of statistical significance (e.g., 95% confidence level). This does not mean such data should be ignored, but that the researcher should find other ways of interpreting the results.
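To see why sample size limits claims of statistical significance, consider the Fall 2008 rows of Table 1. The following Python sketch estimates a rough 95% margin of error for each group mean. It assumes a normal approximation (z = 1.96) and simple random sampling, neither of which is claimed in the original studies, and the function name `margin_of_error` is hypothetical.

```python
from math import sqrt

# (N, mean, SD) for Fall 2008, transcribed from Table 1.
FALL_2008 = {
    "White": (578, 148.0, 6.7),
    "Black": (219, 140.5, 7.9),
    "Latino/a": (884, 141.6, 8.1),
    "Asian Pacific Islander": (450, 138.7, 8.8),
    "Native American": (15, 146.6, 6.0),
}

# Normal approximation; a t-based critical value would be larger for N = 15.
Z_95 = 1.96


def margin_of_error(group):
    """Approximate 95% margin of error for a group's mean score."""
    n, _mean, sd = FALL_2008[group]
    return Z_95 * sd / sqrt(n)


for group, (n, group_mean, _sd) in FALL_2008.items():
    print(f"{group}: {group_mean} +/- {margin_of_error(group):.1f} (N = {n})")
```

Under these assumptions, the interval for the 15 Native American students is several times wider than the intervals for the larger groups, which is one reason such results call for contextual interpretation rather than dismissal.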

Framing the Consequences of Fairness

Our responsibility in delivering large-scale digital writing assessment does not end once we have placed students into first-year writing classes or written a summary of general education outcomes to a provost. Our responsibility extends to framing assessment results for public consumption, and the fairness guidelines offer several strong statements about professional responsibility in articulating the “consequences” of assessment outcomes (Messick, 1989). The first, Standard 7.8, places responsibility on researchers to explain potential inconsistencies in the meaning of test scores, stating that “cautionary statements should be included whenever credible research reports that test scores may not have comparable meaning across groups” (1999, p. 83).

Standard 7.9
When tests or assessments are proposed for use as instruments of social, educational, or public policy, the test developers or users proposing the test should fully and accurately inform policymakers of the characteristics of the tests as well as any relevant and credible information that may be available concerning the likely consequences of test use.

The responsibility of researchers is again the subject of Standard 7.9, which states that “test developers or users proposing the test should fully and accurately inform policymakers of the characteristics of the tests” (1999, p. 83, emphasis added). Taken together, these standards make clear that in assessing digital writing, we are responsible for how the results of our assessments are disseminated. In an attempt to provide fair assessment, we have to narrate the findings of our assessments. It is up to digital writing researchers to articulate a vision of technology and writing that promotes learning.

CONCLUSION

The assessment of digital writing holds much promise for tapping the many literacies that students now bring to college writing classrooms. But large-scale assessment of digital writing also brings new responsibilities to teachers and program administrators if we are to make digital writing assessment fair. In large-scale digital writing assessment, it is not enough simply to articulate criteria on rubrics; we also need to use the best practices articulated in the writing assessment literature and in the measurement community, such as in the Standards for Educational and Psychological Testing.

Today, fairness is integral to validity and, thus, integral to any large-scale digital writing assessment. The fairness guidelines offered in the Standards provide us a way to place fairness at the center of digital writing assessment. They provide us ways to think about the interpretation and use of assessment results. The Standards, however, are only a starting place; we also need to build on the Standards with the expertise on assessing student writing developed over the last 40 years in composition studies. From my point of view as a member of that composition community, the most important interpretation drawn from any large-scale assessment is whether we are making good decisions about learning, that is, good decisions about all students who come to our writing classrooms. Contextualization is the hallmark of contemporary writing assessment and, as such, should be central in any large-scale digital writing assessment. When we do not interpret our digital writing assessments within contextualized frameworks that pay attention to issues of fairness, then we are likely to perpetuate social inequalities, thus undermining opportunities to learn, and, in my opinion, undermining good writing instruction.

NOTES

1. In this chapter, I am primarily interested in focusing on digital writing assessment, such as digital multimodal projects, but these recommendations can also be applied to contexts in which students produce more conventional essayistic forms on computers and even computerized scoring of writing. It should be noted that I have additional reservations about computerized scoring of writing that go beyond my discussion here (see Condon, 2013).

2. The Standards describe the technical obligations involved in test fairness. Gregory Camilli (2006), in Educational Measurement, offers an explanation of the legal obligations involved in test fairness in addition to the technical obligations.

3. These categories can be found through the U.S. Office of Management and Budget. The other reason we like to focus on race or racial formations rather than linguistic identity markers alone, such as African American English or Chinese English, is that not all members of a particular group may use that particular dialect. As David Holmes (1999) argued, speakers within a racial formation may have quite varied linguistic practices. All members of that racial formation, however, may experience prejudice, regardless of their linguistic practices. Linguistic identities can be a subset of information gathered for further analysis.

4. I hedge here, given that my statement may suggest a simplistic notion of under-prepared students. As Mike Rose (1995) and many other composition scholars have shown, under-prepared students often come with a long history of not being served by assessment and teaching.

APPENDIX

Fairness Guidelines from the Standards for Educational and Psychological Testing

When credible research reports that test scores differ in meaning across examinee subgroups for the type of test in question, then to the extent feasible, the same forms of validity evidence collected for the examinee population should also be collected for each relevant subgroup. Subgroups may be found to differ with respect to appropriateness of test content, internal structure of test responses, the relation of test scores to other variables, or the response processes employed by individual examinees. Any such findings should receive due consideration in the interpretation and use of scores as well as in subsequent test revisions.
When credible research reports differences in the effects of construct-irrelevant variance across subgroups of test takers on performance on some part of the test, the test should be used if at all only for those subgroups for which evidence indicates that valid inferences can be drawn from test scores.
When credible research reports that differential item functioning exists across age, gender, racial/ethnic, cultural, disability, and/or linguistic groups in the population of test takers in the content domain measured by the test, test developers should conduct appropriate studies when feasible. Such research should seek to detect and eliminate aspects of test design, content, and format that might bias test scores for particular groups.
Test developers should strive to identify and eliminate language, symbols, words, phrases, and content that are generally regarded as offensive by members of racial, ethnic, gender, or other groups, except when judged to be necessary for adequate representation of the domain.
In testing applications involving individualized interpretations of test scores other than selection, a test taker's score should not be accepted as a reflection of standing on the characteristic being assessed without consideration of alternate explanations for the test taker's performance on that test at that time.
When empirical studies of differential prediction of a criterion for members of different subgroups are conducted, they should include regression equations (or an appropriate equivalent) computed separately for each group or treatment under consideration or an analysis in which the group or treatment variables are entered as moderator variables.
In testing applications where the level of linguistic or reading ability is not part of the construct of interest, the linguistic or reading demands of the test should be kept to the minimum necessary for the valid assessment of the intended construct.
When scores are disaggregated and publicly reported for groups identified by characteristics such as gender, ethnicity, age, language proficiency, or disability, cautionary statements should be included whenever credible research reports that test scores may not have comparable meaning across groups.
When tests or assessments are proposed for use as instruments of social, educational, or public policy, the test developers or users proposing the test should fully and accurately inform policymakers of the characteristics of the tests as well as any relevant and credible information that may be available concerning the likely consequences of test use.
When the use of a test results in outcomes that affect the life chances or educational opportunities of examinees, evidence of mean test score differences between relevant subgroups of examinees should, where feasible, be examined for subgroups for which credible research reports mean differences for similar tests. Where mean differences are found, an investigation should be undertaken to determine that such differences are not attributable to a source of construct underrepresentation or construct-irrelevant variance. While initially the responsibility of the test developer, the test user bears responsibility for uses with groups other than those specified by the developer.
When a construct can be measured in different ways that are approximately equal in their degree of construct representation and freedom from construct-irrelevant variance, evidence of mean score differences across relevant subgroups of examinees should be considered in deciding which test to use.
The testing or assessment process should be carried out so that test takers receive comparable and equitable treatment during all phases of the testing or assessment process.

REFERENCES

Adsanatham, Chanon. (2012). Integrating assessment and instruction: Using student-generated grading criteria for multimodal digital projects. Computers and Composition, 29, 152–174.

AERA/APA/NCME. (1999). Standards for educational and psychological testing. Washington DC: American Educational Research Association.

Baldwin, Doug. (2012). Fundamental challenges in developing and scoring constructed-response assessments. In Norbert Elliot & Les Perelman (Eds.), Writing assessment in the 21st century: Essays in honor of Edward M. White (pp. 327–343). Cresskill, NJ: Hampton Press.

Banks, Adam. (2006). Race, rhetoric, and technology: Searching for higher ground. Mahwah, NJ: Lawrence Erlbaum Associates.

Behizadeh, Nadia, & Engelhard, George. (2011). Historical view of the influences of measurement and writing theories on the practice of writing assessment in the United States. Assessing Writing, 16 (3), 189–211.

Berk, Ronald. (1982). Handbook of methods for detecting test bias. Baltimore, MD: Johns Hopkins University Press.

Broad, Bob. (2003). What we really value: Beyond rubrics in teaching and assessing writing. Logan: Utah State University Press.

Broad, Bob; Adler-Kassner, Linda; Alford, Barry; Detweiler, Jane; Estrem, Heidi; Harrington, Susanmarie; McBride, Maureen; Stalions, Eric; & Weeden, Scott. (2009). Organic writing assessment: Dynamic mapping in action. Logan: Utah State University Press.

Cambridge, Darren; Cambridge, Barbara; & Yancey, Kathleen Blake. (2009). Electronic portfolios 2.0: Emergent research on implementation and impact. Sterling, VA: Stylus.

Camilli, Gregory. (2006). Test fairness. In Robert L. Brennan (Ed.), Educational measurement (4th ed., pp. 221–256). Westport, CT: American Council on Education and Praeger Publishers.

Camilli, Gregory, & Shepard, Lorrie A. (1994). Methods for identifying biased test items. Thousand Oaks, CA: Sage Publications, Inc.

Conference on College Composition and Communication. (2004). Position statement on teaching, learning, and assessing writing in digital environments. Retrieved from http://www.ncte.org/cccc/resources/positions/digitalenvironments

Coley, Toby F. (2012). Teaching with digital media in writing studies. New York: Peter Lang.

Common Core English Language Arts Standards. (2012). Retrieved from http://www.corestandards.org/

Compaine, Benjamin M. (2001). The digital divide: Facing a crisis or creating a myth? Cambridge, MA: MIT Press.

Condon, William. (2011). Reinventing writing assessment: How the conversation is shifting. WPA: Writing Program Administration, 34 (2), 162–182.

Condon, William. (2013). Large-scale assessment, locally-developed measures, and automated scoring of essays: Fishing for red herrings? Assessing Writing, 18 (1), 100–108.

DeVoss, Dànielle N., & Porter, James E. (2006). Why Napster matters to writing: Filesharing as a new ethic of digital delivery. Computers and Composition, 23, 170–210.

Elbow, Peter, & Belanoff, Patricia. (2009). Portfolios as a substitute for proficiency examinations. In Brian Huot & Peggy O’Neill (Eds.), Assessing writing: A critical sourcebook (pp. 97–101). Boston, MA: Bedford/St. Martin’s.

Elliot, Norbert; Dees, Perry; Rudniy, Alex; & Joshi, Kamal. (2012). Placement of students into first-year writing courses. Research in the Teaching of English, 46 (3), 285–313.

Elliot, Norbert, & Perelman, Les. (2012). In context: The contributions of Edward M. White to the assessment of writing ability. In Norbert Elliot & Les Perelman (Eds.), Writing assessment in the 21st century: Essays in honor of Edward M. White (pp. 1–24). Cresskill, NJ: Hampton Press.

Elliot, Norbert, & Williamson, David. (2013). Special issue on assessing writing with automated scoring systems. Assessing Writing, 18 (1).

ETS. (2002). Standards for quality and fairness. Princeton, NJ: Educational Testing Service.

Ericsson, Patricia, & Haswell, Richard. (2006). Machine scoring of student essays: Truth and consequences. Logan: Utah State University Press.

Feenberg, Andrew. (2005). Critical theory of technology: An overview. Tailoring Biotechnologies, 1 (1), 47–64.

Gallagher, Chris. (2009). What do WPAs need to know about writing assessment? An immodest proposal. WPA: Writing Program Administration, 33 (1–2), 29–45.

Gillborn, David, & Mirza, Heidi Safia. (2000). Educational inequality: Mapping race, class and gender (HMI 232). London: OFSTED.

Grabill, Jeffrey T. (2003). On divides and interfaces: Access, class, and computers. Computers and Composition, 20, 455–472.

Hanson, F. Allen. (1993). Testing testing: Social consequences of the examined life. Berkeley, CA: University of California Press.

Herrington, Anne, & Moran, Charles. (2001). What happens when machines read our students’ writing? College English, 63 (4), 480–499.

Herrington, Anne; Hodgson, Kevin; & Moran, Charles. (2009). Teaching the new writing: Technology, change, and assessment in the 21st-century classroom. New York: Teachers College Press.

Herrington, Anne, & Stanley, Sarah. (2012). Criterion: Promoting the standard. In Asao B. Inoue & Mya Poe (Eds.), Race and writing assessment (pp. 47–62). New York: Peter Lang.

Hoffman, Donna L., & Novak, Thomas P. (1998). Bridging the racial divide on the Internet. Science, 280, 390–391.

Holmes, David G. (1999). Fighting back by writing Black: Beyond racially reductive composition theory. In Keith Gilyard (Ed.), Race, rhetoric, and composition (pp. 53–66). Portsmouth, NH: Boynton/Cook.

Huot, Brian. (1990). Reliability, validity, and holistic scoring: What we know and what we need to know. College Composition and Communication, 41 (2), 201–213.

Huot, Brian. (1996a). Computers and writing assessment: Understanding two technologies. Computers and Composition, 13, 231–243.

Huot, Brian. (1996b). Toward a new theory of writing assessment. College Composition and Communication, 47 (4), 549–566.

Huot, Brian. (2002). (Re)Articulating writing assessment for teaching and learning. Logan: Utah State University Press.

Inoue, Asao B. (2009). The technology of writing assessment and racial validity. In Christopher S. Schreiner (Ed.), Handbook of research on assessment technologies, methods, and applications in higher education (pp. 97–120). Hershey, PA: IGI Global.

Inoue, Asao B. (2012). Grading contracts: Assessing their effectiveness on different racial formations. In Asao B. Inoue & Mya Poe (Eds.), Race and writing assessment (pp. 79–94). New York: Peter Lang.

Inoue, Asao B., & Poe, Mya. (2012). Racial formations in two writing assessments: Revisiting White and Thomas’ findings on the English Placement Test after thirty years. In Norbert Elliot & Les Perelman (Eds.), Writing assessment in the 21st century: Essays in honor of Edward M. White (pp. 341–359). Cresskill, NJ: Hampton Press.

Jackson, Linda A.; Ervin, Kelly S.; Gardner, Philip D.; & Schmitt, Neal. (2001). The racial digital divide: Motivational, affective, and cognitive correlates of internet use. Journal of Applied Social Psychology, 31 (10), 2019–2046.

Johnson, David, & Van Brackle, Lewis. (2011). Linguistic discrimination in writing assessment: How raters react to African American “errors,” ESL errors, and standard English errors on a state-mandated writing exam. Assessing Writing, 17 (1), 35–54.

Jones, Steve; Johnson-Yale, Camille; Millermaier, Sarah; & Seoane Pérez, Francisco. (2009). U.S. college students’ internet use: Race, gender and digital divides. Journal of Computer-Mediated Communication, 14 (2), 244–264.

Kane, Michael. (2006). Validation. In Robert L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Westport, CT: American Council on Education and Praeger Publishers.

Kelly-Riley, Diane. (2006). A validity inquiry into minority students’ performances in a large-scale writing portfolio assessment. Unpublished doctoral dissertation, Washington State University.

Kelly-Riley, Diane. (2011). Validity inquiry of race and shared evaluation practices in a large-scale, university-wide writing portfolio assessment. Journal of Writing Assessment, 4 (1). Retrieved from http://www.journalofwritingassessment.org/article.php?article=53

Kimme Hea, Amy. (2009). Going wireless: A critical exploration of wireless and mobile technologies for composition teachers and researchers. Cresskill, NJ: Hampton Press.

Klobucar, Andrew; Elliot, Norbert; Dees, Perry; Rudniy, Oleksandr; & Joshi, Kamal. (2013). Automated scoring in context: Rapid assessment for placed students. Assessing Writing, 18, 62–84.

Lane, Suzanne; Plake, Barbara; Herman, Joan; Cook, Linda; & Worrell, Frank. (2010). Fairness in testing. Presentation to the American Educational Research Association.

Lenhart, Amanda; Arafeh, Sousan; Smith, Aaron; & MacGill, Alexandra. (2008). Writing, technology, and teens. Washington, DC: Pew Internet and American Life Project.

Lenhart, Amanda. (2012, March). Teens, smartphones, and texting. Retrieved from http://pewinternet.org/Reports/2012/Teens-and-smartphones.aspx

Lenhart, Amanda; Purcell, Kristen; Smith, Aaron; & Zickuhr, Kathryn. (2010, February). Social media and young adults. Retrieved from http://pewinternet.org/Reports/2010/Social-Media-and-Young-Adults.aspx

Livingston, Gretchen. (2010). The Latino digital divide: The native born versus the foreign born. Washington, DC: Pew Research Center.

Madaus, George. (1993). A national testing system: Manna from above? An historical/technological perspective. Educational Measurement, 11, 9–26.

Madden, Mary; Cortesi, Sandra; Gasser, Urs; Lenhart, Amanda; & Duggan, Maeve. (2012, November). Parents, teens, and online privacy. Retrieved from http://pewinternet.org/Reports/2012/Teens-and-Privacy.aspx

Madden, Mary; Lenhart, Amanda; Duggan, Maeve; Cortesi, Sandra; & Gasser, Urs. (2013, March). Teens and technology 2. Retrieved from http://www.pewinternet.org/Reports/2013/Teens-and-Tech.aspx

McKee, Heidi A. (2008). Ethical and legal issues for writing researchers in an age of media convergence. Computers and Composition, 25, 104–122.

McKee, Heidi A., & DeVoss, Dànielle Nicole. (2007). Digital writing research: Technologies, methodologies, and ethical issues. Cresskill, NJ: Hampton Press.

Messick, Samuel. (1989). Validity. In Robert L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: American Council on Education and Macmillan.

Messick, Samuel. (1990). Validity of test interpretation and use. Research report ETS-RR-90-11. Princeton, NJ: ETS.

Messick, Samuel. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23 (2), 13–23.

Monroe, Barbara. (2004). Crossing the digital divide: Race, writing, and technology in the classroom. New York: Teachers College Press.

Morain, Matt, & Swarts, Jason. (2012). YouTutorial: A framework for assessing instructional online video. Technical Communication Quarterly, 21, 6–24.

Moss, Pamela A. (1992). Shifting conceptions of validity in educational measurement: Implications for performative assessment. Review of Educational Research, 62 (3), 229–258.

Murray, Elizabeth A.; Sheets, Hailey A.; & Williams, Nicole A. (2010). The new work of assessment: Evaluating multimodal compositions. Computers and Composition Online. Retrieved from http://www.bgsu.edu/cconline/murray_etal/index.html

NCTE. (2008). 21st century curriculum and assessment framework. Retrieved from http://www.ncte.org/positions/statements/21stcentframework

NCTE-WPA. (2008). White paper on writing assessment in colleges and universities. Retrieved from http://wpacouncil.org/whitepaper

Neal, Michael. (2011). Writing assessment and the revolution in digital texts and technologies. New York: Teachers College Press.

New London Group. (1996). A pedagogy of multiliteracies: Designing social futures. Harvard Educational Review, 66 (1), 60–92.

Ochsner, Robert, & Fowler, Judy. (2012). Evaluating essays across institutional boundaries: Teacher attitudes toward dialect, race, and writing. In Asao B. Inoue & Mya Poe (Eds.), Race and writing assessment (pp. 111–126). New York: Peter Lang.

Office of Institutional Research. (2011). 2011 Institutional profile. LaGuardia Community College, City University of New York. Retrieved from http://www.laguardia.edu/facts/factbooks.aspx

Office of Management and Budget. (1995). Standards for the classification of federal data on race and ethnicity. Federal Register, 60 (166), 44674–44693.

Omi, Michael, & Winant, Howard. (1994). Racial formation in the United States: From the 1960s to the 1990s (2nd ed.). New York: Routledge.

O’Neill, Peggy. (1998). Writing assessment and the disciplinarity of composition. Unpublished doctoral dissertation, University of Louisville.

O’Neill, Peggy. (2003). Moving beyond holistic scoring through validity inquiry. Journal of Writing Assessment, 1 (1), 47–65.

O’Neill, Peggy. (2011). Reframing reliability for writing assessment. Journal of Writing Assessment, 4 (1). Retrieved from http://journalofwritingassessment.org/archives.php?issue=11

Peckham, Irvin. (2012). Assessment and curriculum in dialogue. In Norbert Elliot & Les Perelman (Eds.), Writing assessment in the 21st century: Essays in honor of Edward M. White (pp. 169–186). Cresskill, NJ: Hampton Press.

Powell, Annette. (2007). Access(ing), habits, attitudes, and engagements: Re-thinking access as practice. Computers and Composition, 24 (1), 16–35.

Redd, Teresa. (2003). “Tryin to make a dolla outa fifteen cent”: Teaching composition with the Internet at an HBCU. Computers and Composition, 20, 359–373.

Remley, Dirk. (2011). The practice of assessing multimodal PowerPoint shows. Computers and Composition Online. Retrieved from http://www.bgsu.edu/departments/english/cconline/CCpptassess/index.html

Remley, Dirk. (2012). Forming assessment of machinima video. Computers and Composition Online. Retrieved from http://www.bgsu.edu/departments/english/cconline/cconline_Sp_2012/SLassesswebtext/index.html

Rose, Mike. (1995). Lives on the boundary: A moving account of the struggles and achievements of America’s educationally unprepared. New York: Penguin.

Royer, Dan, & Gilles, Roger. (2003). Directed self-placement: Principles and practice. Cresskill, NJ: Hampton Press.

Selber, Stuart. (2004). Multiliteracies for a digital age. Carbondale: Southern Illinois University Press.

Selfe, Cynthia L. (1999). Technology and literacy in the twenty-first century: The importance of paying attention. Carbondale: Southern Illinois University Press.

Selfe, Cynthia L. (2007). Multimodal composition: Resources for teachers. Cresskill, NJ: Hampton Press.

Smitherman, Geneva. (1993, November 18). “The blacker the berry, the sweeter the juice”: African American student writers and the National Assessment of Educational Progress. National Council of Teachers of English Conference, Pittsburgh, PA.

Sorapure, Madeline. (2006). Between modes: Assessing student new media compositions. Kairos: A Journal of Rhetoric, Technology, and Pedagogy, 10 (2). Retrieved from http://kairos.technorhetoric.net/10.2/coverweb/sorapure/

Stanford Study of Writing. (2008). Retrieved from http://ssw.stanford.edu/

Takayoshi, Pamela. (1996). The shape of electronic writing: Evaluating and assessing computer assisted writing processes and products. Computers and Composition, 13 (2), 245–257.

Warschauer, Mark. (2003). Demystifying the digital divide. Scientific American. Retrieved from http://www.scientificamerican.com/article.cfm?id=demystifying-the-digital

Warschauer, Mark, & Matuchniak, Tina. (2010). New technology and digital worlds: Analyzing evidence of access, use, and outcomes. Review of Research in Education, 34, 179–225.

Wei, Lu, & Hindman, Douglas B. (2011). Does the digital divide matter more? Comparing the effects of new media and old media use on the education-based knowledge gap. Mass Communication and Society, 14 (2), 216–235.

Weigle, Sara C. (2002). Assessing writing. Cambridge, UK: Cambridge University Press.

Weigle, Sara C. (2013). English language learners and automated scoring of essays: Critical considerations. Assessing Writing, 18, 85–99.

White, Edward M., & Thomas, Leon L. (1981). Racial minorities and writing skills assessment in the California State University and colleges. College English, 43 (3), 276–283.

White, Edward M. (1994). Teaching and assessing writing (2nd ed.). Portland, ME: Calendar Islands Publisher.

Whithaus, Carl. (2005). Teaching and evaluating writing in the age of computers and high-stakes testing. Mahwah, NJ: Lawrence Erlbaum Associates.

Williamson, Michael, & Huot, Brian. (1993). Validating holistic writing assessment: Theoretical and empirical foundations. Cresskill, NJ: Hampton Press.

Wysocki, Anne Frances; Johnson-Eilola, Johndan; Selfe, Cynthia L.; & Sirc, Geoffrey. (2004). Writing new media: Theory and applications for expanding the teaching of composition. Logan: Utah State University Press.

Yancey, Kathleen Blake. (1999). Looking back as we look forward: Historicizing writing assessment. College Composition and Communication, 50 (3), 483–503.

Yancey, Kathleen Blake. (2004). Looking for sources of coherence in a fragmented world: Notes toward a new assessment design. Computers and Composition, 21 (1), 89–102.

Yancey, Kathleen Blake. (2009). Reflection and electronic portfolios: Inventing the self and reinventing the university. In Darren Cambridge, Barbara Cambridge, & Kathleen Blake Yancey (Eds.), Electronic portfolios 2.0 (pp. 5–17). Washington, DC: Stylus.

Zdenek, Sean. (2009). Accessible podcasting: College students on the margins in the new media classroom. Computers and Composition Online. Retrieved from http://seanzdenek.com/article-accessible-podcasting/