about the project

Please Note: This page hasn't been updated for quite a while (sorry!). You may want to access Tyler's (2007) PWPL paper for a more recent introduction.

introduction

The North Carolina Sociolinguistic Archive and Analysis Project (NC SLAAP) is a research and preservation initiative being conducted jointly between the North Carolina Language and Life Project (NCLLP) and the North Carolina State University Libraries. The NCLLP is a sociolinguistic research initiative at North Carolina State University (NCSU) with one of the largest audio collections of sociolinguistic data on American English in the world. It consists of approximately 1,500 interviews conducted from the late 1960s up to the present, most on analog cassette tape, but some in formats ranging from reel-to-reel tape to digital video. The collection features the interviews of Walt Wolfram, Erik Thomas, Natalie Schilling-Estes, Kirk Hazen, and numerous other scholars. (For more information about the NCLLP see http://www.ncsu.edu/linguistics/.) The NCSU Libraries have joined with the NCLLP to support and advance the digitization and preservation of this extensive audio collection.

In brief, NC SLAAP has two core goals: (1) to preserve the NCLLP's recordings through digitization; and (2) to enable and explore new computer-enhanced techniques for sociolinguistic analysis.

The digitization of this archive is a major task in its own right, with important dividends. For analysts, the centralization of the digitization process ensures a consistent method, as opposed to the ad hoc approach scholars are often forced to follow, digitizing particular tapes (or parts of tapes) as needed for specific projects. Meanwhile, from the Libraries' perspective, digitizing the audio collection of the NCLLP is an important task in itself on the grounds that it can increase access (where permissible) to this significant linguistic corpus. It's also an opportunity for the Libraries to gain experience in the digitization, storage, and access management of a large audio collection with a relatively high level of use. Many archivists and librarians also believe that digitization is a good method of preserving an audio collection. Academic libraries may still be less expert than some commercial organizations when it comes to digitizing and storing audio, but they may be even less equipped to maintain analog audio collections properly (cf. Brylawski 2002, Smith, Allen, and Allen 2004). Some archivists and librarians have pointed out that digitization and storage of audio may not be worth the expense and difficulty if the sole goal is preservation (cf. Puglia 2003). However, when scholarly digital projects can contribute to the advancement of a discipline - as in the case of NC SLAAP - the advantages seem well worth the investment.

Beyond the archive, the features of the SLAAP software have potentially tremendous implications for a wide range of linguistic approaches.
Figure 1: SLAAP Library Browse View
Some examples of this are discussed below (see in particular the transcription theory and method section).

While many feature sets are still under development, SLAAP, even in its current state, provides a range of tools that greatly enhance the usability of the audio data. These features include a browsable and searchable interface to the archive collection (see Figure 1); an audio player with an annotation tool that allows users to associate searchable notes with specific times within the audio files (and to return to those particular passages at the click of the mouse); an audio extraction feature that allows users to download and analyze particular segments of audio files without having to download or locally store large files; sophisticated transcript display options; extensive search functionality; and some corpus-like analysis tools. (A select set of the software features is detailed on the features page.)
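The time-anchored annotation idea described above can be sketched in a few lines of code. This is a hypothetical illustration only, not SLAAP's actual implementation; all class and file names here are invented. The key point is that each note is keyed to an audio file and a timestamp, so a keyword search can return jump-to anchors rather than just matching text.

```python
# Hypothetical sketch of time-anchored, searchable annotations
# (not SLAAP's actual code; all names are invented for illustration).
from dataclasses import dataclass


@dataclass
class Annotation:
    audio_file: str  # which recording the note belongs to
    time: float      # offset in seconds where the note is anchored
    note: str        # free-text, searchable comment


class AnnotationStore:
    def __init__(self):
        self.annotations = []

    def add(self, audio_file, time, note):
        self.annotations.append(Annotation(audio_file, time, note))

    def search(self, keyword):
        """Return (file, time) anchors for notes containing the keyword,
        so a player could seek straight to the annotated passage."""
        return [(a.audio_file, a.time) for a in self.annotations
                if keyword.lower() in a.note.lower()]


store = AnnotationStore()
store.add("interview_042.wav", 312.5, "clear final -in' token here")
store.add("interview_042.wav", 815.0, "overlapping speech; check transcript")
print(store.search("-in'"))  # prints [('interview_042.wav', 312.5)]
```

Because the search result is a (file, offset) pair rather than plain text, a web audio player can use it directly to cue playback at the annotated moment.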

Importantly, the SLAAP software helps to address questions around the representation and tabulation of (socio)linguistic data, because it enables scholars to better access, check, and re-check their (and their colleagues') variable tabulations, analyses, and conclusions. In a sense, NC SLAAP is a test case for new ways of approaching linguistic analysis, using computers to maintain a strong tie between the core audio data and the analysts' representations of it.

theoretical raison d'être: the representation of linguistic data

An empirical discipline like sociolinguistics [...] requires, as a basic tenet, the continual questioning of assumptions and postulates. (Chambers 2003: 224)

Traditional methods in sociolinguistic analysis have often relied on repeated close listening to a set of audio recordings, counting the number of times particular linguistic variants occur rather than other variants (a classic sociolinguistic example is the tabulating of words using final -in' for final -ing; cf. Fischer 1958, Trudgill 1974, Horvath 1985, etc.). These tabulations are normally recorded into a spreadsheet using a program such as Microsoft Excel, or even just onto a hard copy tabulation sheet. The results are then presented as summaries in publications or conference papers as the "data" used for description, explanation, and theory building. Some approaches in linguistics, such as discourse analysis, rely heavily on the development of transcripts of the audio recordings and often focus analyses on the transcripts themselves and not the original recording or interview event. However, scholars following a wide variety of sociolinguistic approaches have repeatedly highlighted the confounds that arise from these treatments of pseudo-data (i.e., analysts' representations of the data) as data. Linguists such as Blake (1997) and Wolfram (e.g., 1993) have discussed problems relating to the tabulation and treatment of linguistic variables and raised the issue that individual scholars' methods are often not comparable. In discussing transcription theory, Edwards has repeatedly pointed out that "transcripts are not unbiased representations of the data" (Edwards 2001: 321). In general, the understanding that linguistic data is more elusive than traditional "hard science" data is widespread but not acted upon. NC SLAAP represents our effort to overcome some of these long-standing confounds and to argue that computer-enhanced approaches can propel sociolinguistic methodology into a new, more rigorous era.

Transcription theory and method, and the intervention NC SLAAP is attempting to make, are further discussed in the next section.

transcription theory and method

Treating transcription as primarily (if not solely) a technical procedure is a regression to the stance of naive realism with which the first photographs were viewed. Videocameras with microphones have replaced the camera with its lens, and "nature records itself" on magnetic tape. And then, as if we were printing positives from negatives, we inscribe the sounds in writing. Each of these steps of re-presentation is a transformation and each may be made in many different ways. (Mishler 1991: 261)

Improvements to the traditional text transcript are extremely important because the transcript is often the chief mediating apparatus between theory and data in language research. Language researchers have long been concerned with the best method and format for transcribing natural speech data (cf. Bucholtz 2006, Du Bois 2006, Edwards 2001, Edwards and Lampert 1993, Mishler 1991) and how best to analyze existing transcripts (e.g., Miethaner 2000). Researchers frequently incorporate a number of different transcription conventions depending on their specific research aims. Discourse analysts (e.g., Ochs 1979) traditionally focus most heavily on transcription as theory and practice, but researchers studying language contact phenomena (e.g., Poplack 1980) also have their own transcription conventions for analyzing and presenting their data. At the other end of the spectrum are variationists and dialectologists, who also use transcripts, even if often only for presentation and illustration.

Despite the importance of the transcript for most areas of linguistics, little work has been done to enhance the usability and flexibility of our transcripts. Yet the way a researcher builds a transcript has drastic effects on what can be learned from it (Edwards 2001). Concerns begin with the most basic decision about a transcript: how to lay out the text. Further decisions must be made throughout the transcript-building process, such as decisions about how much non-verbal information to include and how to encode minutiae such as pause-length and utterance overlap. Furthermore, the creation of a transcript is a time- and energy-intensive task, and researchers commonly discover that they must rework their transcripts in mid-project to clarify aspects of the discourse or speech sample.

The SLAAP software seeks to improve the linguistic transcript by moving it closer to the actual speech that it ideally represents without, necessarily, attempting to encode transcriber interpretations into the transcript itself (Kendall 2005).
Figure 2: Transcript Line Analysis, with Audio Player, Spectrogram, and Graph of Pitch
In the SLAAP system, transcript text is treated as annotations on the audio data: transcript lines are based on phonetic utterances (silence-speech-silence). These lines are stored in the database - each as its own entry - and directly tied to the audio file through timestamping of utterance start and end times. Transcript information can be viewed in formats mimicking those of traditional paper transcripts, but can also be displayed in a variety of dynamic ways - from the column-based format discussed by Ochs (1979) to a finer-level focus on an individual utterance complete with phonetic information (as shown in Figure 2).
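The utterance-per-row storage model described above can be sketched as a small database schema. This is a speculative illustration, not SLAAP's published schema; the table, column, and file names are invented. What it shows is the core idea: because each transcript line carries its own start and end timestamps, the same rows can back a traditional transcript view, a column-per-speaker layout, or a request for the exact audio span of one utterance.

```python
# Hypothetical sketch (not SLAAP's actual schema): transcript lines stored as
# individually time-stamped database rows tied to an audio file.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transcript_lines (
        audio_file TEXT,   -- which recording the line annotates
        speaker    TEXT,   -- speaker code
        start_time REAL,   -- utterance onset, seconds into the file
        end_time   REAL,   -- utterance offset, seconds into the file
        text       TEXT    -- the transcribed utterance
    )
""")
lines = [
    ("interview_042.wav", "INT", 10.20, 12.85, "So where did you grow up?"),
    ("interview_042.wav", "SUB", 13.40, 17.10, "Right here, born and raised."),
]
conn.executemany("INSERT INTO transcript_lines VALUES (?, ?, ?, ?, ?)", lines)

# With timestamps on every row, one query yields the audio span for a single
# utterance, e.g. to extract just that segment for phonetic analysis:
row = conn.execute(
    "SELECT start_time, end_time FROM transcript_lines "
    "WHERE speaker = 'SUB' ORDER BY start_time LIMIT 1"
).fetchone()
print(row)  # prints (13.2, ...)-style (start_time, end_time) tuple
```

The design choice worth noting is that the transcript is never stored as one monolithic text document: every display format is generated on the fly from the same timestamped rows, so reformatting a transcript never requires re-transcribing.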

While many approaches to transcription theory focus on ways of marking up and tagging transcripts to build accurate representations for the project at hand (cf. discussions in Edwards 2001, Ochs 1979), SLAAP deals with transcripts differently and instead treats them as simply as possible, more as links to the actual audio than as attempts at representation in their own right. (For a more technical discussion of the treatment of transcripts in SLAAP, see the transcript section of the features page.)

future directions

NC SLAAP is, I hope, just the beginning of a long-term exploration of new, computer-enhanced approaches to the study of spoken language in general and sociolinguistic analysis in particular. In many ways, this project started as simply an attempt at proof of concept - with the hope that by reorganizing and rethinking our linguistic data and representations of that data we can push the field forward. It is perhaps nascent enough to be considered still in this proof-of-concept phase. However, the benefits of an approach like SLAAP seem clear. Even ignoring the features that attempt to improve existing approaches or develop new analytic techniques (see the brief analysis section of the features page), the improvements that SLAAP provides over the traditional alternatives for linguistic data management (like audio tapes in a cabinet) are far reaching. Digital and locationally unconstrained access to the archive alone is of great benefit to researchers.

If we are able to find ways to improve methods and analytic rigor - and I think we are - these improvements should open up new avenues for theoretical advances in the field. To be more confident with our treatment of our data and others' data (in terms of how the data are represented and tabulated, as discussed above) will allow us to move our thinking in new directions beyond concerns about method and accuracy.

The software and archive are finally developed to the point where they can be used by scholars for analysis. However, the SLAAP software is still poised for further development and exploration along many different lines. It's possible to imagine new software features of all sorts that would be beneficial to researchers (automated/semi-automated part of speech (POS) tagging, better phonetic analyses, user-to-user messaging tools to share, record, and track research notes and communications, and so on) and continued programming work is planned at least into the near future. 2006, I hope, will see a number of research projects and papers by myself and other members of the NCLLP which utilize aspects of NC SLAAP.

As of this writing, perhaps only 10% to 15% of the NCLLP's interviews are digitized and in the archive. This number is admittedly small, but, I think, does not take away from the claim that the project is making steps in the right direction. In fact, the new research angles that the project opens up (for example, I'm currently working on a quantitative analysis of pause using NC SLAAP), even with this somewhat small archive, are exciting.

Finally, outside funding and inter-institutional partnerships could enable a larger, more comprehensive digitization venture. I hope to move in these directions in the near future. It's possible to imagine a comprehensive archive of sociolinguistic interviews beyond the holdings of the NCLLP. Through the Internet, this archive could really be composed of many small archives, each housed by its own institution or scholar but still accessed through the same web interface - a true digital gateway to language variation.

Tyler Kendall
Raleigh, NC
February 13th, 2006

Cite: Kendall, Tyler (2006). About the North Carolina Sociolinguistic Archive and Analysis Project. Accessed 11/18/2017.


With thanks to the North Carolina State University Libraries, the North Carolina Language and Life Project, and the William C. Friday Endowment at NC State University for their support.  © Tyler Kendall
last mod: 11/18/2017