The Pequod
Dr Alistair Brown
Associate lecturer in English Literature; researching video games and literature

Online Text Databases and the Literary Canon


Academic text archives make a wide range of texts readily available to scholars, and they can disguise the categorical markers found in their physical counterparts, eliminating many of the paratextual features which position a work within a particular tradition of authorship and readership and enabling the new reader to approach with fewer preconceptions. However, because of the audience for whom the databases are compiled, electronic texts are still also anteriorly positioned within canonical traditions and, broadly, they reflect rather than affect existing literary-critical prejudices.


Although academic text archives make a wide range of texts (including otherwise scarce works) readily available to scholars, with their high subscription charges they remain closed to the public and, in order to retain their status as authoritative sources with the institutions from which they derive their income, the databases necessarily employ a range of exclusions, both in the texts they cover and in their organisation.1 Electronic texts can disguise the categorical markers found in their physical counterparts, eliminating many of the paratextual features which position a work within a particular tradition of authorship and readership and enabling the new reader to approach with fewer preconceptions: with the bland egalitarianism of vanilla plain text, there is no such thing as a popular edition or a prestige hardback. Danny Karlin, in compiling The Penguin Book of Victorian Verse (1997), used the Chadwyck-Healey English Poetry collection (at that time held on CD-ROM). He noted that:

English Poetry doesn't allow you to judge a book by its cover... The poets file past in alphabetical order, without notation of hierarchy, without biographical or critical tags, without distinction in the way they are treated. Since I hadn't the least idea who most of them were to start with, I came to their poetry without preconception except in terms offered by the poems themselves - their own declarations of intent as to genre, or verse-form, or subject, their proclamations of ideological allegiance, their more subtle signs of stylistic affinity.2

However, because of the audience for whom the databases are compiled, electronic texts are still also anteriorly positioned within canonical traditions and, broadly, they reflect rather than affect existing literary-critical prejudices.

Only in relatively prosaic ways, by enabling previously-impossible empirical studies, may databases encourage reconsiderations of the canon. Recently, electronic texts and databases have been used to evaluate readerships, uncover historical shifts in word usage, expose recurring features of the short story, reveal signs of Alzheimer's disease implicit in Iris Murdoch's novels, and "find" unattributed work actually written by Henry James.3 These approaches place no special emphasis on the corpus of texts which conventionally have been judged to represent the unique style of an author or genre at its finest, and rely instead on the wide range of data searched. Studies like these may prompt aesthetic re-evaluations: knowing that certain of Murdoch's later novels were limited by medical factors beyond her artistic control may cause them to be criticised more sympathetically in relation to her earlier masterpieces; the discovery of "new" texts by an anonymous James will alter opinions of how his style developed to its culminating achievement. However, these empirical methods are pursued in the light of existing emphases of literary criticism. To study Murdoch or James implies an awareness that these authors are more significant than the many others who have suffered from mental illness or who published anonymously; the algorithms used to examine the genre of the short story were objective, but a human editor chose the 600 texts, which he believed constituted the short story, to come under their scope.

Locally, then, databases offer new mechanisms for objective interpretation; globally, they reflect existing canonical demarcations. For example, Literature Online (LION) could be used to research the prevalence of tropes of disease in literature of the nineteenth century. A search for "sickness or illness or sick or ill" (my area of interest) occurring in works published between 1800 and 1900 returns 14046 occurrences. Since each "hit" is displayed in its immediate textual context, it would be possible to scan through the results in a few days' intense reading, during which the search could be narrowed easily by excluding recurring but irrelevant usages, such as "ill-feeling." Repeating this process in libraries would take vastly longer, since books must be obtained, opened and searched; consequently, the programme of study would involve collections of works already covering the theme or period, from which a sense of the dominance of the trope could be drawn out. The LION approach, however, as well as finding widely-anthologised writers who are already recognised as dealing with illness - Poe, the Brontës, de Quincey - also brings to attention writers who might not otherwise have been considered, such as Emerson Bennett and Sarah Hale.
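The search-and-narrow procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-text corpus is invented, the names CORPUS, SEARCH, EXCLUDE and keyword_hits are hypothetical, and nothing here represents LION's actual search interface.

```python
import re

# Hypothetical toy corpus; titles and sentences are invented for illustration.
CORPUS = {
    "Text A": "She was sick with fever, and the illness lingered.",
    "Text B": "There was much ill-feeling between the brothers.",
    "Text C": "The sickness passed; he was no longer ill.",
}

# The four search terms from the essay, matched as whole words.
SEARCH = re.compile(r"\b(sickness|illness|sick|ill)\b", re.IGNORECASE)
# A recurring but irrelevant usage that the researcher chooses to exclude.
EXCLUDE = re.compile(r"\bill-feeling\b", re.IGNORECASE)


def keyword_hits(corpus, context=20):
    """Return (title, term, snippet) for each hit, skipping excluded usages."""
    hits = []
    for title, text in corpus.items():
        excluded_spans = [m.span() for m in EXCLUDE.finditer(text)]
        for match in SEARCH.finditer(text):
            # Drop a hit that falls inside an excluded phrase.
            if any(start <= match.start() < end for start, end in excluded_spans):
                continue
            lo = max(0, match.start() - context)
            hi = min(len(text), match.end() + context)
            # Each hit is kept with its immediate textual context.
            hits.append((title, match.group(0).lower(), text[lo:hi]))
    return hits


if __name__ == "__main__":
    for title, term, snippet in keyword_hits(CORPUS):
        print(f"{title}: [{term}] ...{snippet}...")
```

Without the exclusion, "ill" would match inside "ill-feeling" because the hyphen counts as a word boundary, which is exactly the kind of irrelevant usage a reader would want to filter out while scanning results.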

However, as with the empirical studies, this highlighting of marginalised authors indicates nothing per se about their works' aesthetic qualities. Danny Karlin admitted that "Although I found many interesting and beautiful poems on the database, I did not discover another great poet. If there is an undiscovered Victorian poetic genius, he or she is not included in English Poetry (I didn't find one anywhere else, either)."4 Further, although unknown authors are presented alongside the familiar on screen, the database as a whole has already made a significant subjective aesthetic judgement. Were LION to hold texts from the non-fictional context - patient records, medical research, collections of folk remedies - an overwhelming number of results would be returned. Thus, in order to retain its time-saving functionality with the literary scholars from whom it derives its income, LION is edited with a conscious bias towards works of literary merit. Although during the nineteenth century creative literature (as opposed to religious tracts, law books, educational texts, pamphlets and other ephemera) did not account for more than one third of printed material, it is this minority which is covered by LION.5 Literature remains defined and accessed as a canon of "literary" works with aesthetic value, rather than of all writing. Empirical methods do not make aesthetic judgements but, paradoxically, as the databases expand in their coverage, thereby enabling these more comprehensive and objective, search-based, approaches, subjective editorial inclusions and exclusions are the only things which keep the databases' information from overloading, and these judgements are often informed by traditional anthologies and criticism.
For example, the editors of LION used The Bibliography of American Literature - supplemented with additional poets brought forward by the editorial board - in determining which texts and authors to include in this section of the collection; for the English Poetry component, they used the New Cambridge Bibliography of English Literature (1969-72).6

Even with the sweeping exclusion of large amounts of literature judged to be non-literary, as the volume of information continues to increase, so too must the amount of metadata appended to it. In the continually tense drive to ensure the quality of results keeps pace with quantity, the apparatus which allows collections to be ordered sequentially (by date, name, length) also invites their restructuring through significant ideological paratexts which help the scholar to focus on their specific area of interest. For example, LION divides the twentieth-century poets into "American Poetry of the Postwar Period" and "African-American Poetry," and enables searches within these sub-categories. This maintains the distinctions made in physical collections such as the Norton Anthologies, which in turn reflect the break-up of courses ("African-American Studies," "American Studies") within the university and the classification of books within the library. Theoretically, the electronic text may be hermetically sealed from markers of privilege or existing status, and thus prevent preconceptions, which always already infect aesthetic judgements and criticisms, of where a text sits within a tradition. However, because of metadata (a kind of digital paratext), texts are arranged based on distinctions of gender, ethnicity or period, rather than only on the incontestable order of the alphabet.

Although they cover the literary canon with a comprehensiveness not found in any single anthology or library, electronic collections accurately reflect political factors in canon formation. Narrowing the search for writings on illness to female authors returns 2474 results, leaving 11572 by male authors; women, apparently, account for the production of just under 18% of creative writing concerning this subject. This significant difference might be rationalised as being because fewer women wrote about physical health than men, a revealing statistic which would raise an intriguing research question. However, the results are fully in line with the relative genders of authors working from 1800 to 1900, of whom around 19% were female. This figure was derived by Richard Altick using the Cambridge Bibliography of English Literature (1969-72), which provided details on 849 authors (about half that offered by LION for the corresponding period).7 In fact, therefore, it is the bias of the LION collection, which reflects the minority status of female writers in the nineteenth century, which accounts for the discrepancy in writings on illness. The awareness that the database provides unprecedented coverage compared to the traditional anthology should not lead to the impression that it does anything other than accurately reflect the sociology of literary writing in the nineteenth century, one which was equally well-reflected by the less comprehensive CBEL volumes thirty years before the advent of the computerised collection.
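The arithmetic behind this comparison can be checked directly. The sketch below uses only the figures quoted in the paragraph above (14046 total hits, 2474 by female authors, and Altick's figure of around 19% female authors); the variable names are hypothetical, and the comparison is a rough illustration rather than a statistical test.

```python
# Figures quoted in the essay.
TOTAL_HITS = 14046             # all hits for the illness search, 1800-1900
FEMALE_HITS = 2474             # hits once the search is narrowed to female authors
BASELINE_FEMALE_SHARE = 0.19   # approximate share of female authors (Altick)

# Hits attributable to male authors, by subtraction.
male_hits = TOTAL_HITS - FEMALE_HITS

# Share of the illness hits that come from female authors.
female_share = FEMALE_HITS / TOTAL_HITS

# Gap between the observed share and the baseline, in percentage points.
gap_points = (BASELINE_FEMALE_SHARE - female_share) * 100

if __name__ == "__main__":
    print(f"Male-authored hits: {male_hits}")
    print(f"Female share of hits: {female_share:.1%}")
    print(f"Gap from baseline: {gap_points:.1f} percentage points")
```

The observed share sits within about a percentage point and a half of the baseline proportion of female authors, which is why the apparent imbalance says more about the composition of the collection than about who wrote on illness.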

Thus, it is less in the academic databases as they are edited at present, and more in the area of electronic self-publishing, that political biases - introduced by the presence of an editor and the particular readership for which he edits - may be reduced. Historically, economic factors have been significant in canon formation: texts have dropped out of consideration because all their manuscripts have been destroyed (or had limited dissemination), whilst the reliance on patronage and publishers establishes a relationship between writer and the audience for which he consciously writes. However, with new copyright policies (such as Creative Commons) ensuring that no author or publisher possesses control of the electronic text as an artefact, a work published on an open web page theoretically is available to as many people as have access to the Internet.8 If the issues of archiving and "future-proofing" that information can be overcome, potentially every work to be published in the future may be preserved in digital form, providing a permanent corpus from which works can be studied retrospectively.9 As with the academic databases, works published online are not significantly distinguished by their physical frames. But in publicly-accessible websites a new marker - one which is based on aesthetic judgements rather than on using ideologically-infected metadata to structure collections - substitutes for the loss of these indicators of status. In public text archives, or online bookshops such as Amazon, texts can be ranked according to their popularity (the number of times they have been downloaded), and the ratings provided by previous readers.10 This foregrounds the processes of aesthetic interpretation which have been pressed behind the scenes in the academic archives.

However, the electronic medium bears only the promise of reducing economic and political factors in canon formation, since the popular proclamation that the texts found on the Internet (either within academic databases or in informal online publications) democratically represent, or are equally accessible to, the body of global authors and readers, is a fallacy. The "democracy" of the web and the "accessibility" of the Internet are rhetorical phrases commonly proclaimed by the media and political institutions to describe the globalisation of information; however, these claims, when used in a literal sense, are not endorsed statistically: fewer people worldwide can access the Internet, with its online books, than have access to paper media through public libraries.11 When the domain names of government and company websites of the United States are the only ones without a national suffix (as in .uk or .fr), and the national suffixes of countries such as Tuvalu (.tv) are bought up by Western media companies, it is clear that the Internet, though a distributed system without an organising centre as such, is still heavily influenced by the traditional geographies of capitalism and politics. In the future, as e-learning technology is sponsored by major international organisations such as UNESCO, the Internet may become open for all, whilst improvements in automatic translation technology may mean that the notion of a national canon distinguished by the writing in a particular language, accessible only to those fluent in it, may become obsolete. These possibilities, however, are in advance of the technological capacity and the political will of the present moment. The genuine power of the Internet as it stands today is that once a user does have access to it, there is a massively wider range of information available than would be obtainable from any library which a reader, academic or otherwise, could visit in person.
But with this increased amount of information, the ideologies, concepts of the canon, and social pressures that structure it at present are preserved, even with an increased influence, not eliminated.


  1. Literature Online, ed. Dan Burnstone, vers. 05.1, Durham University Library, 6 Mar. 2005 <http://lion.chadwyck.co.uk>; Early English Books Online, Durham University Library, 6 Mar. 2005 <http://eebo.chadwyck.com/>; Eighteenth Century Collections Online, 6 Mar. 2005 <http://www.gale.com/EighteenthCentury/index.htm>.

  2. Danny Karlin, "Victorian Poetry and the English Poetry Full-Text Database: A Case Study," Literature Online, ed. Dan Burnstone, vers. 05.1, 2 Mar. 2005 <http://lion.chadwyck.co.uk>.

  3. The Reading Experience Database, ed. Mary Hammond and Simon Eliot, British Library and Open University, 6 Mar. 2005 <http://www.open.ac.uk/Arts/RED/>; Richard Harp, "Using Literature Online to Analyse Historical Word Usage," Literature Online, ed. Dan Burnstone, vers. 05.1, 1 Mar. 2005 <http://lion.chadwyck.co.uk/infoCentre/casestudies.jsp#harp>; Helmut Bonheim, The Narrative Modes: Techniques of the Short Story (Cambridge: DS Brewer, 1992); Peter Garrard, Lisa M. Maloney, John R. Hodges, and Karalyn Patterson, "The Effects of Very Early Alzheimer's Disease on the Characteristics of Writing by a Renowned Author," Brain 128.2 (2004): 250-260; Thomas Jones, "Short Cuts," rev. of The Uncollected Henry James: Newly Discovered Stories, by Floyd Horowitz, London Review of Books 23 Sept. 2004.

  4. Danny Karlin, "Victorian Poetry and the English Poetry Full-Text Database: A Case Study," Literature Online, ed. Dan Burnstone, vers. 05.1, 2 Mar. 2005 <http://lion.chadwyck.co.uk>.

  5. Simon Eliot, Some Patterns and Trends in British Publishing, 1800-1919 (London: The Bibliographical Society, 1994) 58. This bias is a particular problem of the LION database, which unlike more specialist collections seeks to cover the full period of writing. In this respect, EEBO, with its focus falling on the fifteenth to eighteenth centuries, when the proportion of creative literature was much smaller, has been able to digitise a greater range of works, including religious tracts, letters, legal documents, as well as drama, poetry and fiction. Thus EEBO may encourage scholars of the period to make broader interpretations of what constitutes literature, whilst it also attracts scholars with broadly socio-historical, as well as purely literary-critical, interests.

  6. "Information Centre: Content and Editorial Policy - Literature Collections," Literature Online, ed. Dan Burnstone, vers. 05.1, Durham University Library, 15 Mar. 2005 <http://lion.chadwyck.co.uk/infoCentre/editpolicy2.jsp>.

  7. Richard D. Altick, "The Sociology of Authorship: The Social Origins, Education and Occupations of 1,100 British Writers, 1800-1900," Bulletin of the New York Public Library 66 (1962): 389-404, at 392.

  8. Creative Commons, ed. Lawrence Lessig, 2002, 8 Mar. 2005 <http://creativecommons.org>.

  9. For a description of these issues, and the strategies by which they may be resolved, see Digital Preservation, 2005, US Library of Congress, 8 Mar. 2005 <http://www.digitalpreservation.gov/>.

  10. For a fairly comprehensive list of such sites, see "Online Writing," Open Directory, 15 Mar. 2005 <http://dmoz.org/Arts/Online_Writing/>.

  11. According to a survey conducted by the Graphic Visualisation and Usability Centre, 92% of net users in 1998 were based in Europe or America: "GVU's 10th WWW User Survey 1998," GVU's WWW User Surveys, ed. Jarek Rossignac, October 1998, Georgia Institute of Technology, 2 Mar. 2005 <http://www.gvu.gatech.edu/user_surveys/survey-1998-10/graphs/general/q50.htm>. In comparison, the distribution of libraries around the world shows only 63% of libraries are found in Europe and America: "Worldwide Guide to Libraries," 1 Stop Data Limited, 2 Mar. 2005 <http://www.1stopdata.com/datacard_worldwide_guide_to_libraries.htm>. Although only a very rough measure, the extensive gulf between the two figures strongly implies that the printed word is more fairly accessible to a global audience than the electronic. That is not to detract, however, from the genuine power of the Internet, which is that once a user does have access to it, there is a greater range of information available than would be obtainable from any library which a reader could visit in person.

This page was published on March 9, 2005 | Keywords: etexts, canon, online texts, ebook

The content of this website is Copyright © 2016 using a Creative Commons Licence. One term of this copyright policy is that plagiarism is theft. If you use information from this website in your own work, you should use the correct citation.
