Saturday, May 27, 2006

Proposal for a Guided Book Reading Web Application


When I came across an online chapter of a book, in which footnotes are added to words, with a translation of these words in the margin, I got the following idea.

Wouldn't it be nice if there were a web application that takes a text from, say, Project Gutenberg or Wikisource, and then creates such a margin automatically? For this, one needs to know for each word how difficult it is, or how early it should be learned. These ratings could be edited manually. The reader could then choose how much help to get by setting a threshold parameter, the minimal difficulty: only words rated above it get a translation. A next step would be to include sayings and the like, but this is of course much more difficult and not really necessary.

Before I start implementing it, I would like to have some feedback from other people.

The Texts

First of all there are several questions we need to ask ourselves concerning the text.

  • What should the format of the text be? It is hard to start from an unstructured text and many texts are in HTML, therefore it might be a good idea to start from HTML/XHTML.

  • Examples of interesting free texts can be found at:
    1. Project Gutenberg: contains many documents in HTML, ASCII and Plucker (probably the biggest and therefore the most interesting). Also Project Gutenberg can be downloaded as a whole, which gives us the possibility not to use too much of their bandwidth.

    2. Wikisource: contains quite a few documents in many languages, but the text is rather unstructured.

Rated Word List

There are different approaches to make a rated word list, and here I shall discuss the one that I think is the best.
  • Rather than using a word list based on frequency, it is better to build the rated word list by hand. A list of frequencies is not precisely what one wants. What one really wants is a rating that tells you how early you should learn the word. Because a list of frequencies is static, it does not leave room for improvement. When one rates every word by hand, the list will gradually improve. Moreover, different people can rate, so a combined, averaged list will represent a general opinion. Furthermore, it is not that much work to get the most important words rated, especially when one does it while reading.

  • My idea is the following: We make a scale from 1 to 10, in which 1 corresponds to the words one should learn first and 10 to the most unused words. For each number we then make a list of sample words that can be used as a reference, for instance in the English language:
    1. the, a, an, it, ...

    2. before, under, after, because, ...

    3. etc.

    There is also the possibility of putting all articles in category 1, all prepositions in category 2, et cetera, which makes it simpler to decide in which category some words belong. Of course I chose the number 10 rather arbitrarily here; does anyone have a better suggestion?

  • There might be an advantage in keeping user accounts with user-rated lists. A general rated list can then be formed from these user-rated lists. Furthermore, each user can keep track of a list of words that might have a very high rating but that the user just happens to know. If translations of these words keep showing up, the user can remove this redundant information by adding them to his known-words list.

  • Sometimes a word might not be in the dictionary. Then we can show the word in red and give the user the opportunity to either add a translation of the word to the dictionary (and give it a rating), or add the word to the known-words list.
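
The combination of user-rated lists and a known-words list described above could look roughly like the sketch below. It is only an illustration under assumptions: the function names `combine_ratings` and `needs_translation` are made up, plain averaging is just one way to form the general list, and a word whose rating equals the threshold is treated as not needing a translation.

```python
from statistics import mean


def combine_ratings(user_ratings):
    """Average the per-user ratings (scale 1 to 10) into one general list.

    user_ratings maps a user name to that user's {word: rating} list.
    Plain averaging is an assumption; other ways of combining are possible.
    """
    collected = {}
    for ratings in user_ratings.values():
        for word, rating in ratings.items():
            collected.setdefault(word, []).append(rating)
    return {word: mean(values) for word, values in collected.items()}


def needs_translation(word, general_ratings, known_words, threshold):
    """Decide whether a translation should be shown for this reader."""
    if word in known_words:
        # The reader put the word on the personal known-words list.
        return False
    rating = general_ratings.get(word)
    # Unrated words are handled separately (they are shown differently).
    return rating is not None and rating > threshold


general = combine_ratings({
    "alice": {"the": 1, "before": 2},
    "bob": {"before": 4},
})
print(general)  # {'the': 1, 'before': 3}
print(needs_translation("before", general, known_words=set(), threshold=2))
```

A reader who adds "before" to the known-words set would no longer get its translation, whatever the general rating says.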

The Dictionary

This section contains some thoughts about what kind of dictionary we can use and where we can find it.
  • It would save some work in making the rated list if we could find a way to deal with the morphology of words, that is, the different forms of verbs, the adjective-adverb pairs formed by adding or removing 'ly', etc. The key thing here is to identify a group of words that, due to some abstract grammatical relation, must necessarily have the same rating. The best thing would be a dictionary that has this information and in this way helps to identify different forms of the same word.

  • We can use several dictionaries at the same time, since we'll digest them into a word list with translations (and morphology). For each dictionary we can write a separate harvesting program.

  • There are several things we are looking for in the dictionary:
    1. It should be possible to digest the right information from it. That is, word translations and possibly morphology.

    2. It should be free, allowing us to use it.

    3. It would be nice if it supported many languages.

    4. It would be nice if it was relatively complete.

  • Possible dictionaries are:
    1. Wiktionary

    2. Loco

    3. Dictionaries on Gutenberg
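
Digesting several dictionaries into one word list, and letting all forms of a word share one rating, could be sketched as follows. This is a rough illustration, not a design: the names `merge_dictionaries`, `group_rating` and `base_forms` are hypothetical, and the table that maps a word form to its base form is assumed to come from a dictionary that carries morphology.

```python
def merge_dictionaries(*harvested):
    """Merge the output of several harvesting programs into one word list.

    Each harvested dictionary maps a word to a list of translations.
    Duplicate translations are dropped; the order of first appearance is kept.
    """
    merged = {}
    for dictionary in harvested:
        for word, translations in dictionary.items():
            entry = merged.setdefault(word, [])
            for translation in translations:
                if translation not in entry:
                    entry.append(translation)
    return merged


def group_rating(word, base_forms, ratings):
    """Look up the rating of a word through its base form.

    base_forms maps a word form to its base form (for example
    'quickly' -> 'quick'), so every form in a group shares one rating.
    Returns None when the group has not been rated yet.
    """
    base = base_forms.get(word, word)
    return ratings.get(base)


word_list = merge_dictionaries(
    {"go": ["gaan"]},
    {"go": ["gaan", "lopen"], "quick": ["snel"]},
)
print(word_list)  # {'go': ['gaan', 'lopen'], 'quick': ['snel']}
print(group_rating("quickly", {"quickly": "quick"}, {"quick": 3}))  # 3
```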

The Screen Layout

In this section I want to discuss the layout of the screen. This is important because a reader will spend a lot of time looking at it.
  • We need the following elements on the screen:
    1. Text block: a block with the foreign text that we want to read.

    2. Translation block: a block containing the translations of the difficult words.

    3. Threshold field: a field in which the user can specify the desired threshold.

    4. Dictionary/language button: a button where the reader can specify the language of the text.

  • We need the following elements in a pop-up edit window.
    1. Sample words block: a block containing sample words to guide the user while choosing a category.

    2. A rating field: a field in which the user rating of the word can be specified, together with the common rating.

    3. Translation field: a field that contains the possible translations of the word.

    4. Isknown field: a field with a boolean value that states if the word is known or unknown to the user.

  • Should we go for a dynamic or a static layout? That is, should the user decide where to place the blocks? I don't think this is necessary, because there is a small and fixed number of blocks.

  • The text block should be rather narrow for easy reading.

  • If the text in the text block is chopped up into pages that fit on the screen, then it is easier to read, and the text and the translations won't drift out of line with each other. On the other hand, it is much easier to make a layout that does not chop the text up into pages. Therefore I propose to start with the latter, and change to the former when everything works.

  • The words with a rating higher than the threshold should be marked with a 'footnote', a number in superscript. This won't be too distracting and links the words with their translations.

  • We have several different types of words occurring in the text. This is a proposal for distinguishing between these words (see the picture):
    1. Unrated words that appear in the dictionary (gray).

    2. Unrated words that don't appear in the dictionary (gray, underlined).

    3. Rated words whose rating is lower than the threshold (regular).

    4. Rated words whose rating is higher than the threshold (footnote).

  • The block with sample words in the category can be placed in a pop-up, thereby only showing them when it is necessary.

  • The following picture shows a proposal for a screen layout.
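
Independently of the picture, the four word types proposed in this section could be told apart as in the sketch below. Everything in it is an assumption made for illustration: the names `WordKind`, `classify` and `annotate`, the bracketed number `[1]` standing in for a superscript footnote, the choice to treat a rating equal to the threshold as "regular", and the rule that a word is simply a run of letters.

```python
import re
from enum import Enum


class WordKind(Enum):
    UNRATED_IN_DICTIONARY = "gray"
    UNRATED_NOT_IN_DICTIONARY = "gray, underlined"
    BELOW_THRESHOLD = "regular"
    ABOVE_THRESHOLD = "footnote"


def classify(word, ratings, dictionary, threshold):
    """Return which of the four proposed word types a word belongs to."""
    key = word.lower()
    rating = ratings.get(key)
    if rating is None:
        if key in dictionary:
            return WordKind.UNRATED_IN_DICTIONARY
        return WordKind.UNRATED_NOT_IN_DICTIONARY
    if rating > threshold:
        return WordKind.ABOVE_THRESHOLD
    return WordKind.BELOW_THRESHOLD


def annotate(text, ratings, dictionary, threshold):
    """Mark difficult words with a footnote number and collect translations."""
    footnotes = []

    def mark(match):
        word = match.group(0)
        if classify(word, ratings, dictionary, threshold) is WordKind.ABOVE_THRESHOLD:
            footnotes.append((word, dictionary.get(word.lower(), [])))
            return f"{word}[{len(footnotes)}]"
        return word

    # A "word" is a run of letters only (no digits, hyphens or underscores).
    marked = re.sub(r"[^\W\d_]+", mark, text)
    return marked, footnotes


ratings = {"the": 1, "melancholy": 8}
dictionary = {"melancholy": ["zwaarmoedig"], "cat": ["kat"]}
marked, notes = annotate("The melancholy cat purred.", ratings, dictionary, threshold=5)
print(marked)  # The melancholy[1] cat purred.
print(notes)   # [('melancholy', ['zwaarmoedig'])]
```

The collected footnotes are what the translation block would show next to the text block.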

General Remarks

There are some general remarks that I would like to make.
  • For the accessibility I want it to be a web application. Does anyone have an idea what programming tools would be suitable? Do I hear AJAX?
  • It might be rather easy to make printable PDF versions, by exporting to a TeX file that adds the translations as footnotes at the bottom of the page. This of course misses the dynamic options of rating words and changing the threshold on the spot, but might serve as a prototype while 'getting things to work.'

  • We could make a showcase of books that have been read thoroughly before, and therefore have no unrated and unknown words and maybe an audio file.

  • Gutenberg also contains many human and computer read audio books. We could, for such books, add a button which allows one to read and listen at the same time.

  • It might be easier to start with English texts and an English dictionary.

  • If this idea would get very popular, people might submit their own texts. These could, in some cases, be sent to either Project Gutenberg or Wikisource. More importantly, translations of words in the dictionary could be sent back to the dictionaries we use.
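
The TeX export mentioned in the remarks above could start out as small as the following sketch. It is only a rough prototype under assumptions: the function names `tex_escape` and `to_tex` are invented here, each word has a single translation string, and the output is a bare LaTeX article without any layout choices.

```python
import re

# Characters that have a special meaning in TeX and must be escaped.
TEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def tex_escape(text):
    """Escape TeX special characters one character at a time."""
    return "".join(TEX_SPECIALS.get(ch, ch) for ch in text)


def to_tex(text, ratings, translations, threshold):
    """Return a LaTeX document with translations of difficult words as footnotes."""
    parts = []
    # Split the text into runs of letters (words) and runs of everything else.
    for token in re.findall(r"[^\W\d_]+|[\W\d_]+", text):
        key = token.lower()
        parts.append(tex_escape(token))
        if ratings.get(key, 0) > threshold and key in translations:
            parts.append("\\footnote{" + tex_escape(translations[key]) + "}")
    body = "".join(parts)
    return (
        "\\documentclass{article}\n"
        "\\begin{document}\n"
        + body
        + "\n\\end{document}\n"
    )


print(to_tex(
    "The melancholy cat.",
    ratings={"melancholy": 8},
    translations={"melancholy": "zwaarmoedig"},
    threshold=5,
))
```

Running the resulting file through a TeX engine would then give the printable PDF version.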

A Road Map

In this section I want to propose a road map that will lead to a fruitful product in the end, but also gives us something working in the beginning.
  1. A version in the English language that exports to TeX.

  2. A web application that doesn't permit changing the dictionary, changing the threshold, or adding a rating, and doesn't chop up the text into pages.

  3. A web application that allows for changing the dictionary, the threshold, and the rating, chops up the text into pages, and adds a button to play an optional accompanying audio file from Project Gutenberg.


Several questions to get some input.
  1. Which dictionaries are suitable to use?

  2. Where can we find suitable online free texts?

  3. Does anyone have a different idea of the screen layout?

  4. What is the language in which the application should be written?

  5. Is there anything else we need to ask ourselves?



At 6:11 PM, Anonymous gmlk said...

A few preliminary notes:

The only long text that is more or less available in many languages seems to be the Bible. But any text could be used to seed the word list, which can then be extended with the data you collect from usage of this web app.

Use unicode.

What is a word? How is "a word" defined and recognized? How to recognize base words and words with related meanings?

Words have meaning which is context sensitive: Maybe store important surrounding words as well?
How to extract these?

Difficulty is mostly related to word length counted in syllables. Any internal dictionary should include the division into syllables of every word. An extra modifier value could be used to correct long words which are considered easy, although I doubt this would be needed.

Let users add translations/definitions to the dictionary. These additions are appended at the end; Other users may raise or lower the position of any entry.

Let users create links between definitions.

Dynamic page. AJAX.

I would advise against using TeX. You would be better off using a self-designed markup language that is optimized for this job. Consider Markdown or Textile as starting points.

Text is often divided into paragraphs or verses; chopping up should respect paragraphs.

I would choose ruby on rails for this application.

At 1:10 PM, Blogger Georg Muntingh said...

Now that is some valuable feedback!

@Bible: I do not see the advantage of a text being available in many languages? Or has it something to do with the dictionary? I would also say that the data collected from the user should be separated from the texts, except perhaps for the data that some text is thoroughly read before and therefore contains no unknown/unrated words: a showcase.

@Unicode: Absolutely.

@Word context: I don't want to think about this immediately. At first I want to start with just using words consisting of letters, not spaces or hyphens or anything of the like. This seems to be "good enough" to start with.

@Word difficulty in syllables: An eye-opener to me! This could be a very good candidate for an initial rating and seems much better than my previous idea of word frequency.

@Raising and lowering position of an entry: I'm not sure what you mean by this. The rating?

@Links: I'm not sure how exactly to implement this, but I think it is a great idea. Maybe the following. Notice that although there will be a lot of groups of connected words, the number of words in each group is very small (say < 10). Therefore I could just create for every word in a group a link to every other word in the group, giving an inequality #links < 10*#words + 1.

@TeX: The reason I thought about exporting to TeX is because it is so easy to generate PDF files without having to bother about the layout. I don't think it would be very difficult for ordinary texts: h1 tag --> \section{...} etc.

@Ruby on Rails: I thought about using the Google Web Toolkit, because they claim to simplify/avoid a lot of technicalities like browser incompatibilities.

At 3:39 PM, Anonymous Gerhard said...

Just generating an html with the rated words as acronyms would be very easy.

see: Acronyms defined in html

At 1:43 PM, Blogger Georg Muntingh said...

@Georg: Google Web Toolkit sounds like a bad idea. I've been working with Ruby on Rails for the last 24 hours, and Ruby deals even better with AJAX issues and is superior to Java.

@Gerhard: That sounds like a great idea for a proof of concept. For a final version I think it will be too cumbersome to move one's mouse to obtain a translation, instead of one's eyes.

At 7:34 PM, Anonymous Gerhard said...

You're probably right about usability. Explanations can be nicely displayed in the right margin using floating divs.


