User:Proteins/Writing scripts for Wikipedia: Difference between revisions

From Wikipedia, the free encyclopedia
Revision as of 14:47, 30 October 2008

Scripts are amazing. They give you nearly unlimited power to analyze Wikipedia articles, to modify their appearance and even to add new elements. For example, you can count the number of polysyllabic words (analysis), color the words according to their number of syllables (modification) and create interactive dialogs for the reader (addition). In general, scripts do not affect the underlying article, the one stored in the database, so multiple people can view the same article according to their own preferences by using different scripts.

Scripts are also not hard to write! You need to know some HTML tags, and you have to learn how browsers represent the HTML internally, in a so-called DOM tree. Once you've learned those things, and mastered a few commands in JavaScript, you can do anything. There's already a WikiProject devoted to scripts, but this essay was written in the hope that Wikipedians might appreciate a slightly simpler introduction to scripts. Please let me know if something is unclear or incorrect.

How to access the DOM tree in your browser

The DOM tree is created by the browser, and most browsers allow you to see it. The following instructions should allow you to see it in different browsers:

  • In Firefox 3, the best approach is to download an add-on known as "DOM Inspector". Once added, it should appear under the "Tools" menu in the top bar of the browser, which is next to the "Bookmarks" menu. DOM Inspector can also be activated using the key combination Ctrl-Shift-I.
  • In Google Chrome, right-clicking on any part of the page summons a menu. At the bottom of that menu is the choice "Inspect element", which shows the position of the element in the DOM tree.
  • In Internet Explorer 7, the Internet Explorer Developer Toolbar, a free download from Microsoft, is used to show the DOM tree. This toolbar can be found at the far right, behind the double arrows that are to the right of the "Tools" menu, which is itself to the right of the "Page" menu.
  • In Safari, click on the "Develop" menu and select the choice "Show Web Inspector". The Develop menu is located in the topmost menu bar, between the "Bookmarks" and "Window" menus. If the Develop menu is not there, click on the "Edit" menu and select its last element, "Preferences". A window will pop up, on which you choose the last tab, labeled "Advanced". At the bottom of the Advanced screen is a checkbox labeled "Show Develop menu in menu bar." Clicking this checkbox should introduce the Develop menu in the menu bar.
  • In Opera, the equivalent DOM inspector can be turned on by clicking on the "Tools" menu in the top menu bar (sandwiched between the "Widgets" and "Help" menus). Under the Tools menu, click on the "Advanced" submenu, and from the resulting sub-sub-menu, choose "Developer Tools". This should turn on an analysis system at the bottom of the screen, which incidentally can also be detached into a window of its own. Within this analysis window, clicking on the "DOM" tab should reveal the DOM tree. One drawback of this inspector seems to be that it does not reveal the changes in the DOM tree after your script has run. Instead, it reloads the webpage afresh, always showing the original unmodified DOM tree.

The DOM tree of typical Wikipedia pages

Inspecting the DOM tree of Wikipedia articles will reveal a common architecture. The main content of the article is contained inside a DIV element with the id label "bodyContent"; to reach this crucial node, however, you need to drill down a few levels. The bodyContent node is found under the "content" node, which in turn is under the "column-content" node, which in turn is under the "globalWrapper" node, which in turn is under the standard BODY node, which is under the HTML node, which is under the "document" node, the top of the DOM tree. Thus, to reach bodyContent, you need to follow this sequence of child nodes (sometimes called a "trail" through the document, or an XPath):

document → HTML → BODY → globalWrapper → column-content → content → bodyContent
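The trail above can also be recovered by a script instead of an inspector. The following is only a minimal sketch: the function name trailTo is made up for this illustration, and it assumes the page still uses the ids described above (other skins or later versions of MediaWiki may differ). It starts at a node and climbs upward through its parentNode links, labeling each level by its id if it has one and by its node name otherwise.

```javascript
// Hypothetical helper (not part of MediaWiki): walk upward from a node to
// the top of the DOM tree and return the trail as a readable string.
function trailTo(node) {
  var trail = [];
  for (var current = node; current; current = current.parentNode) {
    // Use the id when the node has one; otherwise fall back to the node
    // name (for example "BODY", "HTML" or "#document").
    trail.unshift(current.id || current.nodeName);
  }
  return trail.join(" → ");
}

// In a browser, print the trail to bodyContent if such a node exists.
// The id "bodyContent" is taken from the text above and may not be
// present on every page or skin.
if (typeof document !== "undefined") {
  var bodyContentNode = document.getElementById("bodyContent");
  if (bodyContentNode) {
    console.log(trailTo(bodyContentNode));
  }
}
```

Note that the browser reports the top node under the name "#document", so the first entry of the printed trail will look slightly different from the one written above.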

Why are so many levels necessary before getting to the main article? The MediaWiki software uses these other levels to add all the extra decorations found on the page. For example, the user commands along the upper edge at the right, such as your user name, your user talk page, your preferences, etc., are found under the "column-one" node, which is the sibling node of the "column-content" node. So are the tabs at the top of the page, such as "article", "talk", "edit this page", etc., as well as the menus for navigation, search, interaction and toolbox in the left-hand column. By placing these in a separate node, they can be located and manipulated independently of the content.
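To see the sibling relationship for yourself, a script can list the siblings of a node. This is a rough sketch under the same assumptions as the text (the ids "column-content" and "column-one" come from the description above); the function name siblingIds is invented for this example.

```javascript
// Hypothetical helper: return the ids of the element siblings of a node,
// skipping the node itself and any sibling that has no id.
function siblingIds(node) {
  var ids = [];
  var parent = node.parentNode;
  if (!parent) {
    return ids; // the top of the tree has no siblings
  }
  for (var i = 0; i < parent.childNodes.length; i++) {
    var child = parent.childNodes[i];
    // nodeType 1 marks an element; text nodes and comments are skipped.
    if (child !== node && child.nodeType === 1 && child.id) {
      ids.push(child.id);
    }
  }
  return ids;
}

// In a browser, list the siblings of column-content if that node exists.
if (typeof document !== "undefined") {
  var columnContentNode = document.getElementById("column-content");
  if (columnContentNode) {
    console.log(siblingIds(columnContentNode));
  }
}
```

On a page laid out as described, the printed list would be expected to include "column-one", although the exact set of siblings depends on the skin in use.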

Looking inside the bodyContent node using a DOM inspector reveals all the HTML code that makes up the article. For example, typical section headings are contained under H2 nodes, whereas successive levels of subsections are contained under H3, H4 and H5 nodes. Normal text is contained in paragraph nodes labeled "P". Unordered (that is, bullet-pointed) lists and ordered (that is, numbered) lists are contained under UL and OL nodes, respectively; individual items in both cases are contained under LI (list item) nodes. Indented text, such as that found in discussions, is also represented as a list; such a list is labeled DL, and the indented text is contained under a DD node. In some cases, a DL list is actually a definition list, one that has defined, boldfaced terms contained under DT nodes; these terms are generated using an initial semicolon in wiki-markup. Larger-scale groupings of HTML nodes can be made using DIV and SPAN tags.
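A script can take the same inventory that the inspector shows. The sketch below is one possible way to do it, not the only one: it visits every element beneath a starting node and counts how often each tag name occurs. The function name countTags is hypothetical, and the id "bodyContent" is again assumed from the description above.

```javascript
// Hypothetical helper: count how many elements of each tag name (H2, P,
// UL, LI, DL, DD, ...) appear beneath the given root node.
function countTags(root, counts) {
  counts = counts || {};
  var children = root.childNodes || [];
  for (var i = 0; i < children.length; i++) {
    var child = children[i];
    // Only element nodes (nodeType 1) have tag names worth counting.
    if (child.nodeType === 1) {
      var name = child.nodeName.toUpperCase();
      counts[name] = (counts[name] || 0) + 1;
      countTags(child, counts); // descend into this element's children
    }
  }
  return counts;
}

// In a browser, report the tag counts of the article body if it exists.
if (typeof document !== "undefined") {
  var articleNode = document.getElementById("bodyContent");
  if (articleNode) {
    console.log(countTags(articleNode));
  }
}
```

The result is a plain object such as one with a "P" entry for paragraphs and an "H2" entry for section headings, which gives a quick overview of how an article is built.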