Archive for the ‘misc’ Category

WikiDump: Tool to get Wikipedia content

Wednesday, June 30th, 2010

The WikiDump tool retrieves Wikipedia content from a list of article ids. The tool connects to http://en.wikipedia.org to get the text data.

For more information about getting Wikipedia content, see this post.

Here is the source code:


package com.devbypractice;

import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.StringReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class WikiDump
{
 /**
 * Get the XHTML content of a Wikipedia page as a String
 *
 * @param sUrlSource Url Source
 * @param sEncoding Encoding
 * @return xhtml content
 * @throws Exception
 */
 public static String dumpWikiXhtmlToString(String sUrlSource, String sEncoding)
 throws Exception
 {
 String inputLine;

 URL url;
 url = new URL(sUrlSource);
 URLConnection urlConn = url.openConnection();

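 // Connect and read timeouts: 30 seconds each,
 // so that a stalled response cannot block the whole run
 //
 urlConn.setConnectTimeout(30000);
 urlConn.setReadTimeout(30000);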
 BufferedReader in = new BufferedReader(new InputStreamReader(
 urlConn.getInputStream(), sEncoding));

 StringBuilder sb = new StringBuilder();

 // XML header
 //
 sb.append("<?xml version=\"1.0\"?>");
 sb.append("<WikiContent>");

 // Copy response to buffer
 //
 while ((inputLine = in.readLine()) != null) {
 sb.append(inputLine);
 sb.append("\n");
 }
 in.close();

 // XML footer
 //
 sb.append("</WikiContent>");

 return sb.toString();
 }

 /**
 * Load article ids from a file. The file must have one id per line.
 *
 * @param sFileName Path of the article id file
 * @return Article ids
 * @throws Exception
 */
 public static List<Integer> loadArticleIdsFromFile(String sFileName)
 throws Exception
 {
 List<Integer> articleIds = new ArrayList<Integer>();

 FileReader reader = new FileReader(sFileName);
 BufferedReader buffReader = new BufferedReader(reader);

 String sLine = "";

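 while ((sLine = buffReader.readLine()) != null) {
 // Skip blank lines so that a stray empty line does not abort the load
 sLine = sLine.trim();
 if (sLine.length() == 0) {
 continue;
 }
 Integer id = Integer.parseInt(sLine);
 articleIds.add(id);
 }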

 buffReader.close();

 return articleIds;

 }

 /**
 * Display usage
 */
 public static void displayUsage()
 {
 System.out.println("Arguments:");
 System.out.println("<Article id source file path> <Output directory>");
 System.out.println("");
 System.out.println("Notes:");
 System.out.println("Article id file must have an id on each line. ");
 System.out.println("Output directory must exist.");

 }

 /**
 * Text cleanup. References [12], [3] etc... are deleted.
 *
 * @param text Text to clean
 * @return Cleaned text
 */
 public static String cleanup(String text)
 {
 return text.replaceAll("\\[[0-9]+\\]", "");
 }

 /**
 * Main entry
 *
 * @param args Arguments
 */
 public static void main(String[] args)
 {
 try {

 // Limit of UTF-16 chars per output file
 // 1000000 chars ~= 2 MB
 //
 final int fileSizeLimit = 1000000;

 // Current output file size
 // Set to the limit + 1 so that the first file is created on the first iteration
 //
 int currentFileSize = fileSizeLimit + 1;

 // Url to get online Wikipedia article by id
 //
 String sWikiUrlFormat = "http://en.wikipedia.org/w/index.php?action=render&curid=%d";

 FileOutputStream outStream = null;
 OutputStreamWriter writer = null;

 if (args.length != 2) {
 displayUsage();

 }
 else {

 // Get article ids from file
 //
 List<Integer> articleIds = loadArticleIdsFromFile(args[0]);

 // Output dir
 //
 String sDestinationDir = args[1];

 // For each article id
 //
 for (Integer articleId : articleIds) {

 try {

 if (currentFileSize > fileSizeLimit) {
 if (writer != null) {
 writer.close();
 }
 // Use the platform path separator instead of a hard-coded backslash
 outStream = new FileOutputStream(sDestinationDir + java.io.File.separator
 + articleId + ".txt");
 writer = new OutputStreamWriter(outStream, "UTF-16");

 currentFileSize = 0;

 }

 // Build article url
 //
 String sWikiUrl = String.format(sWikiUrlFormat, articleId);

 // Get article content
 //
 String sWiki = dumpWikiXhtmlToString(sWikiUrl, "UTF-8");

 // Load Xhtml content
 //
 InputSource is = new InputSource(new StringReader(sWiki));
 Document doc = DocumentBuilderFactory.newInstance()
 .newDocumentBuilder()
 .parse(is);

 // writer.write("========= " + articleId + " ==========\n");

 // Target desired wikipedia content
 //
 NodeList nl = XPathProcessor.getInstance().EvaluateMutli(
 "/WikiContent/p", doc);

 // For each node
 //
 for (int i = 0; i < nl.getLength(); i++) {
 String sContent = "";
 try {
 Node node = nl.item(i);

 // get text content
 //
 sContent = node.getTextContent();

 // clean text
 //
 sContent = cleanup(sContent);

 // Write in output file
 //
 writer.write(sContent);
 writer.write("\n\n");
 writer.flush();

 // Increase output file size count
 //
 currentFileSize += sContent.length() + 2;
 }
 catch (Exception e2) {
 String sLog = "Could not get paragraph " + i
 + " for id=" + articleId;
 System.out.println(sLog);
 }
 }
 }
 catch (Exception e3) {
 String sLog = "Could not get article for id = " + articleId
 + "\n";
 System.out.println(sLog);

 e3.printStackTrace();
 }

 }
 // writer is still null when the id file is empty
 if (writer != null) {
 writer.close();
 }
 }
 }

 catch (Exception e) {
 e.printStackTrace();

 }

 }

}
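
To see what the cleanup step does, here is a small standalone sketch that applies the same regular expression to a sample sentence. The class name CleanupDemo and the sample text are only for illustration.

```java
public class CleanupDemo {

    // Same regular expression as WikiDump.cleanup:
    // removes numeric reference markers such as [1] or [23].
    public static String cleanup(String text) {
        return text.replaceAll("\\[[0-9]+\\]", "");
    }

    public static void main(String[] args) {
        String sample = "Paris[1] is the capital of France.[23]";
        // prints: Paris is the capital of France.
        System.out.println(cleanup(sample));
    }
}
```

Only markers made of digits are removed; a marker such as [citation needed] is left in the text.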

To compile the source, you will need an XPath query manager, available at this link.
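
If that XPath query manager is not at hand, the standard javax.xml.xpath API can evaluate the same expression. The sketch below is not the XPathProcessor class used above, only an assumption about what such a helper does; the names XPathSketch, evaluateMulti and parse are hypothetical.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSketch {

    // Evaluate an XPath expression and return all matching nodes.
    public static NodeList evaluateMulti(String expression, Document doc) throws Exception {
        return (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
    }

    // Parse an XML string into a DOM document.
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<WikiContent><p>First</p>"
                + "<div><p>Nested</p></div><p>Second</p></WikiContent>");

        // Only <p> elements directly under <WikiContent> are selected.
        NodeList nl = evaluateMulti("/WikiContent/p", doc);
        for (int i = 0; i < nl.getLength(); i++) {
            System.out.println(nl.item(i).getTextContent());
        }
    }
}
```

With the expression "/WikiContent/p", paragraphs nested deeper in the page (inside a div or a table, for example) are not selected.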

A compiled version of WikiDump can be downloaded here.

Get Wikipedia content

Tuesday, June 29th, 2010

Wikipedia is a huge source of content for people who need big text data.

Wikipedia allows this data to be downloaded and provides database dumps and documentation at this link.

From my own experience, I can state the following:

  • Parsing the wiki markup language (Wikicode or Wikitext) is painful. Many tools exist (Mylyn, etc.), but proprietary templates make substitutions impossible.
  • The best Wikitext parser is Wikipedia itself…
  • Creating a local Wikipedia mirror is painful and poorly documented.
  • Getting parsed articles from the Wikipedia website is not a problem at a rate as slow as 1 article per second, but getting a large number of articles takes a very long time (at 1 article per second, 100,000 articles take about 28 hours, more than a whole day).
  • Articles must be filtered because everything is in Wikipedia, the best and the worst.
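
The duration estimate in the list above is simple arithmetic. As a rough sketch (the class and method names are hypothetical), it can be computed for any article count and rate:

```java
import java.util.Locale;

public class DownloadEstimate {

    // Total download time in hours for a given number of articles
    // and an average number of seconds spent per article.
    public static double estimateHours(long articleCount, double secondsPerArticle) {
        return articleCount * secondsPerArticle / 3600.0;
    }

    public static void main(String[] args) {
        // 100000 articles at 1 second each: prints "27.8 hours"
        System.out.printf(Locale.ROOT, "%.1f hours%n", estimateHours(100000, 1.0));
    }
}
```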

I was able to get data from Wikipedia in these few steps:

  1. Prerequisites
    MySQL server and tools
    Java virtual machine (the path of java.exe must be in the PATH environment variable)
    Custom Java tool WikiDump [ Sources - Compiled version ] (the compiled version provides a batch file that launches a demo; an exception is thrown, as expected, but it does not interrupt the process)
  2. Go to the website http://download.wikimedia.org/enwiki/latest/ to get the MySQL dumps:
    enwiki-latest-category.sql.gz
    : Contains the list of categories and their article counts.
    enwiki-latest-categorylinks.sql.gz
    : Makes the link between articles and categories.
  3. Import these scripts into a MySQL database (on Windows I used MySQL Administrator).
    Note: The import takes a long time (many hours).
  4. Filter the wanted articles. First execute the following query to list the available portals:

    SELECT * FROM category WHERE cat_pages > 0 AND cat_title LIKE "Portal:%" ORDER BY cat_pages DESC

    Once the portals are chosen (those to take and those to refuse), select the articles they include by executing the following query (put your own portals here):

    SELECT cl_data_in.cl_from FROM ((SELECT DISTINCT cl3.cl_from FROM categorylinks AS cl3 WHERE cl3.cl_to IN ("Portal:take1","Portal:take2")) AS cl_data_in)
    WHERE cl_data_in.cl_from NOT IN (SELECT DISTINCT cl2.cl_from FROM categorylinks AS cl2 WHERE LOWER(cl2.cl_to) LIKE "liste d%" OR cl2.cl_to IN ("Disambiguation_pages","Portal:refuse1"))

    Once the article list is filtered, export the result to a CSV file. The exact file format does not matter; the goal is to have one article id per line.

    Clean up the file: delete the column name and any empty lines.

  5. To launch the process that gets the content, type on the command line:
    java -jar WikiDump.jar article_file_path output_directory > log.txt
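
The file cleanup described in step 4 can also be automated. The following is a minimal sketch, not part of WikiDump: the class name ArticleIdFileCleaner and the method cleanLines are hypothetical, and it assumes the export contains a single column of ids, possibly wrapped in double quotes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class ArticleIdFileCleaner {

    // Keep only the lines that hold a numeric id.
    // The column name, empty lines and anything non-numeric are dropped.
    public static List<String> cleanLines(List<String> lines) {
        List<String> ids = new ArrayList<String>();
        for (String line : lines) {
            // Remove surrounding whitespace and CSV double quotes
            String candidate = line.trim().replace("\"", "");
            if (candidate.matches("[0-9]+")) {
                ids.add(candidate);
            }
        }
        return ids;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("Arguments: <exported csv path> <cleaned id file path>");
            return;
        }
        List<String> raw = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        Files.write(Paths.get(args[1]), cleanLines(raw), StandardCharsets.UTF_8);
    }
}
```

The cleaned file can then be passed as the first argument of WikiDump.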

Older JVM : JRE, JDK and more Java archives

Tuesday, May 11th, 2010

See the post in French.

Older JVMs (JRE, JDK and more) can be found at:

http://java.sun.com/products/archive/

It’s a real museum of antiquities and a must-have bookmark for reproducing customers’ Java configurations.