Wikipedia is a huge source of content for anyone who needs large amounts of text data.
Wikipedia lets you download this data and provides database dumps and documentation at this link.
From my own experience, I can make the following statements:
- Parsing the wiki markup language (Wikicode or Wikitext) is painful. Many tools exist (Mylyn, etc.), but proprietary templates make substitution impossible.
- The best Wikitext parser is Wikipedia itself…
- Creating a local Wikipedia mirror is painful and poorly documented.
- Getting parsed articles from the Wikipedia website is not a problem at a rate as slow as 1 article per second, but fetching a large number of articles takes a very long time (at 1 second per article, 100,000 articles take about 28 hours, more than a whole day).
- Articles must be filtered, because Wikipedia contains everything, the best and the worst.
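To make the points about letting Wikipedia do the parsing and about download time concrete, here is a minimal Python sketch. It is not the WikiDump tool. It assumes the MediaWiki `action=parse` API on en.wikipedia.org as the way to ask Wikipedia for a rendered article, and the function names, the User-Agent string and the one-second delay are illustrative choices.

```python
import time
import urllib.parse
import urllib.request

# Assumption: the English Wikipedia API endpoint.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_parse_url(page_id):
    """Return the API URL asking Wikipedia to render one page (by page id) as HTML."""
    params = {
        "action": "parse",
        "pageid": str(page_id),
        "prop": "text",
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def estimated_hours(article_count, seconds_per_article=1.0):
    """Total download time in hours at a fixed rate per article."""
    return article_count * seconds_per_article / 3600.0


def fetch_all(page_ids, fetch=None, delay=1.0, sleep=time.sleep):
    """Fetch each page in turn, pausing `delay` seconds between requests.

    `fetch` and `sleep` can be replaced, which keeps the loop testable offline.
    """
    if fetch is None:
        def fetch(url):
            # Hypothetical User-Agent value, used here only for illustration.
            request = urllib.request.Request(
                url, headers={"User-Agent": "wiki-sketch/0.1 (example)"}
            )
            with urllib.request.urlopen(request) as response:
                return response.read()
    results = {}
    for index, page_id in enumerate(page_ids):
        if index > 0:
            sleep(delay)  # throttle: roughly one request per `delay` seconds
        results[page_id] = fetch(build_parse_url(page_id))
    return results
```

With these assumptions, `estimated_hours(100000)` gives a little under 28 hours, which is why the article list should be filtered before anything is downloaded.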
I was able to get data from Wikipedia in these few steps:
- Prerequisites:
  - MySQL server and tools
  - Java virtual machine (the directory containing java.exe must be in the PATH environment variable)
  - Custom Java tool WikiDump [ Sources - Compiled version ] (the compiled version includes a batch file that launches a demo; an exception is thrown, as expected, but it does not interrupt the process)
- Go to http://download.wikimedia.org/enwiki/latest/ to get the MySQL dumps:
  - enwiki-latest-category.sql.gz: contains the list of categories and their article counts.
  - enwiki-latest-categorylinks.sql.gz: links articles to categories.
- Import these scripts into a MySQL database (on Windows I used MySQL Administrator).
  Note: the process is long (many hours).
- Filter the wanted articles. First, execute the following query to list the available portals:
    SELECT * FROM category WHERE cat_pages > 0 AND cat_title LIKE 'Portal:%' ORDER BY cat_pages DESC;
  Once the portals are chosen (those to take and those to refuse), select the included articles by executing the following query (put your own portals here):
    SELECT cl_data_in.cl_from
    FROM (SELECT DISTINCT cl3.cl_from
          FROM categorylinks AS cl3
          WHERE cl3.cl_to IN ('Portal:take1', 'Portal:take2')) AS cl_data_in
    WHERE cl_data_in.cl_from NOT IN
          (SELECT DISTINCT cl2.cl_from
           FROM categorylinks AS cl2
           WHERE LOWER(cl2.cl_to) LIKE 'liste d%'
              OR cl2.cl_to IN ('Disambiguation_pages', 'Portal:refuse1'));
  Once the article list is filtered, export the result to a CSV file. The exact file format does not matter; the goal is to have one article id per line.
  Clean up the file: delete the column name and the empty lines.
- To launch the process that fetches the content, type on the command line:
java -jar WikiDump.jar article_file_path output_directory > log.txt
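
The article file passed to the tool must contain one article id per line. The cleanup described in the filtering step (dropping the column name and the empty lines) can be scripted instead of done by hand. The following Python sketch is one possible way to do it. The function names are hypothetical, and it assumes the export has a single column of numeric ids, possibly wrapped in double quotes.

```python
def clean_id_lines(lines):
    """Keep only the lines that hold a bare numeric article id."""
    ids = []
    for line in lines:
        # Drop surrounding whitespace, line endings and optional CSV quotes.
        value = line.strip().strip('"')
        # The column name (e.g. cl_from) and empty lines are not digits, so they are skipped.
        if value.isdigit():
            ids.append(value)
    return ids


def clean_id_file(source_path, target_path):
    """Rewrite an exported CSV as a plain file with one article id per line."""
    with open(source_path, encoding="utf-8") as source:
        ids = clean_id_lines(source)
    with open(target_path, "w", encoding="utf-8") as target:
        for article_id in ids:
            target.write(article_id + "\n")
    return len(ids)
```

Calling `clean_id_file("export.csv", "articles.txt")` (both file names are placeholders) writes the cleaned list, and `articles.txt` can then be given as `article_file_path` in the command above.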