- Download and install the latest version of TeXoo
- Run `mvn clean install` to install the required dependencies.
- Create the PostgreSQL database based on the schema in
The dataset generation consists of multiple steps, each implemented as a separate Java program.
For the classifier training, run the script

/src/main/java/classifier/TrainingClassifier.java "your classifier output directory"

It will download the required documents from the database and start the training.
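Since every step in this pipeline is a Java main class invoked with positional arguments, a small shell helper can print the corresponding Maven `exec:java` command. This is only a sketch: it assumes the project is built with Maven, and the helper name and output directory are placeholders.

```shell
# Hypothetical helper: prints the Maven exec:java command for a pipeline
# class; run the printed command from the project root (or pipe it to sh).
run_step() {
  local main_class="$1"; shift
  printf 'mvn exec:java -Dexec.mainClass="%s" -Dexec.args="%s"\n' \
    "$main_class" "$*"
}

# Example: the classifier training step with a placeholder output directory.
run_step classifier.TrainingClassifier /data/classifier-output
```

The same helper works for the later export and matching steps by swapping in the class name and its arguments.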
After the classifier has been trained, it can be applied to all PubMed documents in the database. Run
To export the articles marked as relevant, run the script

/src/main/java/exporter/XMLExporter.java "testdata-directory" "traindata-directory"

This will export the XML documents of the test dataset and the 50,000 most relevant documents into the given directories.
The XML data now has to be transformed into TeXoo-JSON. To do this, run
src/main/java/cleaner/Exporter.java "path to the xml-test-documents" "path to the xml-train-documents" "testdata-output-file" "traindata-output-file"
Finally, run

src/main/java/matcher/WikiSectionMatcher.java "testdataset-path" "traindataset-path" "testdataset-output-path" "traindataset-output-path"

to find the corresponding WikiSection labels for each section and to save the final dataset.
Only a subset of 2,200 documents is used for SentEval training, so we randomly sample documents from the large dataset. The sampled dataset also has to be pretokenized with the TeXoo tokenizer. This can be done with
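As an illustration of the sampling step only (not the tokenization), the random draw of 2,200 documents can be sketched with standard tools; the file names are placeholders, and one JSON document per line is assumed.

```shell
# Sketch of the random sampling step only; traindata.json stands in for
# the exported train set, with one document per line assumed.
seq 1 50000 > traindata.json                        # dummy stand-in data
shuf -n 2200 traindata.json > senteval-sample.json  # random 2,200-doc sample
```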