York team takes top two spots in world programming competition

A team of York graduate students took on the world and won in competitions held for the 2005 Text Retrieval Conference (TREC) organized by the US National Institute of Standards and Technology (NIST).


Competing in the Genomics Track category, the York team’s two entries in the data retrieval programming competition took the top two places, ahead of 62 other submissions from around the world. The results were announced at the conference, held in November 2005. To win the prizes, graduate students Ming Zhong, Miao Wen, Mladen Kovacevic, Yan Huang and Kai Zheng of York’s Faculty of Science & Engineering had to design a program to handle 50 queries against a data set of more than 4.5 million records. The data was taken from the world’s most comprehensive source of life sciences and biomedical bibliographic information, and the students did it without the aid of a biomedical science background.


“No one in our team has a biomedical background,” said faculty adviser Jimmy Huang, a professor in Atkinson’s School of Analytic Studies & Information Technology and the Graduate Program in Computer Science. “Those documents in the test set were not understandable to us. Nevertheless, we did very well with our knowledge on information retrieval and computer science in general.”


“Retrieving information by searching huge amounts of data is an extremely important task in many different spheres – witness the Internet as just one such context,” said Peter Cribb, Chair of the Department of Computer Science & Engineering in York’s Faculty of Science & Engineering. “York, through our Department of Computer Science & Engineering and through our Information Technology Program, has some world-leading expertise in this field and we are very proud of the accomplishments of these graduate students and Prof. Huang.”


The records used in the competition are a subset of the full MEDLINE database, which is the world’s most comprehensive source of life sciences and biomedical bibliographic information, compiled by the US National Library of Medicine (NLM). A more detailed description of the data used in TREC can be found at http://ir.ohsu.edu/genomics/2005data.html.


For each TREC competition, NIST provided a test set of documents and queries. Participants ran their own retrieval systems on the data and returned to NIST a list of the retrieved top-ranked documents for each query. NIST pooled the individual results, judged the retrieved documents for correctness, and evaluated the results.
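The pooling step described above can be illustrated with a small Python sketch. It is only a sketch under stated assumptions: the function names, the pool depth and the toy document IDs are hypothetical, and precision at k is used here as a simple stand-in for whatever measures NIST actually reported.

```python
from collections import defaultdict


def build_pools(runs, depth):
    """Union the top-`depth` documents of every submitted run, per query.

    `runs` is a list of dicts mapping a query ID to a ranked list of
    document IDs (a hypothetical in-memory format, not NIST's file format).
    """
    pools = defaultdict(set)
    for run in runs:
        for query_id, ranked_docs in run.items():
            pools[query_id].update(ranked_docs[:depth])
    return dict(pools)


def precision_at_k(ranked_docs, relevant, k):
    """Fraction of the top-k positions filled by documents judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_docs[:k] if doc_id in relevant)
    return hits / k


# Two toy runs for a single query.
run_a = {"q1": ["d1", "d2", "d3"]}
run_b = {"q1": ["d2", "d4", "d5"]}

# Only pooled documents are sent to the human assessors.
pools = build_pools([run_a, run_b], depth=2)

# Suppose the assessors judged d2 and d4 relevant (made-up judgments).
judged_relevant = {"d2", "d4"}
score_a = precision_at_k(run_a["q1"], judged_relevant, k=2)
```

Pooling keeps the judging workload manageable: assessors read only documents that at least one system ranked highly, rather than the whole collection.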


The York team’s method is an extension of the Okapi retrieval system originally developed at City University in London, UK, where Huang was a member of the research team during his doctoral studies. The York team designed new algorithms and methods for extending biomedical query terms, building a dual index, weighting terms, and adjusting the retrieval parameters. “These methods and algorithms work very well,” said Huang.
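The article does not give the team's formulas, so the following Python sketch shows only the well-known BM25 term weighting associated with Okapi, not the York extensions. The parameter values `k1=1.2` and `b=0.75` are common defaults rather than the team's settings, the idf term uses a common non-negative variant, and the toy documents are invented for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with BM25.

    `docs` is a list of token lists. `k1` controls term-frequency
    saturation and `b` controls document-length normalization.
    """
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            n_t = doc_freq[term]
            # Non-negative idf variant (adds 1 inside the logarithm).
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Toy collection, already tokenized.
docs = [
    ["gene", "expression", "in", "mice"],
    ["stock", "market", "news"],
    ["gene", "gene", "therapy"],
]
scores = bm25_scores(["gene"], docs)

# Document indices ordered from best to worst match.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Adjusting retrieval parameters, in a system of this kind, typically means tuning values such as `k1` and `b` against queries with known relevant documents.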


The initial set-up work was done over the summer, before the competition data sets and queries were released. Once they were available, the York team had a few weeks to work on the task and submitted its retrieval results to NIST by the due date.


At the TREC 2004 competition, the York team achieved the best result in the HARD Track (High Accuracy Retrieval from Documents) for passage-level retrieval, ranking first among 136 submissions from all over the world. The 2004 team included three of the 2005 team members: Yan Huang, Miao Wen and Ming Zhong.