Abstract
With the spread and development of the Internet over the past few decades, more and more electronic documents have appeared online. Numerous product specifications are available on the Internet, e.g. in the form of web pages or PDFs. This dissertation helps a company automatically extract products, product specifications, and product restrictions from web sites. We study the definition of the product named entity, the construction of the corpus, and the recognition techniques. The work covers the following aspects:
1. After studying many product names in web pages, we define the various compositions of a product named entity. Based on this definition, we develop a rule for corpus annotation. We then create a product named entity corpus using a semi-supervised method.
2. Based on the features of product names, we divide their recognition into two phases. The first phase detects the brand name, the series name, and the type of a product. The second phase recognises the full product name based on the results of the first. Many methods can be used for recognition in these two phases. In this work we discuss the hidden Markov model, the maximum entropy model, and the conditional random field model. After comparing these three models, we choose the conditional random field model for the recognition.
3. Once the product names have been detected, the products, the product features, and the restrictions between products are extracted.
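
The two-phase recognition described in aspect 2 can be illustrated with a minimal sketch. It is not the dissertation's implementation: a toy lexicon lookup stands in for the trained conditional random field tagger of the first phase, and the second phase is reduced to merging adjacent component tokens. The names `COMPONENT_LEXICON`, `tag_components`, and `group_product_names`, as well as the product "Acme Falcon X200", are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# ASSUMPTION: a toy lexicon stands in for the trained phase-one CRF tagger.
# All entries are invented; they are not taken from the dissertation's corpus.
COMPONENT_LEXICON: Dict[str, str] = {
    "acme": "BRAND",
    "falcon": "SERIES",
    "x200": "TYPE",
}


def tag_components(tokens: List[str]) -> List[str]:
    """Phase 1: label each token as BRAND, SERIES, TYPE, or O (outside)."""
    return [COMPONENT_LEXICON.get(tok.lower(), "O") for tok in tokens]


def group_product_names(
    tokens: List[str], labels: List[str]
) -> List[Tuple[int, int, str]]:
    """Phase 2 (simplified): merge each maximal run of component tokens
    into one product-name span (start, end_exclusive, text)."""
    spans: List[Tuple[int, int, str]] = []
    start = None
    # A trailing "O" sentinel closes a run that reaches the end of the sentence.
    for i, label in enumerate(labels + ["O"]):
        if label != "O" and start is None:
            start = i
        elif label == "O" and start is not None:
            spans.append((start, i, " ".join(tokens[start:i])))
            start = None
    return spans


tokens = "The Acme Falcon X200 supports this adapter".split()
labels = tag_components(tokens)
product_names = group_product_names(tokens, labels)
```

In a real system both phases would be learned sequence labellers, with the phase-one labels supplied as features to the phase-two model; the run-merging rule here only shows how the output of the first phase feeds the second.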