MIS 424 - E-Commerce Systems Management Spring 2009

Last updated: 10/7/2009 2:00:26 PM

Assignment 3 - Web services, screen scraping

1. SearchProducts.asmx - This web service allows users to query the products in your RetailStore database. The web service uses the same stored procedure as Search.aspx.

Steps:

  1. In the solution for your RetailStore add a new item "Web service." Name it SearchProducts.asmx.
  2. Create a new WebMethod "SearchProducts" that uses a string parameter named "query" and returns a DataSet.
  3. The next step is to use the stored procedure to fill a dataset. The class handout GetTitlesSP.cs.txt illustrates how to programmatically call a stored procedure, pass in a parameter and retrieve the data. 
  4. The sample uses a stored procedure named "GetTitlesAZ" which you will need to replace with the name of your stored procedure. The sample also uses a parameter named "strQuery" which you will need to change to the name of your parameter.
  5. The web service needs to return a DataSet rather than a DataTable. DataTables cannot be returned via a WebMethod. See the class handout Web Services (.doc) to see how to add a DataTable to a DataSet.
  6. Absolute image path: Users of your web service may want to display your product images, so they need the full path to each image. After populating the DataSet, iterate through it and change each image name to an absolute image path. The code looks something like this:

    foreach (DataRow dr in ds.Tables[0].Rows)
    {
        dr["ImageName"] = "http://yorktown.cbe.wwu.edu/Sandvig/mis424/Assignments/RetailStore/productImages/" + dr["ImageName"] + ".thumb.jpg";
    }
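
Steps 5 and 6 can be sketched together as follows. This is only a minimal illustration, not the class handout's code: the class name, the helper names, the sample row and the example.com base URL are all hypothetical, and an in-memory DataTable stands in for the one your stored procedure would fill.

```csharp
using System;
using System.Data;

public static class ProductDataSetSketch
{
    // Hypothetical base URL, for illustration only.
    public const string ImageBaseUrl = "http://www.example.com/productImages/";

    // Wraps a DataTable in a DataSet so that a WebMethod can return it.
    // The table must not already belong to another DataSet.
    public static DataSet WrapInDataSet(DataTable table)
    {
        DataSet ds = new DataSet("Products");
        ds.Tables.Add(table);
        return ds;
    }

    // Rewrites each image name in the first table as an absolute URL.
    public static void MakeImagePathsAbsolute(DataSet ds, string baseUrl)
    {
        foreach (DataRow dr in ds.Tables[0].Rows)
        {
            dr["ImageName"] = baseUrl + dr["ImageName"] + ".thumb.jpg";
        }
    }

    public static void Main()
    {
        // Stand-in for the table that the stored procedure would populate.
        DataTable table = new DataTable("SearchResults");
        table.Columns.Add("ProductName", typeof(string));
        table.Columns.Add("ImageName", typeof(string));
        table.Rows.Add("Sample product", "widget01");

        DataSet ds = WrapInDataSet(table);
        MakeImagePathsAbsolute(ds, ImageBaseUrl);

        // Prints http://www.example.com/productImages/widget01.thumb.jpg
        Console.WriteLine(ds.Tables[0].Rows[0]["ImageName"]);
    }
}
```

In the web service itself, the body of Main would move into the WebMethod, with the sample table replaced by the results of your stored procedure.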

2. ProductSearchWS.aspx - This page consumes the web service from the previous exercise and displays the results. Steps:

  1. To consume the web service created in the previous exercise it needs to be located on a public web server. Copy it to your Yorktown account and test it.
  2. Use Visual Studio to create proxy classes that represent the web service. In Visual Web Developer (VWD), right-click on the root folder (C:\...) and select "Add Web Reference." Paste the URL for your web service (on Yorktown) into the dialog box and click "Go." Name it "SearchProductsWS" and click "Add Reference."
  3. The previous step created classes that are proxies for the remote web service. The web service is now part of your application's class library and can be used like any other class.
  4. Add a new page to your RetailStore named ProductSearchWS.aspx. Add a textbox, button and DataList control. You can copy the ItemTemplate from default.aspx (remove the formatting for the image since the full path will be provided).
  5. Populating the DataList: The DataList control may be populated either programmatically using databinding or with a DataSource control. The following instructions are for using a DataSource control.
    1. Click on the DataList's smart tag and select "Choose data source."
    2. Choose data source type "Object" and click OK.
    3. From the drop-down list of classes in your application, select "SearchProductsWS.SearchProducts" and click OK.
    4. Next, configure the input and output parameters. In the "Define Data Methods" box for "Select" choose "Search(String strQuery), return DataSet". The name may be slightly different, depending upon how you named your web service.
    5. Configure it to retrieve the query from the search textbox. Provide a gibberish default value that will not match any products. This stops the page from throwing an error when it initially loads and the parameter is null.
  6. Test it. When you copy the page to Yorktown you will also need to copy the proxy "SearchProductsWS" from the App_WebReferences folder.

Screen Scraping

In some situations the data that we need (or want) is not available via web services or an RSS feed. In these situations screen scraping may be used to programmatically retrieve information from a web page. Search engines, meta-shopping sites, spammers and others use screen scraping to collect information from web sites.

Screen scraping has many legitimate uses, including transferring information between servers, communicating with legacy systems, creating indexes and communicating between incompatible technologies. The class web site, for example, uses screen scraping to display the class roster, which for security reasons is located on a different server (the source page is on Saratoga and the scraped page is displayed on Yorktown).

Screen scraping is a two-step process:

  1. Retrieve the desired web page from a remote server.
  2. Parse the returned HTML to "scrape" out the relevant data. Capturing the desired data is typically achieved using regular expressions.
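
The two steps can be sketched in C# roughly as follows. This is a minimal illustration rather than the ScreenScrape.aspx sample: the class and method names are hypothetical, the regular expression assumes simple anchor tags with a double-quoted href as the first attribute, and Main parses a made-up HTML string instead of contacting a real server. WebClient is used for the download because it was the common choice in .NET of this era.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class ScrapeSketch
{
    // Step 1: retrieve the page HTML from a remote server.
    public static string DownloadPage(string url)
    {
        using (WebClient client = new WebClient())
        {
            return client.DownloadString(url);
        }
    }

    // Step 2: scrape out each link's URL and text with a regular expression.
    // Each result is a two-element array: { url, link text }.
    public static List<string[]> ExtractLinks(string html)
    {
        Regex linkPattern = new Regex(
            "<a\\s+href=\"(?<url>[^\"]+)\"[^>]*>(?<text>.*?)</a>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        List<string[]> links = new List<string[]>();
        foreach (Match m in linkPattern.Matches(html))
        {
            links.Add(new string[] { m.Groups["url"].Value, m.Groups["text"].Value.Trim() });
        }
        return links;
    }

    public static void Main()
    {
        // Hypothetical HTML standing in for a downloaded page.
        string html = "<ul><li><a href=\"/news/1\">First story</a></li>" +
                      "<li><a href=\"/news/2\">Second story</a></li></ul>";

        foreach (string[] link in ExtractLinks(html))
        {
            Console.WriteLine(link[0] + " -> " + link[1]);
        }
    }
}
```

To scrape a live page, pass the result of DownloadPage to ExtractLinks in place of the sample string, after checking the site's usage policy.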

Collecting data from another party's web site has the potential for violating copyright and other property laws. Many web sites publish their usage policies on their sites. You should read these policies before collecting information from a web site. The legal issues associated with using information obtained from web sites are complex and beyond the scope of this class.
 

3. ScrapeWWUNews.aspx - This page scrapes WWU news and displays the links.

Copy ScreenScrape.aspx (source) and modify it to scrape the dates, headlines and links from the WWU News page and display the results neatly in a table. Tips:

  1. The MIS 424 Regular Expression page contains sample regular expressions and links to a number of articles. The article Microsoft Beefs Up VBScript with Regular Expressions is old but is still a good beginning guide to regular expressions. Pay particular attention to the concept of non-greedy search.
  2. The sample code uses .NET's "Singleline" regular expression option. This changes the meaning of a dot "." from "match every character except line breaks (\n)" to "match every character including line breaks."
  3. Define a regular expression that captures the text for an event (date, title, URL). Examine the HTML and identify a distinctive pattern that uniquely identifies the text that you want to capture.
  4. The next step is to parse out the information of interest. Regular expressions are a good way to do this or it can also be done using the C# string functions.  The sample code includes an example (commented out) showing how to use regular expressions to parse specific information from a string.
  5. Your regular expression may include surrounding HTML tags that need to be removed. Three options for cleaning the text are:
    1. Use another regular expression to find and replace the extra text with an empty string.
    2. Use C# string members (.Replace(), .IndexOf(), .LastIndexOf(), .Substring(), .Length, etc.).
    3. Use regular expression lookahead (?=) and lookbehind (?<=) syntax to include the HTML in the search but exclude it from the results. 
  6. Relative file paths in hyperlinks and images must often be changed to absolute file paths. The C# string replace method works well for this.
  7. Your page should display the results as shown in the example, including an item count, the event date in bold text and the event title as a working hyperlink. The results should be neatly displayed in a table.
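
Several of the tips above (non-greedy search, the Singleline option, lookahead and lookbehind, and converting relative paths) can be tried out with a short console sketch such as the following. The HTML fragment, class name, method names and site root are all hypothetical; the actual WWU News markup will differ, so the patterns here are only a starting point.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexTipsSketch
{
    // Hypothetical markup; note the line break inside the first link's text.
    public const string Sample =
        "<td class=\"date\">May 1</td><td><a href=\"/news/a.html\">Story\nA</a></td>\n" +
        "<td class=\"date\">May 2</td><td><a href=\"/news/b.html\">Story B</a></td>";

    // Counts how many times a pattern matches under the given options.
    public static int CountMatches(string html, string pattern, RegexOptions options)
    {
        return Regex.Matches(html, pattern, options).Count;
    }

    // Lookbehind and lookahead keep the surrounding tags out of the matched value.
    public static string FirstDate(string html)
    {
        return Regex.Match(html, "(?<=<td class=\"date\">).*?(?=</td>)").Value;
    }

    // Turns root-relative links into absolute links with string.Replace.
    public static string MakeLinksAbsolute(string html, string siteRoot)
    {
        return html.Replace("href=\"/", "href=\"" + siteRoot + "/");
    }

    public static void Main()
    {
        // Without Singleline the dot cannot cross the line break in the first link.
        Console.WriteLine(CountMatches(Sample, "<a.*?</a>", RegexOptions.None));       // 1
        // With Singleline the non-greedy pattern finds both links.
        Console.WriteLine(CountMatches(Sample, "<a.*?</a>", RegexOptions.Singleline)); // 2
        // The greedy version swallows everything from the first <a to the last </a>.
        Console.WriteLine(CountMatches(Sample, "<a.*</a>", RegexOptions.Singleline));  // 1
        Console.WriteLine(FirstDate(Sample));                                          // May 1
        Console.WriteLine(MakeLinksAbsolute(Sample, "http://www.example.com"));
    }
}
```

The greedy and non-greedy counts differ only because the sample contains two links; on a page with many events, a greedy pattern would typically collapse them all into a single match.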

4. GoogleSearch -- Google provides a free API that allows developers to utilize its powerful search engine. In this exercise you will create a Google search page. Google's search API uses AJAX to retrieve search results.  Steps:

  1. Google the phrase "google search api"
  2. Under "Google AJAX search API" click on "Web Search Samples"
  3. Click on "Start using the Google AJAX Search API" and follow the instructions for obtaining an API key.
  4. Create a search page using the source code from the sample above (with your key) or write your own code based upon the samples and documentation provided by Google. 
  5. Modify the formatting of the results using CSS. The HTML tags are generated by JavaScript and consequently are not visible when viewing the page source. Use Firebug to inspect the page elements in the Google results and modify at least three of them. The following image illustrates how to view the page elements. In this example, the mouse is hovering over the snippet and Firebug indicates that "gs-snippet" is the CSS class used to format it.

 

It would be useful to display the number of results found by each query, but the API is fairly new and the EstimatedResultCount property does not work properly yet.

To submit your assignment for grading send an email with the URLs for your assignment to:

  1. Professor Sandvig at . (note: this address is for homework assignments only - please send other correspondence to ).
  2. Cc a copy to yourself.

The subject line of your email should read "MIS424 AXX YourName" where XX is the assignment number. Please check that your URLs are correct before submitting them for grading. Files with incorrect URLs will not be graded.