Assignment 3 - Web services, screen scraping
1. SearchProducts.asmx - This web service allows users to query the
products in your RetailStore database. The web service uses the same
stored procedure as Search.aspx.
Steps:
- In the solution for your RetailStore add a new item "Web service."
Name it SearchProducts.asmx.
- Create a new WebMethod "SearchProducts" that uses a string parameter
named "query" and returns a DataSet.
- The next step is to use the stored procedure to fill a dataset. The class
handout GetTitlesSP.cs.txt illustrates how to programmatically call a
stored procedure, pass in a parameter and retrieve the data.
- The sample uses a stored procedure named "GetTitlesAZ" which you
will need to replace with the name of your stored procedure. The
sample also uses a parameter named "strQuery" which you will need to
change to the name of your parameter.
- The web service needs to return a DataSet rather than a
DataTable. DataTables cannot be returned via a WebMethod. See
the class handout Web Services (.doc) for how to add a DataTable
to a DataSet.
- Absolute image path: Users of your web service may want to
display your product images, so they need the full path to each
image. After populating the dataset, iterate through its rows
and change each image name to an absolute image path. The code
looks something like this:
foreach (DataRow dr in ds.Tables[0].Rows)
{
    dr["ImageName"] =
        "http://yorktown.cbe.wwu.edu/Sandvig/mis424/Assignments/RetailStore/productImages/"
        + dr["ImageName"] + ".thumb.jpg";
}
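The two dataset steps above (placing a DataTable inside a DataSet and rewriting the image names) can be tried without a database. The following is a minimal sketch, not the handout's code: the class name ProductDataHelper and its two method names are hypothetical, and it assumes the results table has a string column named "ImageName" as in the snippet above. The base URL is the one from the assignment text; replace it with the path to your own images.

```csharp
using System;
using System.Data;

// Hypothetical helper class; the names are illustrative, not from the handout.
public static class ProductDataHelper
{
    // Base URL copied from the assignment text. Replace with your own path.
    public const string ImageBaseUrl =
        "http://yorktown.cbe.wwu.edu/Sandvig/mis424/Assignments/RetailStore/productImages/";

    // Places a DataTable inside a new DataSet so that the WebMethod can
    // return a DataSet. The table must not already belong to another DataSet.
    public static DataSet WrapInDataSet(DataTable table)
    {
        DataSet ds = new DataSet("Products");
        ds.Tables.Add(table);
        return ds;
    }

    // Rewrites each bare image name into an absolute thumbnail URL.
    // Assumes the first table has a string column named "ImageName".
    public static void MakeImagePathsAbsolute(DataSet ds)
    {
        foreach (DataRow dr in ds.Tables[0].Rows)
        {
            dr["ImageName"] = ImageBaseUrl + dr["ImageName"] + ".thumb.jpg";
        }
    }
}
```

In the WebMethod you would fill the DataTable from your stored procedure, pass it to WrapInDataSet, call MakeImagePathsAbsolute on the result, and return the DataSet.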
2. ProductSearchWS.aspx - This page consumes the web service from the
previous exercise and displays the results. Steps:
- To consume the web service created in the previous exercise it
needs to be located on a public web server. Copy it to your Yorktown
account and test it.
- Use VS to create proxy classes that represent the web service. In VWD,
right-click on the root folder (C:\...) and select "Add Web
Reference." Paste the URL for your web service (on Yorktown) into the
dialog box and click "Go." Name it "SearchProductsWS" and click "Add
Reference."
- The previous step created classes that are proxies for the
remote web service. The web service is now part of your
application's class library and can be used like any other class.
- Add a new page to your RetailStore named ProductSearchWS.aspx.
Add a textbox, button and DataList control. You can copy the
ItemTemplate from default.aspx (remove the formatting for the image
since the full path will be provided).
- Populating the DataList: The DataList control may be populated
either programmatically using databinding or with a DataSource
control. The following instructions are for using a DataSource control.
- Click on the DataList's smart tag and select "Choose data
source."
- Choose data source type "Object" and click OK.
- From the drop-down list of classes in your application select
"SearchProductsWS.SearchProducts" and click OK.
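- Next configure the input and output parameters. In the "Define
Data Methods" box for "Select" choose the method that takes the query
string and returns a DataSet, for example "Search(String strQuery),
returns DataSet". The method and parameter names depend upon how you
named them in your web service; if you used the names from exercise 1
they will be SearchProducts and query.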
- Configure the parameter to retrieve the query from the search textbox.
Provide a default value of gibberish that will not match any products.
This prevents an error when the page initially loads and the
parameter is null.
- Test it. When you copy the page to Yorktown you will also need to copy
the proxy "SearchProductsWS" from the App_WebReferences folder.
Screen Scraping
In some situations the data that we need (or want) is not available via web services or an RSS
feed. In these situations screen scraping may be used to programmatically retrieve
information from a web page. Search engines, meta-shopping
sites, spammers and others use screen scraping to collect information from
web sites.
Screen scraping has many legitimate uses, including: transferring
information between servers, communicating with legacy systems, creating
indexes and communicating between incompatible technologies. The class
web site, for example, uses screen scraping to display the class roster,
which for security reasons is located on a different server (the source
page is on Saratoga and the scraped page is displayed on Yorktown).
Screen scraping is a two-step process:
- Retrieve the desired web page from a remote server.
- Parse the returned HTML to "scrape"
out the relevant data. Capturing the desired data is typically
achieved using regular expressions.
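The two steps can be sketched as follows. This is an illustrative sketch rather than the class sample: the class name ScrapeSketch, its method names and the link pattern are assumptions, and real pages usually need a pattern tailored to their HTML. It uses WebClient for the retrieval step; newer .NET versions recommend HttpClient instead.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical class; names and pattern are illustrative assumptions.
public static class ScrapeSketch
{
    // Step 1: retrieve the raw HTML of a page from a remote server.
    public static string FetchHtml(string url)
    {
        using (WebClient client = new WebClient())
        {
            return client.DownloadString(url);
        }
    }

    // Step 2: scrape the URL and link text out of each anchor tag.
    // Returns (url, text) pairs in the order they appear in the HTML.
    public static List<KeyValuePair<string, string>> ExtractLinks(string html)
    {
        // ".*?" is non-greedy, so each match stops at the first closing </a>.
        // Singleline lets "." match line breaks inside the link text.
        Regex linkPattern = new Regex(
            "<a\\s[^>]*href=\"(?<url>[^\"]*)\"[^>]*>(?<text>.*?)</a>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        List<KeyValuePair<string, string>> links =
            new List<KeyValuePair<string, string>>();
        foreach (Match m in linkPattern.Matches(html))
        {
            links.Add(new KeyValuePair<string, string>(
                m.Groups["url"].Value,
                m.Groups["text"].Value.Trim()));
        }
        return links;
    }
}
```

Keeping the two steps in separate methods makes it possible to test the regular expression against a saved copy of the HTML without requesting the remote page each time.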
Collecting data from another party's web site has the potential for
violating copyright and other property laws. Many web sites publish
their usage policies on their sites. You should read these policies
before collecting information from a web site. The legal issues
associated with using information obtained from web sites are complex
and beyond the scope of this class.
3. ScrapeWWUNews.aspx - This page scrapes WWU news and displays the
links. Copy ScreenScrape.aspx (source) and modify it to scrape the
dates, headlines and links from the WWU News page and display the
results neatly in a table.
Tips:
- The MIS 424 Regular Expression page contains sample
regular expressions and links to a number of articles. The article
Microsoft Beefs Up VBScript with Regular Expressions is old but
is still a good beginning guide to regular expressions. Pay
particular attention to the concept of non-greedy search.
- The sample code uses .NET's "Singleline" regular expression
option. This changes the meaning of a dot "." from "match every
character except line breaks (\n)" to "match every character
including line breaks."
- Define a regular expression that captures the text for an event (date, title, URL).
Examine the HTML and
identify a distinctive pattern that uniquely identifies the text that
you want to capture.
- The next step is to parse out the information of interest.
Regular expressions are a good way to do this, although it can also
be done using the C# string functions. The sample code includes an
example (commented out) showing how to use regular expressions to
parse specific information from a string.
- Your regular expression match may include surrounding HTML tags that
need to be removed. Three options for cleaning the text are:
- Use another regular expression to find and replace the extra
text with an empty string.
- Use C# string members (.Replace(), .IndexOf(),
.LastIndexOf(), .Substring(), .Length, etc.).
- Use regular expression lookahead (?=) and lookbehind (?<=) syntax to
include the HTML in the search but exclude it from the results.
- Relative file paths in hyperlinks and images must often be changed
to absolute file paths. The C# string Replace method works well for this.
- Your page should display the events as shown in the example,
including: item count, event date in bold text, and event title as a
working hyperlink. The results should be neatly displayed in a table.
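The three cleanup options and the relative-to-absolute path fix from the tips above can be sketched as follows. This is a sketch under assumptions, not the class sample: the class name CleanupSketch, the method names and the siteRoot parameter are hypothetical, and the helpers assume simple, well-formed fragments like the ones an event match might contain.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helpers; names and inputs are illustrative assumptions.
public static class CleanupSketch
{
    // Option 1: use a second regular expression to replace every tag
    // with an empty string.
    public static string StripTags(string fragment)
    {
        return Regex.Replace(fragment, "<[^>]+>", "").Trim();
    }

    // Option 2: use string members. Returns the text between the first '>'
    // and the last '<'. Assumes the input is a single element such as
    // "<b>text</b>".
    public static string InnerText(string element)
    {
        int start = element.IndexOf('>') + 1;
        int end = element.LastIndexOf('<');
        return element.Substring(start, end - start).Trim();
    }

    // Option 3: lookbehind (?<=...) and lookahead (?=...) require the tags
    // to be present but leave them out of the matched value.
    public static string BetweenTags(string html, string tagName)
    {
        string tag = Regex.Escape(tagName);
        string pattern = "(?<=<" + tag + ">).*?(?=</" + tag + ">)";
        return Regex.Match(html, pattern, RegexOptions.Singleline).Value;
    }

    // Changes root-relative links (href="/...") into absolute links.
    // siteRoot is the scheme and host without a trailing slash.
    public static string MakeLinksAbsolute(string html, string siteRoot)
    {
        return html.Replace("href=\"/", "href=\"" + siteRoot + "/");
    }
}
```

Which option is simplest depends on the HTML being scraped: the replace approach handles nested tags, while the lookaround approach avoids a second pass when the surrounding tags are predictable.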
4. GoogleSearch - Google provides a free API that
allows developers to utilize its powerful search engine. In this
exercise you will create a Google search page. Google's search API uses
AJAX to retrieve search results. Steps:
- Google the phrase "google search api"
- Under "Google AJAX search API" click on "Web Search Samples"
- Click on "Start using the Google AJAX Search API" and follow the
instructions for obtaining an API key.
- Create a search page using the source code from the sample above
(with your key) or write your own code based upon the samples and
documentation provided by Google.
- Modify the formatting of the results using CSS. The HTML tags
are generated using JavaScript and consequently are not visible when
viewing the page source code. Use
Firebug to inspect the page elements in the Google results and
modify at least three of them. The following image illustrates how
to view the page elements. In this example, the mouse is hovering
over the snippet and Firebug indicates that "gs-snippet" is the CSS
class used to format it.
It would be useful to display the number of results found by each
query but the API is fairly new and the EstimatedResultCount property
does not work properly yet.
To submit your assignment for grading send an email with the URLs for your assignment to:
- Professor Sandvig at
.
(note: this address is for homework assignments only - please send
other correspondence to
).
- Cc a copy to yourself.
The subject line of your email should read "MIS424 AXX YourName" where XX
is the assignment number. Please check that your URLs are correct before submitting them for grading.
Files with incorrect URLs will not be graded.