I'm developing an application in which the user enters the URL of a website, and the application then has to analyze that page.
How can I get access to the HTML using Java? Do I need to use HttpRequest? How does that work?
Thanks.
-
You could just use a URLConnection; see the Java Tutorial from Sun.
-
You can use java.net.URL and then open an input stream to read the HTML from the server. See the example here.
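For what it's worth, here is a minimal sketch of that approach. The class name PageReader and the method readPage are made up for illustration, and decoding the bytes as UTF-8 is an assumption; the page may well use a different encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
class PageReader {

    /**
     * Opens a stream to the given URL and returns everything it serves as text.
     * Assumption: the content is UTF-8 encoded.
     */
    public static String readPage(URL url) throws IOException {
        StringBuilder html = new StringBuilder();
        // try-with-resources closes the reader (and the underlying stream) for us
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                html.append(buffer, 0, count);
            }
        }
        return html.toString();
    }
}
```

Calling it would look something like `String html = PageReader.readPage(new URL("http://stackoverflow.com"));`. Note that openStream() gives you no control over timeouts, so this is only suitable for simple cases.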
-
URLConnection is fine for simple cases. When things like redirects are involved, you are better off using Apache's HttpClient.
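To illustrate why redirects get awkward with the plain JDK classes: HttpURLConnection follows redirects by default, but not when the redirect switches protocol (for example from http to https). If you want to stay with the JDK, you end up following them by hand, roughly as sketched below. The class name RedirectFollower, the limit of 5 hops and the 5-second timeouts are arbitrary choices for this sketch, and it assumes every hop is an http or https URL.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, not part of any library.
class RedirectFollower {

    private static final int MAX_REDIRECTS = 5; // arbitrary limit for this sketch

    /** Resolves a Location header value (absolute or relative) against the URL that sent it. */
    public static URL resolve(URL base, String location) throws IOException {
        return new URL(base, location);
    }

    private static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    /**
     * Opens the URL, following redirects manually (including http -> https).
     * Assumption: every URL in the chain uses http or https.
     */
    public static HttpURLConnection open(URL url) throws IOException {
        URL current = url;
        for (int hop = 0; hop <= MAX_REDIRECTS; hop++) {
            HttpURLConnection conn = (HttpURLConnection) current.openConnection();
            conn.setInstanceFollowRedirects(false); // we handle redirects ourselves
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            int status = conn.getResponseCode();
            if (!isRedirect(status)) {
                return conn; // caller reads the body and disconnects
            }
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (location == null) {
                throw new IOException("Redirect without Location header from " + current);
            }
            current = resolve(current, location);
        }
        throw new IOException("Too many redirects starting at " + url);
    }
}
```

This does not deal with cookies, authentication or retries, which is where a dedicated HTTP client library starts to pay off.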
-
This code downloads data from a URL, treating it as binary content:
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URL;

    public class Download {

        // Streams the raw bytes served at 'input' into the file 'output'.
        private static void download(URL input, File output) throws IOException {
            InputStream in = input.openStream();
            try {
                OutputStream out = new FileOutputStream(output);
                try {
                    copy(in, out);
                } finally {
                    out.close();
                }
            } finally {
                in.close();
            }
        }

        // Copies everything from 'in' to 'out' in 1 KB chunks.
        private static void copy(InputStream in, OutputStream out) throws IOException {
            byte[] buffer = new byte[1024];
            while (true) {
                int readCount = in.read(buffer);
                if (readCount == -1) {
                    break; // end of stream
                }
                out.write(buffer, 0, readCount);
            }
        }

        public static void main(String[] args) {
            try {
                URL url = new URL("http://stackoverflow.com");
                File file = new File("data");
                download(url, file);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

The downside of this approach is that it ignores any metadata, such as the Content-Type, which you would get from using HttpURLConnection (or a more sophisticated API, like the Apache one).
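As a rough illustration of what that metadata buys you, the sketch below reads the Content-Type through URLConnection and uses its charset parameter to decode the page as text. The class name CharsetAwareFetch and both method names are hypothetical, and falling back to UTF-8 when no usable charset is declared is an assumption of this sketch, not something the HTTP headers guarantee.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of any library.
class CharsetAwareFetch {

    /**
     * Extracts the charset from a Content-Type value such as
     * "text/html; charset=ISO-8859-1". Falls back to UTF-8 (an assumption)
     * when the header is missing or names an unknown charset.
     */
    public static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // malformed or unsupported charset name
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    /** Fetches the URL and decodes the body using the declared charset. */
    public static String fetch(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        conn.setConnectTimeout(5000); // arbitrary timeouts for this sketch
        conn.setReadTimeout(5000);
        try (InputStream in = conn.getInputStream();
             Reader reader = new InputStreamReader(in, charsetOf(conn.getContentType()))) {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
            return text.toString();
        }
    }
}
```

Bear in mind that HTML pages can also declare their encoding in a meta tag inside the document, which this sketch does not look at.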
To parse the HTML, you'll need either a specialized HTML parser that can handle poorly formed markup, or to tidy the markup first and then parse it with an XML parser.
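As one possible starting point that needs no extra libraries, the JDK ships a lenient HTML parser in the Swing packages. The sketch below uses it to collect the href values of the links in a page. The class name LinkExtractor is made up for this example, and this parser is quite dated, so a dedicated third-party HTML parser is likely the better choice for real-world pages.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of any library.
class LinkExtractor {

    /** Returns the href attribute of every <a> start tag found in the given HTML. */
    public static List<String> extractLinks(String html) throws IOException {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        // 'true' tells the parser to ignore any charset declared in the document,
        // since we already hand it decoded text.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return links;
    }
}
```

The callback style means you never build a full document tree; you just react to the tags you care about as the parser encounters them.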
-
Funnily enough, I wrote a utility method that does just that the other week:

    /**
     * Retrieves the file specified by <code>fileURL</code> and writes it to
     * <code>out</code>.
     * <p>
     * Does not close <code>out</code>, but does flush.
     * @param fileURL The URL of the file.
     * @param out An output stream to capture the contents of the file
     * @param batchWriteSize The number of bytes to write to <code>out</code>
     *                       at once (larger files than this will be written
     *                       in several batches)
     * @throws IOException If call to web server fails
     * @throws FileNotFoundException If the call to the web server does not
     *                               return status code 200.
     */
    public static void getFileStream(String fileURL, OutputStream out, int batchWriteSize)
            throws IOException {
        GetMethod get = new GetMethod(fileURL);
        HttpClient client = new HttpClient();
        HttpClientParams params = client.getParams();
        params.setSoTimeout(2000);
        client.setParams(params);
        try {
            try {
                client.executeMethod(get);
            } catch (ConnectException e) {
                // Add some context to the exception and rethrow
                throw new IOException("ConnectException trying to GET " + fileURL, e);
            }
            if (get.getStatusCode() != 200) {
                throw new FileNotFoundException("Server returned " + get.getStatusCode());
            }
            // Get the input stream
            BufferedInputStream bis = new BufferedInputStream(get.getResponseBodyAsStream());
            try {
                // Read the file and stream it out
                byte[] b = new byte[batchWriteSize];
                int bytesRead = bis.read(b, 0, batchWriteSize);
                while (bytesRead != -1) {
                    out.write(b, 0, bytesRead);
                    bytesRead = bis.read(b, 0, batchWriteSize);
                }
            } finally {
                bis.close(); // Release the input stream.
            }
            out.flush();
        } finally {
            // Hand the connection back even if something above threw
            get.releaseConnection();
        }
    }

It uses the Apache Commons HttpClient library, along with a few JDK classes, i.e.:

    import java.io.BufferedInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.ConnectException;

    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.methods.GetMethod;
    import org.apache.commons.httpclient.params.HttpClientParams;