Backend Web Crawler Training

What I Learned

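Used a simple regular expression, if (scanner.hasNext(".*https://www.aixiaxsw.com/.*")), to check whether the input is a valid URL of a novel page on 爱下书小说网 (aixiaxsw.com). Note that an unescaped dot in a regex matches any character, so writing \\. inside the Java string is stricter. (Regular expressions are hard.)
System.exit(status); terminates the program immediately. The "status:" label that IDEA displays is only a parameter hint, not Java syntax; a nonzero status conventionally signals an abnormal exit.
Wrapping code in try { ... } catch (...) { ... } handles an exception that might be thrown, so the program can keep running. If the catch block only prints the stack trace, any variable that was supposed to be assigned inside the try block may still be null afterwards.
OutputStream is the abstract base class of Java's byte output streams. A subclass such as FileOutputStream can write the bytes of a string, or the contents of another file, into a new file.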
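
The URL check from the first point can be tried out on its own. The following is a minimal sketch, not the project's code: the class name UrlCheckSketch, its method names, and the /0/1/ path are assumptions made up for illustration. It shows why escaping the dots matters and how Scanner.hasNext can take a precompiled Pattern.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of the original project.
class UrlCheckSketch {

    // The dots are escaped so they match a literal '.', not "any character".
    static final Pattern NOVEL_URL =
            Pattern.compile(".*https://www\\.aixiaxsw\\.com/.*");

    // True only if the whole string matches the pattern.
    static boolean looksLikeNovelUrl(String input) {
        return NOVEL_URL.matcher(input).matches();
    }

    // Returns the next whitespace-delimited token if it matches, otherwise null.
    static String readUrl(Scanner scanner) {
        return scanner.hasNext(NOVEL_URL) ? scanner.next() : null;
    }

    public static void main(String[] args) {
        // The path /0/1/ is a made-up example, not a real novel page.
        System.out.println(looksLikeNovelUrl("https://www.aixiaxsw.com/0/1/")); // true
        // With unescaped dots this lookalike would be accepted; here it is rejected.
        System.out.println(looksLikeNovelUrl("https://wwwXaixiaxsw.com/0/1/")); // false
    }
}
```

Compiling the pattern once and reusing it keeps the check in a single place, so the same rule applies whether the input comes from a Scanner or from a plain string.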

Problems I Ran Into

How does the fat .jar that IDEA builds from a Maven project actually run?
What causes the IOException (named ewww in the catch blocks) that the code has to catch?
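
On the second question, a partial answer as far as I understand it: IOException is a checked exception that I/O operations throw when they fail, and Jsoup's Connection.get() is declared to throw it, so the compiler insists that it be caught or declared. The sketch below reproduces the same mechanism with plain file I/O and no network access. The class name IoExceptionSketch and the file names are assumptions for illustration only.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class for illustration; not part of the original project.
class IoExceptionSketch {

    // Tries to write text to the given path and reports failure instead of crashing.
    static boolean tryWrite(Path target, String text) {
        // try-with-resources closes the stream even if write() throws
        try (OutputStream out = new FileOutputStream(target.toFile())) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
            return true;
        } catch (IOException e) {
            // FileNotFoundException is a subclass of IOException, so it lands here too
            System.out.println("I/O failed: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("io-sketch");
        // Writing into an existing directory succeeds
        System.out.println(tryWrite(dir.resolve("ok.md"), "# title\n"));
        // The parent directory "missing" does not exist, so opening the stream fails
        System.out.println(tryWrite(dir.resolve("missing").resolve("fail.md"), "# title\n"));
    }
}
```

The exception is raised by the operation itself (a file that cannot be opened, a connection that cannot be made), and the catch block is simply where the program decides what to do next.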

Project Code

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class SpiderMan {

    public static void main(String[] args) throws IOException {
        final String fileAddr = "./";
        final String mainLink = "https://www.aixiaxsw.com";
        String menuUrl = null;

        Scanner scanner = new Scanner(System.in);
        System.out.println("Enter the URL of the novel on 爱下书小说网 that you want to crawl");
        // The dots are escaped so that they only match a literal '.'
        if (scanner.hasNext(".*https://www\\.aixiaxsw\\.com/.*")) {
            menuUrl = scanner.next();
        } else {
            System.out.println("Please enter a valid novel page URL from 爱下书小说网, meow");
            System.exit(111);
        }

        // Fetch the novel's table-of-contents page. Nothing can be crawled without it,
        // so an IOException here is left to propagate out of main.
        Document document = Jsoup.connect(menuUrl).get();
        String title = document.body().selectFirst("h1").text();
        System.out.println("Starting to crawl: " + title);
        Elements menu = document.body().select("#list");
        Elements aURL = menu.select("a[href]");

        System.out.println("The novel will be saved to: " + fileAddr + title + ".md");
        File file = new File(fileAddr + title + ".md");

        // try-with-resources closes the file even if a write fails
        try (OutputStream fileOut = new FileOutputStream(file)) {
            fileOut.write(("# " + title + "\n\n").getBytes(StandardCharsets.UTF_8));

            int count = 1;
            for (Element a : aURL) {
                // Skip the first 6 links found in #list
                if (count <= 6) {
                    count++;
                    continue;
                }
                String subLink = a.attr("href");
                String chapterName = a.text();
                System.out.println("Crawling chapter: " + chapterName);

                Document chapter;
                try {
                    chapter = Jsoup.connect(mainLink + subLink).get();
                } catch (IOException ewww) {
                    // Report the failed chapter and move on to the next one
                    ewww.printStackTrace();
                    continue;
                }
                Element chapterContent = chapter.selectFirst("#content");
                if (chapterContent == null) {
                    continue; // this page has no #content element
                }
                fileOut.write(("\n\n## " + chapterName + "\n\n " + chapterContent.wholeText())
                        .getBytes(StandardCharsets.UTF_8));
            }
        }

        System.out.println("Finished crawling the novel!!!");
    }
}
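
The crawler builds each chapter address by concatenating mainLink and subLink, which works as long as every href on the page starts with a slash. As a hedged alternative, the sketch below resolves an href against the page URL with java.net.URI, so root-relative, page-relative, and absolute links all end up as full URLs. The class name LinkResolveSketch and the example paths are assumptions for illustration; jsoup also offers absUrl("href") on elements for the same purpose.

```java
import java.net.URI;

// Hypothetical helper class for illustration; not part of the original project.
class LinkResolveSketch {

    // Resolves an href taken from a page against that page's own URL.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "https://www.aixiaxsw.com/0/1/"; // made-up menu page URL
        // All three calls print https://www.aixiaxsw.com/0/1/2.html
        System.out.println(resolve(page, "/0/1/2.html"));                         // root-relative
        System.out.println(resolve(page, "2.html"));                              // page-relative
        System.out.println(resolve(page, "https://www.aixiaxsw.com/0/1/2.html")); // already absolute
    }
}
```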

P.S. The jsoup dependency is declared in pom.xml.

Slacking off now QAQ

Eclipse

软件园学生在线

[Backend 2] 赵昱琨
