Topical web-page classification



Predicting the topics of web-pages based on their textual content turns out to be interesting and difficult. There is a lot of existing research giving broad overviews of what has been done, as well as specifics of particular techniques (e.g. document summarization or page-hierarchy inference).

I attempt to reproduce and combine many of these results from existing literature. This post contains a short summary of my results.

Setup

I worked with a subset of the DMOZ dataset, which is an XML document containing over 3,600,000 web pages labeled with their topic (e.g. Arts/Anime/Japanese) and paired with a short description of each website. I used an open-source parser to convert the data to a nicer JSON format. Then I trained a classifier using Mallet, an open-source language toolkit.

Goals

My goal was to look at existing techniques in web-page classification and run them myself to see how they compared. I was also interested in comparing classification performance on full web-pages versus on summarized versions of those pages.

Working with summaries offers two key benefits:

- More condensed and relevant language
- Less space required to represent a web-page

I ended up evaluating the effects of varying n-gram sizes and information-gain on both web-pages and descriptions of web-pages. I also experimented with various web-page summarization techniques.

N-gram features

The most basic way to represent a document is as a bag of words. This means you just store the words that appear in the document without preserving the order they appeared in. For example, “Apples to Apples is fun” might get converted to {to: 1, Apples: 2, fun: 1, is: 1}.

Bag of words is simple, intuitive, and occupies a relatively small feature space (proportional to the size of your vocabulary).
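
To make this concrete, here is a minimal sketch of a bag-of-words featurizer using Java 8 streams (the same style the later posts on this blog use). It assumes whitespace tokenization only; a real pipeline like Mallet's would also lowercase, strip punctuation, and remove stop words.

import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class BagOfWords {

    // Count how many times each token appears, discarding word order.
    public static Map<String, Long> bagOfWords(String document) {
        return Arrays.stream(document.split("\\s+"))
                     .collect(Collectors.groupingBy(
                             Function.identity(),
                             Collectors.counting()));
    }

    public static void main(String[] args) {
        // Prints {to=1, Apples=2, fun=1, is=1} (map ordering may vary)
        System.out.println(bagOfWords("Apples to Apples is fun"));
    }
}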

An n-gram is a document featurization that chains together every n consecutive items in a sequence. For example, a 2-gram (often called bigram) representation of “Apples to Apples is fun” might be {"Apples to": 1, "to Apples": 1, "Apples is": 1, "is fun": 1}.
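
As a sketch under the same whitespace-tokenization assumption, an n-gram featurizer just slides a window of size n over the token array:

import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class NGrams {

    // Join every run of n consecutive tokens into one feature and count it.
    public static Map<String, Long> nGrams(String document, int n) {
        String[] tokens = document.split("\\s+");
        return IntStream.rangeClosed(0, tokens.length - n)
                        .mapToObj(i -> String.join(" ",
                                Arrays.copyOfRange(tokens, i, i + n)))
                        .collect(Collectors.groupingBy(
                                Function.identity(),
                                Collectors.counting()));
    }

    public static void main(String[] args) {
        // Prints {Apples to=1, to Apples=1, Apples is=1, is fun=1}
        System.out.println(nGrams("Apples to Apples is fun", 2));
    }
}

Note that nGrams(document, 1) reduces to the bag-of-words representation above.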

The n-gram feature space grows exponentially with n (a vocabulary of V distinct words yields up to V^n possible n-grams), meaning that training a classifier takes a lot longer.

In my experiments, I found no accuracy benefit for using n-gram analysis, though I did find that training time more than doubled.

Information gain

Information gain is a feature-selection technique: it scores each feature by how much knowing its value reduces uncertainty (entropy) about the class label, and keeps only the top-scoring features. The effect is to project highly dimensional vectors into a lower-dimensional space while preserving as much of the class-relevant information as possible. This turns out to be very useful for the very sparse, high-dimensional vectors that come up frequently when modeling the features of text documents.

With information gain, I reduced training time for my classifier by over 80% while preserving near-equivalent accuracy results. This is super cool because it means you’re representing the same amount of real information with less actual space.
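
Concretely, for each candidate feature f you compute IG(C; f) = H(C) - H(C | f), i.e. how much knowing whether f occurs reduces your uncertainty about the class C. Below is a minimal sketch of that computation (my own illustration, not Mallet's implementation), treating each feature as a binary term-occurrence indicator:

import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class InfoGain {

    // Entropy (in bits) of a class distribution given by raw counts.
    static double entropy(Collection<Integer> counts) {
        double total = counts.stream().mapToInt(Integer::intValue).sum();
        return counts.stream()
                     .filter(c -> c > 0)
                     .mapToDouble(c -> {
                         double p = c / total;
                         return -p * (Math.log(p) / Math.log(2));
                     })
                     .sum();
    }

    // IG(C; f) = H(C) - H(C | f), where f is a binary "term occurs" indicator.
    static double informationGain(List<Set<String>> docs, List<String> labels,
                                  String feature) {
        Map<String, Integer> all = new HashMap<>();
        Map<String, Integer> with = new HashMap<>();
        Map<String, Integer> without = new HashMap<>();
        for (int i = 0; i < docs.size(); i++) {
            all.merge(labels.get(i), 1, Integer::sum);
            // Split the class counts by whether the document contains the feature.
            (docs.get(i).contains(feature) ? with : without)
                    .merge(labels.get(i), 1, Integer::sum);
        }
        double n = docs.size();
        double nWith = with.values().stream().mapToInt(Integer::intValue).sum();
        return entropy(all.values())
                - (nWith / n) * entropy(with.values())
                - ((n - nWith) / n) * entropy(without.values());
    }
}

You would then rank all features by this score and keep only the top few thousand, which is what shrinks the feature space (and training time) so dramatically.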

Web-page summarization techniques

The first technique, which vastly outperformed all others, was using the handwritten summaries provided in the dataset. These achieved a maximum of 79.129% accuracy across all experiments.

I compared seven automated summarization techniques to see how each performed. None exceeded the accuracy of no summarization, but some came reasonably close (e.g. Luhn and latent semantic analysis). The best summarization technique was still around 3% less accurate than no summarization, and around 8% less accurate than handwritten summarization. All summarization techniques cut training time by around 80%.

Conclusions

I found that some results stated in the literature can be quite hard to reproduce. I did not manage to increase baseline accuracy by applying any of the techniques I read about. I did, however, manage to preserve equivalent accuracy while cutting training time by over 80%.

You can read the full paper here.

Text-backdrop: contextualize any text

Text-backdrop is a tool I developed as my contribution to Well Versed, a hackathon project that won third place at Hack Dartmouth 2015. Well Versed is a browser extension that, when activated, aggregates relevant information about the topic of the current web-page and displays it in a sidebar. Thus, anyone can become well versed about any topic they are perusing: politics, science, etc.

Text-backdrop is the part of the project that analyzes the given source text and gathers additional information. It uses natural language processing techniques to extract relevant information from the text, then asynchronously queries several APIs (Wikipedia, Bing news/images, etc.) to aggregate and return all contextually relevant information in a single JSON document.

Where to find text-backdrop

You can try it out on NPM, or check out the source on GitHub. To play around with it locally, you simply need to run npm install text-backdrop and you have it!

Writing find with Java 8 streams

Intro

In this guide, we’re going to write our very own simplified find command using Java 8 streams. By the end we will be able to run it from the command line, just like the unix find program.

Pre-reqs

This post assumes you are at least a little familiar with Java 8 streams and regular expressions. We’re actually going to review find, so it’s ok if you do not yet know it.

What is find?

According to Wikipedia, “find is a command-line utility that searches through one or more directory trees of a file system, locates files based on some user-specified criteria and applies a user-specified action on each matched file.”

So if we wanted to find all the markdown files in the kahlil.me directory, we could run:

find . -name "*.md"

and we would get as a result:

./posts/2015-07-18-hello-world.md
./posts/2015-07-18-java-8-stream-sql.md
./posts/2015-07-21-java-8-find-and-grep-with-streams.md
./about.md
./archive.md
./LICENSE.md
./README.md

To keep things simple, our find won’t be able to perform any user-specified actions on the files it finds. Instead, it will simply take a starting directory and a pattern, then print out the full paths of any files that match the pattern and are somewhere in that directory.

Find implementation

find method

This is where all of our find logic resides. It turns out to be pretty straightforward.

We’re going to use java.nio.file.Files::walk to recursively traverse all subdirectories of the starting directory we pass it. Then, we’re going to filter the results and only keep the files that match the pattern we passed.

(note: I’ve omitted the throws clause to keep the code clean).

public static Stream<Path> find(String dir, String pattern) {
    // Recursively walk the directory tree rooted at dir, keeping only
    // the paths whose file name matches the given pattern.
    return Files.walk(Paths.get(dir))
                .filter(path -> path.getFileName()
                                    .toString()
                                    .matches(pattern));
}

main method

All we need to do in our main method is pass the command line arguments to find, and print the results.

public static void main(String[] args) throws IOException {
    find(args[0], args[1]).forEach(System.out::println);
}

Running find

Let’s use Find.java to find all css files that style this very website.

java Find ~/dev/projects/kahliloppenheimer.github.io ".*\.css"
/Users/kahliloppenheimer/dev/projects/kahliloppenheimer.github.io/_site/assets/css/browser-support.css
/Users/kahliloppenheimer/dev/projects/kahliloppenheimer.github.io/_site/assets/css/main.css

Writing grep with Java 8 streams

Intro

We’re going to write our very own grep command using Java 8 streams. By the end, we will be able to run it on the command line, just like the normal UNIX grep program.

Pre-reqs

This post assumes you have some familiarity with Java 8 streams and a basic understanding of regular expressions. We’re actually going to review grep, so it’s ok if you do not yet know it.

What is grep?

According to Wikipedia, “grep is a command-line utility for searching plain-text data sets for lines matching a regular expression.” And if you were wondering, grep stands for globally search a regular expression and print.

For example,

grep ".*public.*" MyJavaClass.java

might yield

public class MyJavaClass {
public static void main(String[] args) {
public static final int MY_INT_CONST = 3;

As input, grep takes a pattern and some text to search (maybe a file, maybe the output of another program). As output, it then prints out all lines of the input text that match the pattern.

Our grep implementation will only take files as input, and will only use regular expressions as patterns.

Not so bad, right?

Grep implementation

The grep method

This is where we will write the bulk of the logic of our program. With streams, the method body is just one statement, spread across two lines. (Note: I omitted the throws clause to keep the code clean.)

public static Stream<String> grep(String pattern, String fileName) {
    // Stream the file's lines, keeping only those that match the pattern.
    return Files.lines(Paths.get(fileName))
                .filter(line -> line.matches(pattern));
}

Fortunately, Java 8 shipped with some pretty sweet stream-based I/O functions, like java.nio.file.Files::lines. This function takes a path and returns a stream of the lines of that file.

All we need to do after that is filter those lines by matching on our pattern (via String::matches, built into java.lang.String), and we’re done!

The main method

Now, we can give our Grep class a main method so that we can call it from the command line.

public static void main(String[] args) throws IOException {
    grep(args[0], args[1]).forEach(System.out::println);
}

All we’re doing here is calling our grep function, then printing out each line in the stream that gets returned.

The whole program

Putting it all back together, our grep program in its entirety looks like:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class Grep {

    public static void main(String[] args) throws IOException {
        grep(args[0], args[1]).forEach(System.out::println);
    }

    public static Stream<String> grep(String pattern, String fileName)
            throws IOException {
        return Files.lines(Paths.get(fileName))
                    .filter(line -> line.matches(pattern));
    }

}

Running our grep program

Now, we can find all the lines that have the word ‘public’ in Grep.java, using Grep.java.

java Grep ".*public.*" Grep.java

and see as output:

public class Grep {
    public static void main(String[] args) throws IOException {
    public static Stream<String> grep(String pattern, String fileName)

SQL on Java Collections with streams

Intro

We’re going to look at an example of how to represent SQL queries over Java collections using streams from Java 8. Pretty cool, right?

Pre-reqs

This post assumes you have some familiarity with basic SQL and with Java 8 lambda expressions and streams.

Getting our hands dirty

Let’s say we have a list of key-value pairs defined to mark the number of bugs that each team member has squashed. Think of it like a map that can contain duplicate key values. We can define it and initialize it as such:

// Let's use a stream in an SQL-esque kind of way
List<SimpleEntry<String, Integer>> bugTable =
        new ArrayList<SimpleEntry<String, Integer>>();
bugTable.add(new SimpleEntry<>("Arthur", 2));
bugTable.add(new SimpleEntry<>("Kahlil", 8));
bugTable.add(new SimpleEntry<>("Tom", 5));
bugTable.add(new SimpleEntry<>("Gyula", 20));
bugTable.add(new SimpleEntry<>("Arthur", 25));
bugTable.add(new SimpleEntry<>("Jeff", 17));
bugTable.add(new SimpleEntry<>("Jeff", 18));
bugTable.add(new SimpleEntry<>("Tom", 15));

Now, let’s say we want to retrieve the names of the team members and the total number of bugs they have squashed. But let’s add two constraints:

  1. We only sum up entries that have less than 25 bugs squashed, because 25 or higher seems suspiciously high.

  2. We only report team member names with a total of 5 or more bugs squashed.

We can express this fairly easily in SQL:

SELECT name, SUM(bugs_squashed)
FROM bugTable
WHERE bugs_squashed < 25
GROUP BY name
HAVING SUM(bugs_squashed) >= 5;

We can express this same logic with Java 8 streams.

bugTable.stream()
        .filter(se -> se.getValue() < 25)
        .collect(Collectors.groupingBy(
                SimpleEntry<String, Integer>::getKey,
                Collectors.summingInt(SimpleEntry<String, Integer>::getValue)))
        .entrySet().stream()
        .filter(entry -> entry.getValue() >= 5)
        .forEach(System.out::println);

You’re probably either thinking that Java 8 streams are amazing, or that you have no idea why anyone would want to use them. Either way, we’ll now go into more details about what’s actually happening.

Detailed explanation

Let’s break it down.

bugTable.stream()
.filter(se -> se.getValue() < 25)

We create a stream from our list of pairs. Then, we filter out all the pairs that have 25 or more bugs squashed.

.collect(Collectors.groupingBy(
        SimpleEntry<String, Integer>::getKey,
        Collectors.summingInt(SimpleEntry<String, Integer>::getValue)))
.entrySet().stream()

While seemingly tricky, this isn’t so bad if we look at it piece by piece.

First, collect lets us store our stream back into a Java collections object. By passing in Collectors.groupingBy, we are storing our stream into a map, where we smoosh all pairs with common keys into aggregated entries in our map.

        SimpleEntry<String, Integer>::getKey,

This first argument to Collectors.groupingBy is a function (via Java 8 method-reference syntax) that defines the grouping key for our final aggregated hash-map as the result of calling getKey on each entry in our stream.

        Collectors.summingInt(SimpleEntry<String, Integer>::getValue)))

This second argument to Collectors.groupingBy describes how to aggregate each set of entries in our map that has the same key.

We are saying that we want to add up the number of bugs squashed. And we pass the function SimpleEntry<String, Integer>::getValue to say that the values of our key-value pairs are what we want to sum up.

.entrySet().stream()

Finally, this takes our aggregate result map, and turns it back into a stream for further processing.

Admittedly, it’s unfortunate that we have to store our data into a map, then read it back into a stream mid-way through our calculation, but collect() is a terminal operation, so there is no way to keep streaming without materializing the intermediate map.

.filter(entry -> entry.getValue() >= 5)
.forEach(System.out::println);

The last step is straightforward. It filters out any aggregated entries that have fewer than 5 squashed bugs, then prints each remaining entry.

Conclusion

Do not worry if some of the details (like collectors) seem confusing. The takeaway here is that Java 8 streams are extremely powerful and versatile.

Feel free to check out the entire example on GitHub.
