Install this module in every webmiddle application: it provides the machinery to parse JSX, access to the rootContext, and other utilities.


Keywords
web, data, integration, extraction, scraper, jsx, framework, data-extraction, jsx-components, modular, nodejs, web-scraping
License
MIT
Install
npm install webmiddle@0.5.1



webmiddle

Node.js framework for modular web scraping and data extraction

The building block of any webmiddle application is the JSX component.
Each component executes one task or controls the execution of other tasks by composing other components.

const FetchPageLinks = ({ url, query, name }) =>
  <Pipe>
    <HttpRequest contentType="text/html" url={url} />

    {rawHtml =>
      <HtmlToJson name={name} from={rawHtml} content={
        {
          anchors: $$.within("a", $$.pipe(
            $$.filter(el => el.text().toUpperCase().indexOf(query.toUpperCase()) !== -1),
            $$.map({
              url: $$.attr("href"),
              text: $$.getFirst()
            })
          ))
        }
      }/>
    }
  </Pipe>
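
To make the filter and map steps concrete, here is a plain JavaScript sketch of the same selection logic applied to an in-memory array instead of parsed HTML. It does not use webmiddle or the $$ helpers: the selectLinks function and the sample anchors are hypothetical and only illustrate the case-insensitive match on the link text.

```javascript
// Hypothetical stand-in for the anchors found in a page (not webmiddle output).
const sampleAnchors = [
  { href: "/docs/getting-started", text: "Getting Started" },
  { href: "/blog", text: "Blog" },
  { href: "/docs/api", text: "API docs" },
];

// Keep the anchors whose text contains `query`, ignoring case,
// then reshape each match into { url, text }.
function selectLinks(anchors, query) {
  const needle = query.toUpperCase();
  return anchors
    .filter(a => a.text.toUpperCase().indexOf(needle) !== -1)
    .map(a => ({ url: a.href, text: a.text }));
}

const links = selectLinks(sampleAnchors, "docs");
console.log(links);
```

Only the link text is matched, not the href, which is why "/docs/getting-started" is not selected here.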

The framework provides a set of core components for the most common operations, but there is no difference between a core component and a component that you may want to develop yourself.
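
The idea that a component either performs one task or composes other tasks can be illustrated without webmiddle. The sketch below is conceptual only and is not how the Pipe component is implemented: pipe, fetchText, countAnchors and fetchAndCount are hypothetical names, and the "fetched" HTML is a hard-coded string.

```javascript
// Conceptual sketch only: this is not webmiddle's Pipe implementation.
// Each task is an async function that receives the previous task's result.
async function pipe(tasks, initial) {
  let result = initial;
  for (const task of tasks) {
    result = await task(result);
  }
  return result;
}

// Two hypothetical single-purpose tasks.
const fetchText = async () => "<a href='/a'>First</a>";
const countAnchors = async html => (html.match(/<a\b/g) || []).length;

// A composing task built from the two above; callers treat it
// exactly like any other task.
const fetchAndCount = () => pipe([fetchText, countAnchors]);

fetchAndCount().then(count => console.log(count));
```

Because fetchAndCount has the same shape as the tasks it is built from, it could itself be passed to another pipe, which mirrors the point that composed and built-in components are interchangeable.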

Webmiddle applications can be quickly turned into REST APIs, allowing remote access via HTTP or WebSocket. Use webmiddle-devtools to run and debug your components and to test them remotely.

Features

Built-in features provided by the core components:

Core packages

webmiddle
webmiddle-manager-cookie
webmiddle-component-pipe
webmiddle-component-parallel
webmiddle-component-resume
webmiddle-component-http-request
webmiddle-component-browser
webmiddle-component-cheerio-to-json
webmiddle-component-jsonselect-to-json
webmiddle-server
webmiddle-client

Open source ecosystem

Create your own components and publish them to npm!

Reuse is one of the framework's core principles: components can be published as separate npm modules and then used in other projects.

NOTE: If you think that a component or feature is common and general enough to belong in the core, open an issue or submit a pull request!

Contributing

This is a monorepo, i.e. the core components and the main webmiddle package all live in this single repository.

It uses Yarn and Lerna for managing the monorepo, as you might have guessed from the lerna.json file.

Start by installing the root dependencies with:

yarn

Then install the dependencies of all the packages and link the packages together by running:

yarn run lerna bootstrap

Build all the packages by running:

yarn run build

To run the tests for all the packages at once and get coverage info, execute:

yarn run test

NOTE: make sure to build before running the tests.

NOTE: If you are on Windows, you might need to run the install and bootstrap commands as administrator.

Each package uses the same build / test system.

Once you are inside a package folder, you can build it by running yarn run build or yarn run build:watch (for rebuilding on every change).

Tests use AVA, so they can be written in modern JavaScript and they run concurrently. Run the tests with yarn run test. To rerun them on every change, use yarn run test:watch. The latter is highly recommended while developing, as it also produces much more detailed output.

To run the same npm command in all the packages, use lerna run, for example:

yarn run lerna run build

To run arbitrary commands, use lerna exec, for example:

yarn run lerna -- exec -- rm -rf ./node_modules

See Lerna commands for more info.