Project

ScraperBot

Multi-threaded web scraper with Discord integration for real-time marketplace monitoring

Overview

A JavaFX-based web scraping application that monitors multiple online marketplaces for new listings matching user-defined keywords. Features Selenium WebDriver automation, concurrent task execution, and Discord notifications.

Highlights

  • Multi-site concurrent scraping with Selenium WebDriver and Firefox GeckoDriver automation
  • Discord bot integration with JDA for real-time listing notifications with embedded images and links
  • JavaFX GUI with task management, headless mode, exact search filtering, and turbo mode
  • Custom hash table autocompleter for efficient listing search and filtering
  • Dashboard with price filtering, sorting options, and grid/list view with image loading

Tech Stack

Java, JavaFX, Selenium WebDriver, JDA (Discord API), Multithreading, Web Scraping, HTML Parsing, GUI

System Architecture

  • BaseTaskDriver: Abstract base class implementing core scraping logic with WebDriver setup, item processing, and thread management
  • Site-specific drivers: eBayDriver, FacebookUSDriver, MercariJPDriver, RakumaDriver, YahooDriver, PayPayDriver, JoongnaDriver, HardOffDriver, MercariUSDriver
  • Controller: JavaFX controller managing GUI interactions, task lifecycle, and dashboard filtering
  • DiscordBot: Singleton JDA-based Discord integration sending embedded notifications with site-specific branding
  • Task: Model class representing scraping tasks with title, site, keywords, refresh interval, and exact search settings
  • FileIO: Utility class for persistent storage of tasks, settings, and listing data
  • HashTable: Custom AutoCompleter implementation using HashSet for efficient prefix matching
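
One way such an autocompleter could combine hash-based exact lookup with prefix matching is sketched below. This is an illustration only: the class name PrefixIndex, its methods, and the choice to index every prefix in a HashMap are assumptions, not the project's actual HashTable code.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; the project's HashTable class may be organized differently.
final class PrefixIndex {
    // Exact-match set: average O(1) membership checks.
    private final Set<String> titles = new HashSet<>();
    // Every prefix of every title maps to the titles that start with it.
    private final Map<String, Set<String>> byPrefix = new HashMap<>();

    void add(String title) {
        String key = title.toLowerCase(Locale.ROOT);
        if (!titles.add(key)) {
            return; // already indexed
        }
        for (int end = 1; end <= key.length(); end++) {
            byPrefix.computeIfAbsent(key.substring(0, end), p -> new TreeSet<>()).add(key);
        }
    }

    boolean contains(String title) {
        return titles.contains(title.toLowerCase(Locale.ROOT));
    }

    // Returns the indexed titles starting with the prefix, in sorted order.
    Set<String> complete(String prefix) {
        Set<String> hits = byPrefix.get(prefix.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.emptySet() : Collections.unmodifiableSet(hits);
    }
}
```

Indexing every prefix trades memory for lookup speed: a completion query is a single hash lookup, but each title of length n adds n map entries.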

Task Management Interface

The tasks page displays all configured scraping tasks with their status, site, keywords, and refresh interval. Each task shows a real-time status indicator (green for running, red for stopped) and provides Start, Stop, Edit, and Delete buttons for task control. The interface supports batch operations including Start All Tasks and End All Tasks (force-killing browser processes). Users can edit existing tasks to modify keywords, refresh rates, or switch between sites. The task list persists to file and automatically reloads on application restart.

Task management interface showing multiple scraping tasks with status indicators

Site Selection

Users select from 9 supported marketplaces plus an 'All Sites' option that spawns concurrent drivers for all platforms. Each site button displays the platform's logo for quick visual identification. The interface validates that custom links match the selected site's domain structure. Supported sites include: Mercari JP (buyee.jp/mercari), Rakuma (buyee.jp/rakuma), Yahoo Japan Auction (buyee.jp/yahoo), PayPay Flea Market (buyee.jp/paypayfleamarket), eBay (ebay.com), Facebook Marketplace (facebook.com/marketplace), Mercari US (mercari.com), Joonggonara (web.joongna.com), and HardOFF (netmall.hardoff.co.jp).

Site selection grid showing supported marketplace platforms

Search Configuration

Task configuration interface with keyword input, refresh settings, and search options

After selecting a site, users configure search parameters including task title (for organization), comma-separated keywords, optional direct link (for pre-filtered searches), and refresh interval in milliseconds. The exact search checkbox (for eBay) wraps keywords in quotes for phrase matching. Headless mode runs browsers without visible windows for reduced resource usage. Turbo Mode allows rapid task creation without returning to the main menu. Link validation ensures URLs match the selected site's domain. The confirm button creates the task and adds it to the persistent task list.
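
The keyword handling and link validation described above might look roughly like the following sketch. The class SearchConfig, its method names, and the host/path-prefix check are hypothetical; the project's actual validation rules are not shown in this write-up.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; names and rules are assumptions for illustration.
final class SearchConfig {
    // Splits "a, b ,c" into trimmed, non-empty keywords.
    // With exact search on, each keyword is wrapped in quotes for phrase matching.
    static List<String> parseKeywords(String raw, boolean exactSearch) {
        List<String> keywords = new ArrayList<>();
        for (String part : raw.split(",")) {
            String keyword = part.trim();
            if (keyword.isEmpty()) {
                continue;
            }
            keywords.add(exactSearch ? "\"" + keyword + "\"" : keyword);
        }
        return keywords;
    }

    // Accepts a link only if its host is the expected domain (or a subdomain)
    // and its path starts with the expected prefix, e.g. "buyee.jp" + "/mercari".
    static boolean linkMatchesSite(String link, String expectedHost, String expectedPathPrefix) {
        try {
            URI uri = new URI(link.trim());
            String host = uri.getHost();
            if (host == null) {
                return false;
            }
            host = host.toLowerCase(Locale.ROOT);
            boolean hostOk = host.equals(expectedHost) || host.endsWith("." + expectedHost);
            String path = uri.getPath() == null ? "" : uri.getPath();
            return hostOk && path.startsWith(expectedPathPrefix);
        } catch (URISyntaxException e) {
            return false; // not a parseable URL
        }
    }
}
```

For example, SearchConfig.linkMatchesSite("https://www.ebay.com/sch/i.html", "ebay.com", "/") accepts the link, while the same link checked against "buyee.jp" and "/mercari" is rejected.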

Dashboard & Filtering

The dashboard provides comprehensive listing management with multi-criteria filtering: site selection (All or specific marketplace), title search with partial matching, min/max price range filtering, and sorting options (price lowest/highest, time newest/oldest, alphabetical A-Z/Z-A). The grid view displays up to 256 listings with images, titles, prices, and 'View Listing' links that open in the default browser. The list view shows tabular data with site, title, price, and link columns. Images load asynchronously with WebP format support through TwelveMonkeys ImageIO. A status bar at the bottom tracks active Firefox and GeckoDriver process counts for monitoring system resource usage.
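
A minimal sketch of the multi-criteria filter, written with Java streams, is shown here. The Listing model, the comparator constants, and the apply method are hypothetical names chosen for the example; the real controller may filter differently.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical sketch of the dashboard filter; not the project's Controller code.
final class DashboardFilter {
    // Assumed listing model with only the fields the filter needs.
    static final class Listing {
        final String site, title, link;
        final double price;

        Listing(String site, String title, double price, String link) {
            this.site = site;
            this.title = title;
            this.price = price;
            this.link = link;
        }
    }

    static final Comparator<Listing> PRICE_LOWEST =
            Comparator.comparingDouble((Listing l) -> l.price);
    static final Comparator<Listing> PRICE_HIGHEST = PRICE_LOWEST.reversed();
    static final Comparator<Listing> TITLE_A_TO_Z =
            Comparator.comparing((Listing l) -> l.title, String.CASE_INSENSITIVE_ORDER);

    // Applies site, partial-title, and price-range filters, then sorts and caps the result.
    static List<Listing> apply(List<Listing> all, String site, String titleQuery,
                               double minPrice, double maxPrice,
                               Comparator<Listing> order, int limit) {
        String query = titleQuery.toLowerCase(Locale.ROOT);
        return all.stream()
                .filter(l -> site.equals("All") || l.site.equals(site))
                .filter(l -> l.title.toLowerCase(Locale.ROOT).contains(query))
                .filter(l -> l.price >= minPrice && l.price <= maxPrice)
                .sorted(order)
                .limit(limit)
                .collect(Collectors.toList());
    }
}
```

Passing 256 as the limit would mirror the grid view's cap; an empty title query matches every listing.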

Dashboard showing filtered listings with grid layout and search controls

Web Scraping Implementation

  • Selenium WebDriver with Firefox GeckoDriver for browser automation and headless operation
  • Dynamic element waiting with WebDriverWait and ExpectedConditions for asynchronous page loads
  • XPath and CSS selector-based element extraction for titles, prices, images, and links
  • Iframe detection and switching for sites using embedded search result frames
  • Duplicate detection using HashSet to track seen item IDs and prevent redundant notifications
  • Thread.sleep-based refresh intervals with configurable timing per task
  • Process cleanup handling for graceful shutdown and force-killing stuck browser instances
  • Link validation and ID extraction using regex patterns specific to each marketplace
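
The ID extraction and duplicate detection steps above can be sketched as follows. The class SeenTracker is hypothetical, and the regex assumes eBay item links of the form /itm/<digits>; each marketplace would need its own pattern.

```java
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the pattern below is an assumption about eBay link shapes.
final class SeenTracker {
    private static final Pattern EBAY_ITEM_ID = Pattern.compile("/itm/(\\d+)");

    // IDs already reported, so the same listing is not notified twice.
    private final Set<String> seenIds = new HashSet<>();

    static Optional<String> extractItemId(String link) {
        Matcher m = EBAY_ITEM_ID.matcher(link);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Returns true only the first time a link's item ID is seen.
    // Links without a recognizable ID are treated as not new.
    boolean isNew(String link) {
        return extractItemId(link).map(seenIds::add).orElse(false);
    }
}
```

Keying on the extracted ID rather than the full URL means the same item is still recognized when tracking parameters in the query string change between refreshes.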

Discord Integration

The DiscordBot class implements a singleton pattern with JDA (Java Discord API) for real-time notifications. When new listings are detected, the bot sends embedded messages with site-specific branding: platform logo as author icon, title, price, clickable 'View Listing' link, thumbnail image, and color-coded embeds (eBay green, Facebook blue, Mercari red, etc.). The bot initializes on application startup using token and channel ID from persistent settings. Notifications only send for items not previously seen, ensuring users receive alerts only for genuinely new listings. The integration supports Guild Messages and Message Content gateway intents.

Concurrency & Performance

  • Multi-threaded task execution using Java Thread API with ConcurrentHashMap for tracking running threads
  • Independent WebDriver instances per task to prevent cross-contamination and blocking
  • 'All Sites' mode spawns 9 concurrent threads, one per marketplace, for parallel scraping
  • Graceful shutdown with thread interruption, driver.quit(), and thread.join() synchronization
  • Status bar updates every 5 seconds using ScheduledExecutorService for non-blocking UI refresh
  • Asynchronous image loading that decodes to BufferedImage and converts to JavaFX images with SwingFXUtils
  • HashSet-based duplicate detection with average-case O(1) lookup for efficient listing tracking
  • Process count monitoring using WMIC queries to track Firefox and GeckoDriver instances
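
The thread lifecycle described in this list (a tracked thread per task, a sleep-based refresh loop, interruption, and join on shutdown) could be sketched as below. TaskRunner and its methods are hypothetical, and the scrape step is a plain Runnable standing in for a real WebDriver-backed driver.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of task lifecycle management; not the project's Controller code.
final class TaskRunner {
    // Task title -> worker thread, safe to read and update from the UI thread and workers.
    private final Map<String, Thread> running = new ConcurrentHashMap<>();

    // Starts a loop that scrapes once, then sleeps for the refresh interval, until interrupted.
    boolean start(String title, Runnable scrapeOnce, long refreshMillis) {
        Thread worker = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    scrapeOnce.run();
                    Thread.sleep(refreshMillis);
                }
            } catch (InterruptedException stopRequested) {
                // sleep() throws when stop() interrupts the thread; fall through to cleanup.
            } finally {
                // A real driver would release its browser here, e.g. driver.quit().
            }
        }, "task-" + title);
        worker.setDaemon(true);
        if (running.putIfAbsent(title, worker) != null) {
            return false; // a task with this title is already running
        }
        worker.start();
        return true;
    }

    // Interrupts the worker and waits up to joinMillis for it to finish.
    boolean stop(String title, long joinMillis) {
        Thread worker = running.remove(title);
        if (worker == null) {
            return false;
        }
        worker.interrupt();
        try {
            worker.join(joinMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
        }
        return !worker.isAlive();
    }

    boolean isRunning(String title) {
        return running.containsKey(title);
    }
}
```

Using putIfAbsent makes the "already running" check and the registration a single atomic step, which matters if Start is clicked twice in quick succession.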

Data Persistence

The FileIO utility class manages persistent storage for three data types: tasks (title, site, keywords, link, refresh, status, exactSearch), settings (Discord token, channel ID, headless mode), and listings (site, title, price, link, imageURL). All data uses pipe-delimited text format for human readability and easy debugging. The allListingsInfo.txt file accumulates all scraped listings across sessions, enabling historical search and trend analysis. Directory creation with mkdirs() ensures file paths exist before writing. Tasks and settings reload on application restart, preserving user configuration between sessions.
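
As an illustration of the pipe-delimited format, a task line could be written and parsed along these lines. The TaskRecord class is hypothetical; the field order follows the list above, while the field types (a long for the refresh interval, a string for status) are assumptions.

```java
// Hypothetical sketch of one task line in the pipe-delimited file; not the project's FileIO code.
final class TaskRecord {
    static final int FIELD_COUNT = 7;

    final String title, site, keywords, link, status;
    final long refreshMillis;
    final boolean exactSearch;

    TaskRecord(String title, String site, String keywords, String link,
               long refreshMillis, String status, boolean exactSearch) {
        this.title = title;
        this.site = site;
        this.keywords = keywords;
        this.link = link;
        this.refreshMillis = refreshMillis;
        this.status = status;
        this.exactSearch = exactSearch;
    }

    // title|site|keywords|link|refresh|status|exactSearch
    String toLine() {
        return String.join("|", title, site, keywords, link,
                Long.toString(refreshMillis), status, Boolean.toString(exactSearch));
    }

    static TaskRecord fromLine(String line) {
        // The -1 limit keeps empty fields, such as a task saved without a direct link.
        String[] f = line.split("\\|", -1);
        if (f.length != FIELD_COUNT) {
            throw new IllegalArgumentException("expected " + FIELD_COUNT + " fields, got " + f.length);
        }
        return new TaskRecord(f[0], f[1], f[2], f[3],
                Long.parseLong(f[4]), f[5], Boolean.parseBoolean(f[6]));
    }
}
```

One caveat of this format: a field that itself contains a pipe character would shift the columns, so a real implementation would need to reject or escape it.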

Key Features & Design Decisions

  • Abstract BaseTaskDriver enables easy addition of new marketplace sites with minimal code duplication
  • Site-specific drivers implement extractItemLink(), extractItemId(), extractImageUrl() for platform-specific HTML parsing
  • Exact search filtering (titleContainsAllWords) supports multi-word queries requiring all terms to match
  • Turbo Mode accelerates task creation for power users setting up multiple similar tasks
  • Custom window controls (close, minimize, maximize) with draggable toolbar for frameless JavaFX window
  • Grid layout with 4 columns and dynamic row allocation for responsive listing display
  • Status text updates show real-time process counts for debugging stuck drivers and resource leaks
  • Clear Dashboard button wipes allListingsInfo.txt for fresh data collection
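
The abstract-driver design and the all-words title check can be illustrated with a small template-method sketch. Everything here is hypothetical: DriverSketch stands in for BaseTaskDriver, the hook signatures take plain strings (the real ones likely work on Selenium elements), and ToySiteDriver targets a made-up link format.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for BaseTaskDriver: shared logic here, site-specific hooks in subclasses.
abstract class DriverSketch {
    // Assumed signatures for illustration only.
    protected abstract Optional<String> extractItemLink(String cardHtml);
    protected abstract Optional<String> extractItemId(String itemLink);

    // True when every whitespace-separated word of the keyword occurs in the title.
    static boolean titleContainsAllWords(String title, String keyword) {
        String haystack = title.toLowerCase(Locale.ROOT);
        for (String word : keyword.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!word.isEmpty() && !haystack.contains(word)) {
                return false;
            }
        }
        return true;
    }

    // Shared pipeline: optional exact-search filter, then link and ID extraction.
    final Optional<String> matchItemId(String title, String cardHtml,
                                       String keyword, boolean exactSearch) {
        if (exactSearch && !titleContainsAllWords(title, keyword)) {
            return Optional.empty();
        }
        return extractItemLink(cardHtml).flatMap(this::extractItemId);
    }
}

// Toy subclass for a made-up site whose item links look like /item/<id>.
final class ToySiteDriver extends DriverSketch {
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");
    private static final Pattern ITEM_ID = Pattern.compile("/item/(\\w+)");

    @Override
    protected Optional<String> extractItemLink(String cardHtml) {
        return firstGroup(HREF, cardHtml);
    }

    @Override
    protected Optional<String> extractItemId(String itemLink) {
        return firstGroup(ITEM_ID, itemLink);
    }

    private static Optional<String> firstGroup(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

Under this layout, supporting a new marketplace means writing only the two extraction hooks; the filtering and pipeline order stay in the base class.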

Challenges & Solutions

  • Implemented iframe detection and switching for the Buyee proxy sites that wrap Japanese marketplaces
  • Added timeout handling for slow-loading pages with 30-second WebDriverWait limits
  • Solved image loading issues by adding TwelveMonkeys ImageIO for WebP support
  • Implemented process monitoring to detect stuck browser instances, with End All Tasks as an emergency cleanup
  • Avoided ConcurrentModificationException by using ConcurrentHashMap and iterating over copies of HashSets
  • Added link validation to prevent user errors when pasting incompatible URLs
  • Implemented graceful thread interruption with a 5-second delay before stopping tasks, so in-progress page loads can complete

Results & Impact

ScraperBot monitors 9 online marketplaces simultaneously and sends real-time Discord notifications. The project demonstrates proficiency in JavaFX GUI design, Selenium automation, multithreaded programming, and external API integration. The autocompleter HashTable implementation shows data structure knowledge, with average-case O(1) exact matching and efficient prefix search. Each task runs its own WebDriver instance, so concurrent scrapes do not contaminate one another, which reflects careful resource management and thread safety. Persistent storage ensures tasks and settings survive application restarts. The dashboard filtering system enables quick identification of relevant listings from thousands of entries. Discord integration provides instant mobile notifications for time-sensitive deals.