|
| 1 | +# Codebase improvement plan |
| 2 | + |
| 3 | +This document summarizes high-impact opportunities to improve maintainability, reliability, and release quality for `scholarly`. |
| 4 | + |
| 5 | +## 1) Modernize packaging metadata (high impact, low-medium effort) |
| 6 | + |
| 7 | +### Current state |
| 8 | +- `pyproject.toml` only declares the build backend, while package metadata and dependencies are still maintained in `setup.py`. |
| 9 | +- `setup.py` still marks the package as `Development Status :: 3 - Alpha`. |
| 10 | +- Test configuration uses a legacy `test_suite` setting. |
| 11 | + |
| 12 | +### Why improve |
| 13 | +- Keeping metadata in a single source of truth (`pyproject.toml` under PEP 621) reduces drift between tools. |
| 14 | +- A modern packaging layout improves compatibility with build tooling and dependency scanners. |
| 15 | + |
| 16 | +### Recommended actions |
| 17 | +1. Move project metadata and dependencies from `setup.py` into `pyproject.toml` (`[project]`, `[project.optional-dependencies]`). |
| 18 | +2. Keep `setup.py` as a thin compatibility shim (or remove when downstream tooling allows). |
| 19 | +3. Add tool config blocks for formatter/linter/test runner (`[tool.pytest.ini_options]`, etc.) if adopted. |
| 20 | +4. Revisit package classifier maturity to match actual support expectations. |
| 21 | + |
| 22 | +--- |
| 23 | + |
| 24 | +## 2) Split online integration tests from deterministic unit tests (high impact, medium effort) |
| 25 | + |
| 26 | +### Current state |
| 27 | +- `test_module.py` performs many live requests against Google Scholar and free-proxy providers. |
| 28 | +- The default setup path calls `FreeProxies()` and can fail in CI/restricted networks with proxy errors. |
| 29 | + |
| 30 | +### Why improve |
| 31 | +- Network and anti-bot behavior make these tests flaky and environment-dependent. |
| 32 | +- Contributors cannot reliably run tests offline. |
| 33 | + |
| 34 | +### Recommended actions |
| 35 | +1. Introduce explicit test markers/categories: |
| 36 | + - `unit` (pure parsing/logic, deterministic) |
| 37 | + - `integration` (live network) |
| 38 | +2. Move parsing assertions to fixture-based tests with stored HTML snippets. |
| 39 | +3. Gate live tests behind an opt-in env var (for example `SCHOLARLY_RUN_LIVE_TESTS=1`) and skip by default. |
| 40 | +4. Add CI matrix stages for quick deterministic checks vs scheduled/optional live checks. |
| 41 | + |
| 42 | +--- |
| 43 | + |
| 44 | +## 3) Tighten exception boundaries and diagnostics around proxy/navigation (high impact, medium effort) |
| 45 | + |
| 46 | +### Current state |
| 47 | +- Navigation/proxy code contains many broad `except Exception` handlers. |
| 48 | +- Retries are spread across navigator and proxy layers with mixed logging detail. |
| 49 | + |
| 50 | +### Why improve |
| 51 | +- Broad exception handling can hide root causes. |
| 52 | +- Harder to reason about retry policy and failure modes across multiple connection paths. |
| 53 | + |
| 54 | +### Recommended actions |
| 55 | +1. Replace generic catches with specific exception classes where feasible (`requests`, `httpx`, Selenium, parsing errors). |
| 56 | +2. Standardize retry policy in one place (max attempts, backoff, timeout growth). |
| 57 | +3. Return/raise structured errors that preserve context (request URL, proxy mode, status code). |
| 58 | +4. Add focused unit tests for retry/fallback paths with mocked sessions. |
| 59 | + |
| 60 | +--- |
| 61 | + |
| 62 | +## 4) Improve API typing and static analysis coverage (medium impact, medium effort) |
| 63 | + |
| 64 | +### Current state |
| 65 | +- Some modules have selective type hints; many public methods still have partial/no type coverage. |
| 66 | +- Dynamic dictionary payloads are common and can be error-prone. |
| 67 | + |
| 68 | +### Why improve |
| 69 | +- Better typing helps contributors navigate the code and reduces regressions. |
| 70 | +- Static analysis can catch interface mismatches early. |
| 71 | + |
| 72 | +### Recommended actions |
| 73 | +1. Incrementally type annotate public methods in `_scholarly.py`, `_navigator.py`, and parsers. |
| 74 | +2. Introduce `TypedDict`/dataclass models for frequently used payload shapes. |
| 75 | +3. Add `mypy` (or pyright) in non-blocking mode first, then tighten gradually. |
| 76 | + |
| 77 | +--- |
| 78 | + |
| 79 | +## 5) Documentation and contributor experience upgrades (medium impact, low effort) |
| 80 | + |
| 81 | +### Current state |
| 82 | +- README and docs are comprehensive but test execution guidance assumes network-dependent flows. |
| 83 | +- Contributor guidance could more clearly separate quick local checks from full integration validation. |
| 84 | + |
| 85 | +### Why improve |
| 86 | +- Faster onboarding and fewer false-negative test runs for external contributors. |
| 87 | + |
| 88 | +### Recommended actions |
| 89 | +1. Add a short “Local development quick checks” section with deterministic commands. |
| 90 | +2. Document required environment variables/services for each integration test category. |
| 91 | +3. Add troubleshooting notes for common proxy/network failures. |
| 92 | + |
| 93 | +--- |
| 94 | + |
| 95 | +## Quick wins already applied in this branch |
| 96 | + |
| 97 | +- Replaced two identity assertions (`assertIs(..., 0)`) with value assertions (`assertEqual(..., 0)`) in tests. |
| 98 | +- Fixed a typo in a test docstring (`fiile` -> `file`). |
| 99 | + |
| 100 | +These are small quality improvements that reduce brittle test style and clean up minor docs noise. |
0 commit comments