Skip to content

Commit b9bc180

Browse files
authored
Merge pull request #582 from scholarly-python-package/codex/analyze-and-improve-codebase
Add codebase improvement roadmap and fix brittle test assertions
2 parents 47861a3 + febc925 commit b9bc180

2 files changed

Lines changed: 103 additions & 3 deletions

File tree

docs/codebase_improvements.md

Lines changed: 100 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,100 @@
1+
# Codebase improvement plan
2+
3+
This document summarizes high-impact opportunities to improve maintainability, reliability, and release quality for `scholarly`.
4+
5+
## 1) Modernize packaging metadata (high impact, low-medium effort)
6+
7+
### Current state
8+
- `pyproject.toml` only declares the build backend, while package metadata and dependencies are still maintained in `setup.py`.
9+
- `setup.py` still marks the package as `Development Status :: 3 - Alpha`.
10+
- Test configuration uses a legacy `test_suite` setting.
11+
12+
### Why improve
13+
- Keeping metadata in a single source of truth (`pyproject.toml` under PEP 621) reduces drift between tools.
14+
- A modern packaging layout improves compatibility with build tooling and dependency scanners.
15+
16+
### Recommended actions
17+
1. Move project metadata and dependencies from `setup.py` into `pyproject.toml` (`[project]`, `[project.optional-dependencies]`).
18+
2. Keep `setup.py` as a thin compatibility shim (or remove when downstream tooling allows).
19+
3. Add tool config blocks for formatter/linter/test runner (`[tool.pytest.ini_options]`, etc.) if adopted.
20+
4. Revisit package classifier maturity to match actual support expectations.
21+
22+
---
23+
24+
## 2) Split online integration tests from deterministic unit tests (high impact, medium effort)
25+
26+
### Current state
27+
- `test_module.py` performs many live requests against Google Scholar and free-proxy providers.
28+
- The default setup path calls `FreeProxies()` and can fail in CI/restricted networks with proxy errors.
29+
30+
### Why improve
31+
- Network and anti-bot behavior make these tests flaky and environment-dependent.
32+
- Contributors cannot reliably run tests offline.
33+
34+
### Recommended actions
35+
1. Introduce explicit test markers/categories:
36+
- `unit` (pure parsing/logic, deterministic)
37+
- `integration` (live network)
38+
2. Move parsing assertions to fixture-based tests with stored HTML snippets.
39+
3. Gate live tests behind an opt-in env var (for example `SCHOLARLY_RUN_LIVE_TESTS=1`) and skip by default.
40+
4. Add CI matrix stages for quick deterministic checks vs scheduled/optional live checks.
41+
42+
---
43+
44+
## 3) Tighten exception boundaries and diagnostics around proxy/navigation (high impact, medium effort)
45+
46+
### Current state
47+
- Navigation/proxy code contains many broad `except Exception` handlers.
48+
- Retries are spread across navigator and proxy layers with mixed logging detail.
49+
50+
### Why improve
51+
- Broad exception handling can hide root causes.
52+
- Harder to reason about retry policy and failure modes across multiple connection paths.
53+
54+
### Recommended actions
55+
1. Replace generic catches with specific exception classes where feasible (`requests`, `httpx`, Selenium, parsing errors).
56+
2. Standardize retry policy in one place (max attempts, backoff, timeout growth).
57+
3. Return/raise structured errors that preserve context (request URL, proxy mode, status code).
58+
4. Add focused unit tests for retry/fallback paths with mocked sessions.
59+
60+
---
61+
62+
## 4) Improve API typing and static analysis coverage (medium impact, medium effort)
63+
64+
### Current state
65+
- Some modules have selective type hints; many public methods still have partial/no type coverage.
66+
- Dynamic dictionary payloads are common and can be error-prone.
67+
68+
### Why improve
69+
- Better typing helps contributors navigate the code and reduces regressions.
70+
- Static analysis can catch interface mismatches early.
71+
72+
### Recommended actions
73+
1. Incrementally type annotate public methods in `_scholarly.py`, `_navigator.py`, and parsers.
74+
2. Introduce `TypedDict`/dataclass models for frequently used payload shapes.
75+
3. Add `mypy` (or pyright) in non-blocking mode first, then tighten gradually.
76+
77+
---
78+
79+
## 5) Documentation and contributor experience upgrades (medium impact, low effort)
80+
81+
### Current state
82+
- README and docs are comprehensive but test execution guidance assumes network-dependent flows.
83+
- Contributor guidance could more clearly separate quick local checks from full integration validation.
84+
85+
### Why improve
86+
- Faster onboarding and fewer false-negative test runs for external contributors.
87+
88+
### Recommended actions
89+
1. Add a short “Local development quick checks” section with deterministic commands.
90+
2. Document required environment variables/services for each integration test category.
91+
3. Add troubleshooting notes for common proxy/network failures.
92+
93+
---
94+
95+
## Quick wins already applied in this branch
96+
97+
- Replaced two identity assertions (`assertIs(..., 0)`) with value assertions (`assertEqual(..., 0)`) in tests.
98+
- Fixed a typo in a test docstring (`fiile` -> `file`).
99+
100+
These are small quality improvements that reduce brittle test style and clean up minor docs noise.

test_module.py

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -100,7 +100,7 @@ def setUpClass(cls):
100100
@classmethod
101101
def tearDownClass(cls):
102102
"""
103-
Clean up the mandates csv fiile downloaded.
103+
Clean up the mandates csv file downloaded.
104104
"""
105105
if os.path.exists(cls.mandates_filename):
106106
os.remove(cls.mandates_filename)
@@ -121,7 +121,7 @@ def test_search_author_empty_author(self):
121121
Test that sholarly.search_author('') returns no authors
122122
"""
123123
authors = [a for a in scholarly.search_author('')]
124-
self.assertIs(len(authors), 0)
124+
self.assertEqual(len(authors), 0)
125125

126126
def test_search_keywords(self):
127127
query = scholarly.search_keywords(['crowdsourcing', 'privacy'])
@@ -653,7 +653,7 @@ def test_search_pubs_empty_publication(self):
653653
Test that searching for an empty publication returns zero results
654654
"""
655655
pubs = [p for p in scholarly.search_pubs('')]
656-
self.assertIs(len(pubs), 0)
656+
self.assertEqual(len(pubs), 0)
657657

658658
def test_search_pubs_citedby(self):
659659
"""

0 commit comments

Comments
 (0)