XML / HTML tables / Web scraping (static HTML)

Data import is a surprisingly large part of any data project. Below is a compact but practical reference of common file types, recommended packages/functions, example code, and real-world tips (encoding, memory, speed, cloud, DBs, troubleshooting).

Quick workspace prep (repeat — but important)

# Check working dir
getwd()

# Set working dir (if you must)
setwd("/path/to/project")

# Start with a clean environment (use with care)
rm(list = ls())

# Helpful: show files
list.files()

Base R vs modern tidyverse / high-performance packages

Base R (read.table, read.csv, read.delim) works everywhere and is flexible.
readr (read_csv, read_delim) — faster, consistent parsing, tidyverse-friendly.
data.table::fread — extremely fast for large CSV-like files.
readxl — best for Excel (no external dependencies).
haven — best for SAS/SPSS/Stata (via ReadStat).
jsonlite — modern JSON parsing.
DBI + specific DB backends — standard interface for relational DBs.
arrow — for Feather/Parquet and interop with Python/Spark; excellent for large data and columnar formats.

CSV / delimited text

Examples

# readr - tidy defaults, prints parsing problems
install.packages("readr")
library(readr)
df <- read_csv("data/sales.csv") # comma
df2 <- read_delim("data/data.txt", "\t") # tab separated

# data.table: blazing fast, automatic type detection
install.packages("data.table")
library(data.table)
dt <- fread("data/sales.csv")

# vroom: loads lazily and is fast for many files
install.packages("vroom")
library(vroom)
df_v <- vroom("data/sales.csv")

Useful arguments

locale = locale(encoding = "UTF-8") — handle encodings
col_types = cols(...) — force column types (avoid mis-parsing)
na = c("", "NA", "NULL") — specify missing tokens
n_max — read only first N rows (quick inspect)

Excel (XLS / XLSX)

Example

install.packages("readxl")
library(readxl)

# read first sheet
df <- read_excel("data/book.xlsx")

# read a specific sheet by name or index
df_sheet3 <- read_excel("data/book.xlsx", sheet = "Sales")
df_sheet2 <- read_excel("data/book.xlsx", sheet = 2)

# read a range
df_range <- read_excel("data/book.xlsx", range = "A1:F100")

JSON

XML / HTML tables / Web scraping (static HTML)

Excel-like & legacy formats (ODS, Google Sheets)

Google Sheets: googlesheets4 (supports OAuth & public sheets)
ODS: readODS package

# Google Sheets (best to follow auth flow)
install.packages("googlesheets4")
library(googlesheets4)
sheet <- read_sheet("https://docs.google.com/spreadsheets/...")

SAS / SPSS / Stata

MATLAB / Octave

R.matlab::readMat() for .mat files
foreign::read.octave() for Octave text data

install.packages("R.matlab")
library(R.matlab)
m <- readMat("data/matrix.mat")

Parquet, Feather, Arrow — modern columnar formats

Use arrow package to read/write Parquet and Feather; very efficient for large datasets and cross-language sharing.

install.packages("arrow")
library(arrow)

# read parquet
tbl <- read_parquet("data/data.parquet")

# write parquet
write_parquet(df, "out/data.parquet")

Benefit: Arrow allows out-of-memory, zero-copy reads, and is ideal for large-scale pipelines.

ODBC / Relational DBs (SQL Server, Postgres, MySQL, etc.)

Recommended approach

DBI + driver package (RPostgres, RMySQL, odbc) — standard interface
pool for connection pooling in apps

Example: Postgres via DBI

install.packages(c("DBI","RPostgres"))
library(DBI)
con <- dbConnect(RPostgres::Postgres(),
dbname = "mydb", host = "db.example.com",
port = 5432, user = "me", password = "pw")

# run a query
df <- dbGetQuery(con, "SELECT * FROM schema.table LIMIT 1000")

# fetch a big table in chunks
res <- dbSendQuery(con, "SELECT * FROM big_table")
chunk <- dbFetch(res, n = 10000)
# iterate until exhausted
dbClearResult(res)

dbDisconnect(con)

ODBC (Windows / DSN)

install.packages("odbc")
library(DBI)
con <- dbConnect(odbc::odbc(), DSN = "my_dsn", UID = "user", PWD = "pw")
dbListTables(con)
dbDisconnect(con)

Tip: Avoid SELECT * on huge tables — use column selection and SQL-side filtering.

Cloud storage (S3, GCS, Azure)

S3: aws.s3, arrow can read directly from S3 URIs
Google Cloud Storage: gcsfs (less common in R), or use arrow/gcs integration
Azure: AzureStor

Example with aws.s3:

install.packages("aws.s3")
library(aws.s3)
# must configure AWS credentials (env vars or ~/.aws/credentials)
s3read_using(FUN = read.csv, object = "s3://my-bucket/data.csv")

arrow::read_parquet("s3://bucket/path/file.parquet") also works when configured correctly.

Reading in chunks / streaming large files

If data is too big to fit in memory:

Use database imports (load into DB and query subsets)
Use data.table::fread() with select= to read only columns
Use vroom + dplyr for delayed reads
For complex chunking: readr::read_lines_chunked() or process via LaF or iotools packages

Example chunked processing with LaF:

install.packages("LaF")
library(LaF)
laf <- laf_open_csv("big.csv", column_types = c("integer","double","string"), nrows = 1e8)
block <- laf[1:100000, ] # read first 100k rows

Parsing dates, times, and types

Parsing types properly avoids nasty surprises later.

Use col_types in readr or fread's colClasses
Use lubridate for robust date parsing

install.packages("lubridate")
library(lubridate)

# convert common formats
df$date <- ymd(df$date_string) # "2023-07-01"
df$datetime <- ymd_hms(df$ts) # "2023-07-01 08:45:00"

Encoding & locale issues

If you see weird characters:

# readr
df <- read_csv("file.csv", locale = locale(encoding = "latin1"))

# base read.table
df <- read.table("file.txt", fileEncoding = "UTF-8")

Check Encoding() and iconv() to convert.

Common import pitfalls and troubleshooting

Headers misaligned / comments: use skip = n, comment = "#", or skip_empty_rows = TRUE.
Inconsistent column types: pre-specify column types (using col_types or colClasses) to avoid type coercion.
NA tokens: set na = c("", "NA", "n/a").
Thousands separators / decimal: specify locale = locale(decimal_mark = ",", grouping_mark = ".").
Memory errors: filter on SQL side, load a subset, or use arrow/Parquet to read columns only.

Quick hacks & productivity tips

Clipboard: quick copy-paste into R (Windows/Mac):

df <- read.table("clipboard", header = TRUE, sep = "\t")

Peek first rows before full import:

readr::read_lines("file.csv", n_max = 10)

Inspect parsing problems with readr::problems() after read_csv().
Use col_types = cols() to force types and prevent guessing mistakes.
Use projects & relative paths (RStudio projects) to avoid brittle setwd().

Recommended import pipeline (small project)

Inspect the file (read_lines, head, file.info()).
Try a small sample read (n_max, skip).
Specify explicit col_types or colClasses.
Load with a high-performance reader (fread/vroom/arrow) for big data.
Convert dates and factor levels with lubridate and forcats.
Persist as Parquet (arrow) if reusing: arrow::write_parquet(df, "data.parquet").

Example: end-to-end — CSV → cleaned parquet

library(data.table)
library(lubridate)
library(arrow)

# fast read
dt <- fread("raw/sales.csv", na.strings = c("", "NA"))

# clean types
dt[, order_date := ymd(order_date)]
dt[, amount := as.numeric(amount)]

# sample check
head(dt)

# persist for faster reads later
write_parquet(dt, "clean/sales.parquet")

Useful packages summary (short cheat-sheet)

TaskPackage / Function

Fast CSV read

data.table::fread()

Tidy read

readr::read_csv()

Excel

readxl::read_excel()

JSON

jsonlite::fromJSON()

SPSS/Stata/SAS

haven::read_sav(), read_dta(), read_sas()

Parquet / Arrow

arrow::read_parquet() / write_parquet()

DB connections

DBI + RPostgres / odbc

aws.s3, arrow

Google Sheets

googlesheets4

HTML tables

rvest::html_table()

Chunked reading

LaF, readr::read_lines_chunked()

Labelled variables

labelled

Final notes & next steps

Choose the right tool for your dataset size: fread/vroom/arrow for big files; readr/readxl for comfort and parsing diagnostics.
Preserve raw files and create a reproducible import script (save col_types, date parsing logic).
For production pipelines, prefer Parquet/Arrow or a DB as canonical storage (fast, typed, efficient).
If you want, I can convert any of your examples above into a runnable RMarkdown notebook with sample files and tests.

At Perceptive Analytics, our mission is “to enable businesses to unlock value in data.” For over 20 years, we’ve partnered with more than 100 clients—from Fortune 500 companies to mid-sized firms—to solve complex data analytics challenges. Our services include tableau consultancy and advanced analytics consulting, turning data into strategic insight. We would love to talk to you. Do reach out to us.

XML / HTML tables / Web scraping (static HTML)

Quick workspace prep (repeat — but important)

Base R vs modern tidyverse / high-performance packages

CSV / delimited text

Recommended

Examples

Useful arguments

Excel (XLS / XLSX)

Recommended

Example

JSON

Recommended

XML / HTML tables / Web scraping (static HTML)

Recommended

Excel-like & legacy formats (ODS, Google Sheets)

SAS / SPSS / Stata

Recommended

MATLAB / Octave

Parquet, Feather, Arrow — modern columnar formats

ODBC / Relational DBs (SQL Server, Postgres, MySQL, etc.)

Recommended approach

Example: Postgres via DBI

ODBC (Windows / DSN)

Cloud storage (S3, GCS, Azure)

Reading in chunks / streaming large files

Parsing dates, times, and types

Encoding & locale issues

Common import pitfalls and troubleshooting

Quick hacks & productivity tips

Recommended import pipeline (small project)

Example: end-to-end — CSV → cleaned parquet

Useful packages summary (short cheat-sheet)

Final notes & next steps

Comments

Post a Comment

Popular posts from this blog

Scaling Analytics in the Cloud

Understanding the Curse of Dimensionality

Imputation Techniques for Missing Data in R