by Patrick
« First « Previous Comments 292 - 331 of 543 Next » Last » Search these comments
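// npm install tesseract.js
// FYI: I used Node 18.12.0
const Tesseract = require('tesseract.js')

const filename = 'npr.png'

Tesseract.recognize(filename)
  .then(function (result) {
    console.log(result)            // full result object
    console.log(result.data.text)  // just the recognized text
    process.exit(0)
  })
  .catch(function (err) {
    // Handle errors last so a failure does not fall through to the success handler
    console.error(err)
    process.exit(1)
  })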
m K@su L SIGNIN i NPR SHOP
[E] NEWS X CULTURE J MUSIC () PODCASTS & SHOWS Q SEARCH >
TECHNOLOGY
4] . .
Elon Musk said Twitter wouldn't become
v a 'hellscape.’ It's already changing
E October 31}, 2022 - 4:'?9 PMET
‘v‘ gs Considered
fl SHANNON BOND
CREATE TABLE image_words (
    post_id INT,
    file_name VARCHAR(255) NOT NULL,
    words VARCHAR(20000),
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL,
    most_recent_ocr_attempt DATETIME,
    most_recent_ocr_success DATETIME,
    FULLTEXT KEY (words),
    -- I'm not sure which of the following indexes would work best w/ mysql for the query, but one of them should work well I think.
    INDEX dates1 (most_recent_ocr_success, most_recent_ocr_attempt, created_at),
    INDEX dates2 (most_recent_ocr_attempt, created_at)
);
-- Have a background worker run this query and process the results every X minutes.
-- Find rows that have not been successfully processed yet.
-- If an attempt was made on a row, but it failed, we try to process it again but not for at least a week.
-- The reason we process newest first is a similar reason for why we wait 1 week before reattempting
-- a failed row - this query+table is basically a queue, and we want to make sure we don't eventually clog the head of the queue up
-- with stuff that keeps failing over and over, which might prevent the worker from ever consuming fresh work that it
-- can succeed with. Let it try the new stuff first, and then if it has spare time, it can reattempt failed stuff.
select *
from image_words
where most_recent_ocr_success is null
  and (
    most_recent_ocr_attempt < CURRENT_TIMESTAMP - interval 1 week
    or most_recent_ocr_attempt is null
  )
order by created_at desc
limit 100
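Here is a rough sketch of what one pass of that background worker might look like. The four helper functions are hypothetical and passed in by the caller (they are not part of tesseract.js or any MySQL driver): `fetchPending` is assumed to run the query above, `recognizeText` to wrap the OCR call, and `markAttempt` / `markSuccess` to update the two timestamp columns.

```javascript
// Sketch of one worker pass over the image_words queue.
// All four dependencies are hypothetical and injected by the caller:
//   fetchPending()               -> Promise<Array<{ file_name }>>, runs the query above
//   recognizeText(fileName)      -> Promise<string>, e.g. wraps Tesseract.recognize
//   markAttempt(fileName)        -> Promise, sets most_recent_ocr_attempt to now
//   markSuccess(fileName, words) -> Promise, stores words and sets most_recent_ocr_success
async function processBatch({ fetchPending, recognizeText, markAttempt, markSuccess }) {
  const rows = await fetchPending()
  let succeeded = 0
  let failed = 0

  for (const row of rows) {
    // Record the attempt first, so a row that fails is skipped for the next week.
    await markAttempt(row.file_name)
    try {
      const words = await recognizeText(row.file_name)
      await markSuccess(row.file_name, words)
      succeeded += 1
    } catch (err) {
      // most_recent_ocr_success stays null, so the query picks the row up again later.
      console.error(`OCR failed for ${row.file_name}:`, err.message)
      failed += 1
    }
  }
  return { succeeded, failed }
}
```

Calling this from a timer every X minutes would give the behavior described in the comments: new rows get tried first, and failed rows only come back after the one-week cooldown.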
I tried out Tesseract (an OCR library) today and it was easy to use. I was thinking maybe you could use it to make the text in images searchable. Tags would still fill the gap when an image doesn't contain any relevant text, but many images on here contain headlines, so it would be useful if that text were searchable.
npm ERR! gyp info spawn make
npm ERR! gyp info spawn args [ 'BUILDTYPE=Release', '-C', 'build' ]
npm ERR! ../src/tesseract_bindings.cc:7: warning: "BUILDING_NODE_EXTENSION" redefined
npm ERR! 7 | #define BUILDING_NODE_EXTENSION
npm ERR! |
npm ERR! : note: this is the location of the previous definition
npm ERR! In file included from ../src/tesseract_bindings.cc:9:
npm ERR! ../src/tesseract_baseapi.h:10:10: fatal error: baseapi.h: No such file or directory
npm ERR! 10 | #include <baseapi.h>
npm ERR! | ^~~~~~~~~~~
npm ERR! compilation terminated.
npm ERR! make: * [tesseract_bindings.target.mk:125: Release/obj.target/tesseract_bindings/src/tesseract_bindings.o] Error 1
npm ERR! gyp ERR! build error
npm ERR! gyp ERR! stack Error: `make` failed with exit code: 2
npm ERR! gyp ERR! stack at ChildProcess.onExit (/usr/lib/node_modules/npm/node_modules/node-gyp/lib/build.js:194:23)
npm ERR! gyp ERR! stack at ChildProcess.emit (node:events:390:28)
npm ERR! gyp ERR! stack at Process.ChildProcess._handle.onexit (node:internal/child_process:290:12)
npm ERR! gyp ERR! System Linux 5.10.0-14-amd64
npm ERR! gyp ERR! command "/usr/bin/node" "/usr/lib/node_modules/npm/node_modules/node-gyp/bin/node-gyp.js" "rebuild"
npm ERR! gyp ERR! cwd /home/patrick/webfam.net/server/node_modules/tesseract
npm ERR! gyp ERR! node -v v17.2.0
npm ERR! gyp ERR! node-gyp -v v8.4.0
npm ERR! gyp ERR! not ok
You can search for any of them by name, but you'd like some kind of search that lists them all at once, right?
Ballotpedia has a lot of the facts:
Yes a searchable section dedicated to politicians, by office and party.
Perhaps further down the road I would like to be able to make API calls to your Critter Catcher. You could become the go-to source for this kind of crowd-sourced repository, like the Craigslist, Wiki, or Facebook of who's who among elected officials and would-be candidates.
I have two suggestions:
1) I want the ability to tag a comment so I can refer back to it. Some things take days or weeks to research. I can bookmark them (and do), but it would be nice to have the ability to tag/untag a comment.
2) It would be nice to have a list of responses to what I've written. I know this is sent in email, but I think it would be better to have a pull-down list of the responses you need to take note of.
richwicks says
I have two suggestions:
1) I want the ability to tag a comment so I can refer back to it. Some things take days or weeks to research. I can bookmark them (and do), but it would be nice to have the ability to tag/untag a comment.
2) It would be nice to have a list of responses to what I've written. I know this is sent in email, but I think it would be better to have a pull-down list of the responses you need to take note of.
like favorites? or just searchable tags?
Yes a searchable section dedicated to politicians, by office and party.
Searchable tags would be nice, but I think that's overkill. Say you can tag a comment with a name, and search based on that.
Tenpoundbass says
Yes a searchable section dedicated to politicians, by office and party.
Tenpoundbass OK, how about this:
https://patrick.net/post/1377838/2022-11-29-us-congressmen-lists
richwicks OK, something like the existing "pinned" list for threads:
https://patrick.net/pinned?a=Patrick
But for comments instead, right?
It wouldn't say why you pinned a comment, but hopefully you'd remember why.
OK, it's kind of a good time for this, since I was planning to merge the database tables for original threads and comments so that search could be unified. Right now there are separate tables and it's a pain.
After they are unified, then there can be just one kind of search, and one kind of pinning.
I do have a reasonable daily backup system I think. Worst case, one day is lost. Eventually I hope to have all comments immediately mirrored to a different server, but that's time, money, etc.
For the db migration, I plan to take every thread, make the text of it a comment, and have that thread just be a kind of skeleton which just has a bit of metadata and points to the original post content in the comments table.
So everything is going to be a comment. A "thread" will just be a wrapper around the first comment, which is the original post.
The whole site should look and work pretty much the same, except that search and pinning will be just one thing instead of two.
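To make the "everything is a comment" idea concrete, here is a small in-memory sketch. The field names (`id`, `title`, `author`, `body`, `created_at`, `thread_id`, `first_comment_id`) are assumptions for illustration only, not the real schema, and the real migration would presumably run in SQL against the database rather than in JavaScript.

```javascript
// Sketch of the planned migration on in-memory rows.
// All field names here are hypothetical, not the site's real columns.
function migrateThreads(threads, comments) {
  // Continue comment ids after the highest existing one.
  let nextId = comments.reduce((max, c) => Math.max(max, c.id), 0) + 1
  const newComments = comments.map(c => ({ ...c }))
  const skeletons = []

  for (const thread of threads) {
    const firstComment = {
      id: nextId++,
      thread_id: thread.id,
      author: thread.author,
      body: thread.body,            // the original post text moves here
      created_at: thread.created_at,
    }
    newComments.push(firstComment)
    // The thread keeps only metadata plus a pointer to its first comment.
    skeletons.push({
      id: thread.id,
      title: thread.title,
      first_comment_id: firstComment.id,
    })
  }
  return { threads: skeletons, comments: newComments }
}

// With everything in one list, search is a single pass over comments.
function searchComments(comments, term) {
  const needle = term.toLowerCase()
  return comments.filter(c => c.body.toLowerCase().includes(needle))
}
```

The point of the sketch is the shape of the result: once the original post lives in the same list as the replies, one search function (and one kind of pinning) can cover both.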
Thanks, but I've got this, no problem.
patrick.net
An Antidote to Corporate Media